Benchmarks, Redux (part 9): BSBM With Cluster [ Orri Erling ]

This post is dedicated to our brothers in horizontal partitioning (or sharding), Garlik and Bigdata.

At first sight, the BSBM Explore mix appears very cluster-unfriendly, as it contains short queries that access data at random. There is every opportunity for latency and few opportunities for parallelism.

For this reason we had not even run the BSBM mix with Virtuoso Cluster. We were not surprised to learn that Garlik hadn't run BSBM either. We have understood from Systap that their Bigdata BSBM experiments were on a single-process configuration.

But the 4Store results in the recent Berlin report were produced with a distributed setup, as 4Store always runs a multiprocess configuration even on a single server, so it seemed interesting to see how Virtuoso Cluster compares with Virtuoso Single on this workload. These tests were run on a different box than the recent BSBM tests, so the 4Store figures there are not directly comparable.

The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its own HTTP and SQL listeners and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process that is coordinating the query. The interconnect uses Unix domain sockets, since all 8 processes are on the same box.

6 Cluster - Load Rates and Times

Scale      Rate (quads per second)   Load time (seconds)   Checkpoint time (seconds)
 100 Mt    119,204                     749                    89
 200 Mt    121,607                    1486                   157
1000 Mt    102,694                    8737                   979

6 Single - Load Rates and Times

Scale      Rate (quads per second)   Load time (seconds)   Checkpoint time (seconds)
 100 Mt     74,713                    1192                   145

The 6 Cluster load rates are systematically better than those of 6 Single. This is also not bad compared to the 7 Single vectored load rate of 220 Kt/s or so. We note that loading is a cluster-friendly operation, going at a steady 1400+% CPU utilization with an aggregate message throughput of 40 MB/s. 7 Single is faster because of vectoring at the index level, not because 6 Cluster was hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box.

Throughput is as follows:

6 Cluster - Throughput (QMpH, query mixes per hour)

Scale      Single User   16 User
 100 Mt       7318        43120
 200 Mt       6222        29981
1000 Mt       2526        11156

6 Single - Throughput (QMpH, query mixes per hour)

Scale      Single User   16 User
 100 Mt       7641        29433
 200 Mt       6017        13335
1000 Mt       1770         2487

Below is a snapshot of status during the 6 Cluster 100 Mt run.

Cluster 8 nodes, 15 s.
       25784 m/s  25682 KB/s  1160% cpu  0% read  740% clw  threads 18r 0w 10i  buffers 1133459  12 d  4 w  0 pfs
cl 1:  10851 m/s   3911 KB/s   597% cpu  0% read  668% clw  threads 17r 0w 10i  buffers  143992   4 d  0 w  0 pfs
cl 2:   2194 m/s   7959 KB/s   107% cpu  0% read    9% clw  threads  1r 0w  0i  buffers  143616   3 d  2 w  0 pfs
cl 3:   2186 m/s   7818 KB/s   107% cpu  0% read    9% clw  threads  0r 0w  0i  buffers  140787   0 d  0 w  0 pfs
cl 4:   2174 m/s   2804 KB/s    77% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  140654   0 d  2 w  0 pfs
cl 5:   2127 m/s   1612 KB/s    71% cpu  0% read    9% clw  threads  0r 0w  0i  buffers  140949   1 d  0 w  0 pfs
cl 6:   2060 m/s    544 KB/s    66% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  141295   2 d  0 w  0 pfs
cl 7:   2072 m/s    517 KB/s    65% cpu  0% read   11% clw  threads  0r 0w  0i  buffers  141111   1 d  0 w  0 pfs
cl 8:   2105 m/s    522 KB/s    66% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  141055   1 d  0 w  0 pfs

The main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes.
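
Such a snapshot can be taken from any isql session connected to one of the cluster processes; a minimal sketch, assuming the status() function's 'cluster' argument as in Virtuoso cluster builds:

SQL> status ('cluster');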

We note that CPU utilization is highly uneven and that messages are short, about 1K on average, compared to about 100K during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. We changed the test driver to round-robin requests between multiple endpoints. The work then gets evenly divided, but the speed is not affected. Nor does this improve the message sizes, since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, round-robin would be essential for CPU and especially for interconnect throughput.

Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu.

This is a lot better than the figures with 6 Single, due to lower contention on the index tree, as discussed in A Benchmarking Story. Also, Single User throughput on 6 Cluster outperforms 6 Single, due to the natural parallelism of doing the Q5 joins in parallel in each partition; the larger the scale, the more weight this has in the metric. We see this also in the average message size: compared with the 100 Mt snapshot above, the KB/s throughput is almost double while the messages/s is a bit under a third.

The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 on 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of 6 Single. This is as one might expect; longer queries are favored, and single-row lookups are penalized.

Looking further at the 6 Cluster status, we see that the cluster wait (clw) is 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions (740% out of a possible 1600% of thread time). A high figure means uneven distribution between partitions; a low figure means even distribution. This is as expected, since many queries are concerned with just one S and its related objects.

We will update this section once 7 Cluster is ready; it will implement vectored execution and the column store inside the cluster nodes.

Benchmarks, Redux Series

03/09/2011 17:54 GMT Modified: 03/14/2011 19:36 GMT
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs [ Orri Erling ]

In the context of database benchmarks we cannot ignore I/O, even though BSBM has pretty much ignored it so far.

There are two approaches:

  1. run twice or otherwise make sure one runs from memory and forget about I/O, or

  2. make rules and metrics for warm-up.

We will see if the second is possible with BSBM.

From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on a set of HDDs (hard disk drives) and a set of SSDs (solid-state drives). We will see that SSDs can make a significant difference in this specific application.

In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.

Storage Arrays

Type   Quantity   Maker     Size      Speed      Interface speed   Controller                            Drive Cache   RAID
SSD    4          Crucial   128 GB    N/A        6Gbit SATA        RocketRaid 640                        128 MB        None
HDD    4          Samsung   1000 GB   7200 RPM   3Gbit SATA        Intel ICH on Supermicro motherboard   16 MB         None

We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file > /dev/null`.

The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.

Two different read-ahead schemes are used:

  • With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.

  • With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.

In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.

There are a few different possibilities for the physical I/O:

  1. Using a separate read system call for each page: a thread finds it needs a page and reads it in place. There may be several open file descriptors on a file, so that many such calls can proceed concurrently on different threads; the OS will order the operations.

  2. Using Unix asynchronous I/O (aio.h), with the aio_* and lio_listio functions.

  3. Using a single read system call for a run of adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.

The latter two apply only to bulk I/O operations that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.

There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.

While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.

The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.

This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.

The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20-second stretch with CPU at 1200% or higher.

The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.

The database space allocation gives each index a number of 2MB extents, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or else from a specific second extent which is designated as the overflow extent of this extent. This scheme provides a sort of pseudo-locality within extents even under random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At the least, the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to improve locality in general, one could shuffle the pages of an all-dirty subtree before writing it out, so that physical order matches key order. We will look at some tricks in this vein with the column store.

For the sake of simplicity we only run 7 Single with the 1000 Mt scale.

The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.

The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.

The result with HDDs and vectored read-ahead would be worse, since vectored read-ahead leads to smaller read-ahead batches and less contiguous read patterns. The individual read times here are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.

There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will take about 40 ms at a sequential transfer speed of 50 MB/s (2 MB / 50 MB/s). Seeking to the next 2MB extent will then take a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.

We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so.

Therefore, if we now have a sequential read pattern that is denser than 1 page out of 10, we read all the pages and just keep the ones we want.

So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see.

So we try, and we find that read-ahead does not account for most pages, since it does not get triggered often enough. Thus, we change the triggering condition to be a 2nd read falling in the same extent within 20 seconds of the first.

The HDDs were in all cases about 700% busy in aggregate across the 4 drives. But with the new setting we get longer requests, most often full extents, which gives a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So each HDD has about 1.7 concurrent operations pending, but the batch size drops, dropping the throughput.

Thus, with the best settings, the test with 2000 query mixes finishes in 46 minutes, with CPU utilization steadily increasing and hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting, we got to 1200% CPU in under 5 minutes from a cold start. The I/O system can be further tuned, for example by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more.

BSBM Note

We look at query times with a semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed; Q12 runs at under 1/10th speed. The queries that remain slowest appear to be single-instance lookups, and nothing short of the most aggressive speculative reading can help there; neither query nor workload has any exploitable pattern. Therefore, if an I/O component is to be included in a BSBM metric, the only way to score on it is to use speculative reads to the maximum.

Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.

A metric for BSBM warm-up is not interesting for database science, but it may still be of practical interest in the specific case of RDF stores. Speculatively reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would also favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we could put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.

Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.

Benchmarks, Redux Series

03/07/2011 14:17 GMT Modified: 03/14/2011 17:16 GMT
Virtuoso RDF: A Getting Started Guide for the Developer [ Orri Erling ]

It is a long-standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.

The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.

You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (M-x shell) is best, but the open source isql utility should have command-line editing as well. The emacs shell is, however, convenient for cutting and pasting things between the shell and files.

To get started, cd into binsrc/tests/lubm.

To verify that this works, you can do

./test_server.sh virtuoso-t

This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.

Loading the Data

The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.

The data files themselves are in lubm_8000, 15 files in RDF/XML.

There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.

So now let's go through this procedure.

Start the server:

$ virtuoso-t -f &

This starts the server in foreground mode (it does not detach as a daemon), and the trailing & puts it in the background of the shell.

Now we connect to it with the isql utility.

$ isql 1111 dba dba 

This gives a SQL> prompt. The default username and password are both dba.

When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
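
As a quick illustration, both of the following can be typed at the same SQL> prompt. This is just a sketch to show the two syntaxes side by side; DB.DBA.RDF_QUAD is the internal table that holds the quads.

SQL> SELECT COUNT(*) FROM DB.DBA.RDF_QUAD;       -- plain SQL over the quad table

SQL> sparql SELECT COUNT(*) WHERE { ?x ?y ?z };  -- SPARQL, prefixed with the sparql keyword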

The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.

Next it calls this function for the lubm_8000 directory under the server's working directory.

sparql 
   CLEAR GRAPH <lubm>;

sparql 
   CLEAR GRAPH <inf>;

load_lubm ( server_root() || '/lubm_8000/' );

Then it verifies that the right number of triples is found in the <lubm> graph.

sparql 
   SELECT COUNT(*) 
     FROM <lubm> 
    WHERE { ?x ?y ?z } ;

The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.

Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.

sparql 
   PREFIX  ub:  <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   INSERT 
      INTO GRAPH <lubm> 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM <lubm> 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };

Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.

DB.DBA.TTLP ( file_to_string ( 'inf.nt' ), 
              'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl', 
              'inf' 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM <inf> 
    WHERE { ?x ?y ?z } ;

Then we declare that the triples in the <inf> graph can be used for inference at run time. To take advantage of this, a SPARQL query must declare that it uses the 'inft' rule set; otherwise, the declaration has no effect.

rdfs_rule_set ('inft', 'inf');
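
For example, a query that uses this inference adds the input:inference pragma after the sparql keyword; a minimal sketch (ub:Professor is just an illustrative class whose subclasses come from inf.nt):

sparql 
   DEFINE input:inference "inft"
   PREFIX  ub:  <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   SELECT COUNT(*) 
     FROM <lubm> 
    WHERE { ?x a ub:Professor } ;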

This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.

checkpoint;

Now we are ready for querying.

Querying the Data

The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.

If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.

For example:

SQL> sparql 
   PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   SELECT * 
     FROM <lubm>
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  <http://www.Department0.University0.edu/AssistantProfessor0> 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.

To stop the server, simply type shutdown; at the SQL> prompt.

If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like —

[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2

— to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).

Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
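
For instance, a parameterized request could look like the following illustrative URL; query and default-graph-uri are the standard SPARQL-protocol parameters, with the query text URL-encoded:

http://localhost:8421/sparql?default-graph-uri=lubm&query=SELECT+COUNT(*)+WHERE+%7B+%3Fx+%3Fy+%3Fz+%7D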

We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.

To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF.

To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command, as in the sketch below. Otherwise, Virtuoso will do the entire load as a single transaction and will run out of rollback space. See the documentation for more.
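
A minimal sketch of such a load session, using only the procedures shown above:

-- commit as we go (row-autocommit) instead of one huge transaction
log_enable (2);

-- load the directory as before
load_lubm ( server_root() || '/lubm_8000/' );

-- make the loaded state durable and truncate the transaction log
checkpoint;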

12/17/2008 12:31 GMT Modified: 12/17/2008 12:41 GMT
See the Lite: Embeddable/Background Virtuoso starts at 25MB [ Orri Erling ]

We have received many requests for an embeddable-scale Virtuoso. In response to this, we have added a Lite mode, where the initial size of a server process is a tiny fraction of what the initial size would be with default settings. With 2MB of disk cache buffers (ini file setting, NumberOfBuffers = 256), the process size stays under 30MB on 32-bit Linux.

The value of this is that one can now have RDF and full text indexing on the desktop without running a Java VM or any other memory-intensive software. And of course, all of SQL (transactions, stored procedures, etc.) is in the same embeddably-sized container.

The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file. The executable size is about 10MB for 32-bit Linux. A database created in Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting off; it can be reverted to Lite mode afterwards, though it will then consume somewhat more memory, etc.
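
In virtuoso.ini, the relevant settings look roughly like this (a sketch; LiteMode is the parameter name we assume for the switch, and 256 buffers corresponds to the ~2MB disk cache mentioned above — check the ini template shipped with your build):

[Parameters]
LiteMode        = 1    ; start in the reduced-footprint Lite mode
NumberOfBuffers = 256  ; about 2 MB of disk cache buffers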

Lite mode offers full SQL and SPARQL/SPARUL (via SPASQL), but disables all HTTP-based services (WebDAV, application hosting, etc.). Clients can still use all typical database access mechanisms (i.e., ODBC, JDBC, OLE-DB, ADO.NET, and XMLA) to connect, including the Jena and Sesame frameworks for RDF. ODBC now offers full support of RDF data types for C-based clients. A Redland-compatible API also exists, for use with Redland v1.0.8 and later.

Especially for embedded use, the listener can now be restricted to a Unix domain socket, which limits client connections to the local host.

Shipping an embedded Virtuoso is easy. It just takes one executable and one configuration file. Performance is generally comparable to "normal" mode, except that Lite will be somewhat less scalable on multicore systems.

The Lite mode will be included in the next Virtuoso 5 Open Source release.

# PermaLink Comments [0]
12/17/2008 09:34 GMT Modified: 12/17/2008 12:03 GMT
Virtuoso Cluster Stage 1 [ Virtuso Data Space Bot ]

I recall a quote from a stock car racing movie.

"What is the necessary prerequisite for winning a race?" asked the racing team boss.

"Being the fastest," answered the hotshot driver, after yet another wrecked engine.

"No. It is finishing the race."

In the interest of finishing, we'll now leave optimizing the cluster traffic and scheduling and move to completing functionality. Our next stop is TPC-D; after this comes TPC-C, which adds the requirement of handling distributed deadlocks; after that we add RDF-specific optimizations.

This will be Virtuoso 6 with the first stage of clustering support. This stage uses fixed partitions, so it is just like a single database, except that it runs on multiple machines. The stage after this is Virtuoso Cloud, the database with all the space-filling properties of foam, expanding and contracting to keep an even data density as load and resource availability change.

Right now, we have a pretty good idea of the final form of evaluating loop joins in a cluster, which after all is the main function of the thing. It makes sense to tune this to a point before going further. You want the pipes and pumps and turbines to have known properties and fittings before building a power plant.

To test this, we took a table of a million short rows and made one copy partitioned over 4 databases and one copy with all rows in one database. We ran all the instances on a 4-core Xeon box. We used Unix sockets for communication.

We joined the table to itself, like SELECT COUNT (*) FROM ct a, ct b WHERE b.row_no = a.row_no + 3. The + 3 causes the joined rows never to be on the same partition.
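
The single-database flavor of this setup can be sketched as follows (our sketch, not the exact script used; ct is assumed to hold just an integer row_no key and a short string, and the DDL that declares the 4-way partitioning of the cluster copy is not shown here):

CREATE TABLE ct (row_no INTEGER PRIMARY KEY, str VARCHAR);

CREATE PROCEDURE ct_fill (IN n INTEGER)
{
  DECLARE i INTEGER;
  i := 0;
  WHILE (i < n)
    {
      INSERT INTO ct (row_no, str) VALUES (i, 'row');
      i := i + 1;
    }
}
;

-- for a million rows, do log_enable (2); first so the fill is not one giant transaction
ct_fill (1000000);

-- the test query; the + 3 keeps the joined rows on different partitions in the cluster copy
SELECT COUNT (*) FROM ct a, ct b WHERE b.row_no = a.row_no + 3;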

With cluster, the single operation takes 3s and with a single process it takes 4s. The overall CPU time for cluster is about 30% higher, some of which is inevitable since it must combine results, serialize them, and so forth. Some real time is gained by doing multiple iterations of the inner loop (getting the row for b) in parallel. This can be further optimized to maybe 2x better with cluster but this can wait a little.

Then we make a stream of 10 such queries. The stream with cluster is 14s; with the single process, it is 22s. Then we run 4 streams in parallel. The time with cluster is 39s and with a single process 36s. With 16 streams in parallel, cluster takes 2m51s and the single process 3m21s.

The conclusion is that clustering overhead is not significant in a CPU-bound situation. Note that all the runs were at 4 cores at 98-100%, except for the first, single-client run, which had one process at 98% and 3 at 32%.

The SMP single process loses by having more contention for mutexes serializing index access. Each wait carries an entirely ridiculous penalty of up to 6µs or so, as discussed earlier on this blog. The cluster wins by less contention due to distributed data and loses due to having to process messages and remember larger intermediate results. These balance out, or close enough.

For the case with a single client, we can cut down on the coordination overhead by simply optimizing the code some more. This is quite possible, so we could get one process at 100% and 3 at 50%.

The numbers are only relevant as ballpark figures and the percentages will vary between different queries. The point is to prove that we actually win and do not jump from the frying pan into the fire by splitting queries across processes. As a point of comparison, running the query clustered just as one would run it locally took 53s.

We will later look at the effects of different networks, as we get to revisit the theme with some real benchmarks.

# PermaLink Comments [0]
09/06/2007 07:11 GMT Modified: 04/25/2008 15:57 GMT