RDF and Transactions

I will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists.

Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?

There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.

At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.

If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.

There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).

If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.

There is of course the OWL side, where consistency is important but is defined in such complex ways that they again are not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.

Transactions are relevant in RDF in two cases:

  1. If data is trickle-loaded in small chunks, one likes to know that the chunks do not get lost or corrupted.

  2. If the application has any semantics that reserve resources, then these operations need transactions.

The latter is not so common with RDF, but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.

Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.

Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.

As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.

As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.

This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.

As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.
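
To make this concrete, here is a minimal sketch of such a connected-client session, reusing the seat-reservation example from above. It assumes a SQL client such as isql with autocommit off; the graph, IRIs, and the retry handling shown are illustrative, not taken from any particular application.

-- Request full isolation for this connection.
SET ISOLATION = 'serializable';

-- Check whether the seat is already taken (graph and IRIs are made-up examples).
SPARQL ASK FROM <http://example.com/bookings>
WHERE { <http://example.com/seat/12A> <http://example.com/reservedBy> ?who };

-- If the ASK returned false, reserve the seat within the same transaction.
SPARQL INSERT DATA { GRAPH <http://example.com/bookings>
  { <http://example.com/seat/12A> <http://example.com/reservedBy> <http://example.com/user/42> } };

COMMIT WORK;

-- If the server reports a deadlock (SQLSTATE 40001) at any point,
-- the client rolls back and retries the whole sequence.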

Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.

With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.

If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, provided each is under a certain size (normally 10,000 quads), is guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there was always a possibility of an INSERT or DELETE failing because of a deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock-free, since there the primary transaction is the INSERT or DELETE of a small graph.

Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.

In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.

03/22/2011 18:52 GMT Modified: 03/22/2011 17:44 GMT
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?

We will here analyze what the BSBM Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales.

We will here look at database execution statistics for BSBM at different scales. Finally, we look at CPU profiles.

But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have:

SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;

KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                        114105938  3302150         2        0    3171275
DB.DBA.RDF_QUAD    RDF_QUAD                       977426773  2041156         0        0    1970712
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        8250414   509239         6       15     491631
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                 3677233812   183860         0        0     175386
DB.DBA.RDF_IRI     RDF_IRI                               32    99710    302151        5      95353
DB.DBA.RDF_QUAD    RDF_QUAD_OP                        30597    51593       168        0      48941
DB.DBA.RDF_QUAD    RDF_QUAD_SP                       265474    47210        17        0      46078
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        6020      212         3        0        212
DB.DBA.RDF_PREFIX  RDF_PREFIX                             0      167     16700        0        157

The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table data structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the PSOG index for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. The index from object value to subject is used the most but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison.

Now let us reset the counts and see what the steady state I/O profile is.

SELECT key_stat (key_table, name_part (key_name, 2), 'reset') FROM sys_keys WHERE key_migrate_to IS NULL;

SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;

KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                         30155789    79659         0        0    3191391
DB.DBA.RDF_QUAD    RDF_QUAD                       259008064     8904         0        0    1948707
DB.DBA.RDF_QUAD    RDF_QUAD_SP                        68002     7730        11        0      53360
DB.DBA.RDF_IRI     RDF_IRI                               12     5415     41653        6      98804
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                  975147136     1597         0        0     173459
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        2213525     1286         0       17     485093
DB.DBA.RDF_QUAD    RDF_QUAD_OP                         7999      904        11        0      48568
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        1494        1         0        0        213

Literal strings dominate. The SP index is used only for situations where the P is not specified, i.e., the DESCRIBE query; based on this, most of the I/O seems attributable to DESCRIBE. The first RDF_IRI entry represents translations from IRI string to IRI id; the second represents translations from IRI id to string. The touch count for the first RDF_IRI is not properly recorded, hence its miss percentage is out of line. We see SP missing the cache the most, since its use is infrequent in the mix.
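
As a side note, these two dictionary translations can be exercised directly from SQL with Virtuoso's iri_to_id and id_to_iri functions; a minimal sketch with a made-up IRI (note that iri_to_id will by default create an id for a previously unseen IRI):

SELECT iri_to_id ('http://example.org/item/1');               -- IRI string to IRI id (the RDF_IRI lookup)
SELECT id_to_iri (iri_to_id ('http://example.org/item/1'));   -- IRI id back to its string (the RI_ID lookup)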

We will next look at query processing statistics. For this we introduce a new meter.

The db_activity SQL function provides a session-by-session cumulative statistic of activity. The fields are:

  • rnd - Count of random index lookups. Each first row of a select or insert counts as one, regardless of whether something was found.
  • seq - Count of sequential rows. Every move to next row on a cursor counts as 1, regardless of whether conditions match.
  • same seg - For column store only; counts how many times the next row in a vectored join using an index falls in the same segment as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection.
  • same pg - Counts how many times a vectored index join finds the next match on the same page as the previous one.
  • same par - Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the same parent.
  • disk - Counts how many disk reads were made, including any speculative reads initiated.
  • spec disk - Counts speculative disk reads.
  • messages - Counts cluster interconnect messages.
  • B (KB, MB, GB) - Total length of the cluster interconnect messages.
  • fork - Counts how many times a thread was forked (started) for query parallelization.

The numbers are given with 4 significant digits and a scale suffix: G is 10^9 (1,000,000,000); M is 10^6 (1,000,000); K is 10^3 (1,000).

We run 2000 query mixes with 16 users. The special http account keeps a cumulative account of all activity on web server threads.

SELECT db_activity (2, 'http');

1.674G rnd  3.223G seq      0 same seg  1.286G same pg  314.8M same par  6.186M disk  6.461M spec disk      0B /     0 messages  298.6K fork

We see that random access dominates. The seq number is about twice the rnd number, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, the same seg is 0; the same pg indicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one.

There are more speculative reads than disk reads, which is an artifact of counting some concurrently speculated reads twice. This does indicate that speculative reads dominate. This is because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes.

Now let us look at the same reading after 2000 mixes, 16 users, at 100 Mt.

234.3M rnd  420.5M seq      0 same seg   188.8M same pg  29.09M same par  808.9K disk  919.9K spec disk      0B /      0 messages  76K fork

We note that the ratios between the random and sequential and same page/parent counts are about the same. The sequential number looks to be even a bit smaller in proportion. The count of random accesses for the 100Mt run is 14% of the count for the 1000Mt run. The count of query parallelization threads is also much lower since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that according to the cost model guess, the thread must have at least 5ms worth of work.

We note that the 100 Mt throughput is a little over three times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven times faster instead, given that it does this much less work.

We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor per se. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there.
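
For reference, the comparison counts quoted here are simply the binary-search depths of the two tree sizes:

log2 (100,000,000)   ≈ 26.6, i.e., about 27 comparisons per random lookup at 100 Mt
log2 (1,000,000,000) ≈ 29.9, i.e., about 30 comparisons per random lookup at 1000 Mt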

To elucidate this last question, we look at the CPU profiles. We take an oprofile of 100 Single User mixes at both scales.

For 100 Mt:
61161    10.1723  cmpf_iri64n_iri64n_anyn_gt_lt
31321     5.2093  box_equal
19027     3.1646  sqlo_parse_tree_has_node
15905     2.6453  dk_alloc
15647     2.6024  itc_next_set_neq
12702     2.1126  itc_vec_split_search
12487     2.0768  itc_dive_transit
11450     1.9044  itc_bm_vec_row_check
10646     1.7706  itc_page_rcf_search
 9223     1.5340  id_hash_get
 9215     1.5326  gen_qsort
 8867     1.4748  sqlo_key_part_best
 8807     1.4648  itc_param_cmp
 8062     1.3409  cmpf_iri64n_iri64n
 6820     1.1343  sqlo_in_list
 6005     0.9987  dc_iri_id_cmp
 5905     0.9821  dk_free_tree
 5801     0.9648  box_hash
 5509     0.9163  dks_esc_write
 5444     0.9054  sql_tree_hash_1

For 1000 Mt:
754331   31.4149  cmpf_iri64n_iri64n_anyn_gt_lt
146165    6.0872  itc_vec_split_search
144795    6.0301  itc_next_set_neq
131671    5.4836  itc_dive_transit
110870    4.6173  itc_page_rcf_search
 66780    2.7811  gen_qsort
 66434    2.7667  itc_param_cmp
 58450    2.4342  itc_bm_vec_row_check
 55213    2.2994  dk_alloc
 47793    1.9904  cmpf_iri64n_iri64n
 44277    1.8440  dc_iri_id_cmp
 39489    1.6446  cmpf_int64n
 36880    1.5359  dc_append_bytes
 36601    1.5243  dv_compare
 31286    1.3029  dc_any_value_prefetch
 25457    1.0602  itc_next_set
 20852    0.8684  box_equal
 19895    0.8285  dk_free_tree
 19698    0.8203  itc_page_insert_search
 19367    0.8066  dc_copy

The top function in both is the compare for an equality of two leading IRIs and a range for the trailing any. This corresponds to the range check in Q5. At the larger scale this is three times more important. At the smaller scale, the share of query optimization is about 6.5 times greater. The top function in this category is box_equal with 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile.

From this sample, it appears that ten times more scale means about seven times more database operations. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. Since the 1000 Mt run performs roughly seven times the database operations at about a third of the query-mix rate, we could say that Virtuoso is a bit over two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt.

We may conclude that different BSBM scales measure different things. The TPC workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales.

This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix.

Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks.

Benchmarks, Redux Series

03/07/2011 18:39 GMT Modified: 08/21/2014 10:45 GMT
Benchmarks, Redux (part 6): BSBM and I/O, continued

In the words of Jim Gray, disks have become tapes. By this he means that a disk is really only good for sequential access. For this reason, the SSD extent read ahead was incomparably better. We note that in the experiment, every page in the general area of the database touched by the experiment would in time be touched, and that the whole working set would end up in memory. Therefore no speculative read would be wasted, and it stands to reason to read whole extents.

So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read ahead would require more temporal locality before kicking in.

Still, the scheme was not really good since the rest of the extent would go for background-read and the triggering read would be done right then, leading to extra seeks. Well, this is good for latency but bad for throughput. So I changed this too, going to an "elevator only" scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality with its attendant 10+ ms penalty.

We keep in mind that the test we target is BSBM warm-up time, which is purely a throughput business. One could have timeouts and could penalize queries that sacrificed too much latency to throughput.

We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run.

We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since at least one thread's random I/Os get to go to multiple threads. The benefit is less in multiuser situations where disks are randomly busy anyhow.

In the previous I/O experiments, we saw that with vectored read ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily-triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset.

In Virtuoso 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the cache quite fast, but was useless after the cache was full. The extent read ahead first implemented in 6 was less aggressive, but would continue working with full cache and did in fact help with shifts in the working set.

The next logical step is to combine the vector and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to this extent within the time window, we just add all the uncached allocated pages of the extent to the batch.

With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5000 to 20000 pages. The SSD starting time drops to about 120 seconds from cold start to reach 1200% CPU. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates around 14 MB/s per drive, mostly reading chunks of an average of seventy-one (71) 8K pages.

The BSBM workload does not offer better possibilities for optimization, short of pre-reading the whole database, which is not practical at large scales.

Some Details

First we start from cold disk, with and without mandatory read of the whole extent on the touch.

Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes:

 0: 151560.82 ms, total: 151718 ms
 1: 179589.08 ms, total: 179648 ms
 2:  71974.49 ms, total:  72017 ms
 3: 102701.73 ms, total: 102729 ms
 4:  58834.41 ms, total:  58856 ms
 5:  65926.34 ms, total:  65944 ms
 6:  68244.69 ms, total:  68274 ms
 7:  39197.15 ms, total:  39215 ms
 8:  45654.93 ms, total:  45674 ms
 9:  34850.30 ms, total:  34878 ms
10: 100061.30 ms, total: 100079 ms

The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read was 16 ms.

With vectored read-ahead and full extents only, i.e., max speculation:

 0: 178854.23 ms, total: 179034 ms
 1: 110826.68 ms, total: 110887 ms
 2:  19896.11 ms, total:  19941 ms
 3:  36724.43 ms, total:  36753 ms
 4:  21253.70 ms, total:  21285 ms
 5:  18417.73 ms, total:  18439 ms
 6:  21668.92 ms, total:  21690 ms
 7:  12236.49 ms, total:  12267 ms
 8:  14922.74 ms, total:  14945 ms
 9:  11502.96 ms, total:  11523 ms
10:  15762.34 ms, total:  15792 ms
...

90:   1747.62 ms, total:   1761 ms
91:   1701.01 ms, total:   1714 ms
92:   1300.62 ms, total:   1318 ms
93:   1873.15 ms, total:   1886 ms
94:   1508.24 ms, total:   1524 ms
95:   1748.15 ms, total:   1761 ms
96:   2076.92 ms, total:   2090 ms
97:   2199.38 ms, total:   2212 ms
98:   2305.75 ms, total:   2319 ms
99:   1771.91 ms, total:   1784 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  1.3006s / 178.8542s
Elapsed runtime:           872.993 seconds
QMpH:                      412.374 query mixes per hour

The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%.

We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better.

Then the same with cold SSDs. First with no speculation:

 0:   5177.68 ms, total:   5302 ms
 1:   2570.16 ms, total:   2614 ms
 2:   1353.06 ms, total:   1391 ms
 3:   1957.63 ms, total:   1978 ms
 4:   1371.13 ms, total:   1386 ms
 5:   1765.55 ms, total:   1781 ms
 6:   1658.23 ms, total:   1673 ms
 7:   1273.87 ms, total:   1289 ms
 8:   1355.19 ms, total:   1380 ms
 9:   1152.78 ms, total:   1167 ms
10:   1787.91 ms, total:   1802 ms
...

90:   1116.25 ms, total:   1128 ms
91:    989.50 ms, total:   1001 ms
92:    833.24 ms, total:    844 ms
93:   1137.83 ms, total:   1150 ms
94:    969.47 ms, total:    982 ms
95:   1138.04 ms, total:   1149 ms
96:   1155.98 ms, total:   1168 ms
97:   1178.15 ms, total:   1193 ms
98:   1120.18 ms, total:   1132 ms
99:   1013.16 ms, total:   1025 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  0.8201s / 5.1777s
Elapsed runtime:           127.555 seconds
QMpH:                      2822.321 query mixes per hour

The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%.

Now, SSDs with max speculation.

 0:  44670.34 ms, total:  44809 ms
 1:  18490.44 ms, total:  18548 ms
 2:   7306.12 ms, total:   7353 ms
 3:   9452.66 ms, total:   9485 ms
 4:   5648.56 ms, total:   5668 ms
 5:   5493.21 ms, total:   5511 ms
 6:   5951.48 ms, total:   5970 ms
 7:   3815.59 ms, total:   3834 ms
 8:   4560.71 ms, total:   4579 ms
 9:   3523.74 ms, total:   3543 ms
10:   4724.04 ms, total:   4741 ms
...

90:    673.53 ms, total:    685 ms
91:    534.62 ms, total:    545 ms
92:    730.81 ms, total:    742 ms
93:   1358.14 ms, total:   1370 ms
94:   1098.64 ms, total:   1110 ms
95:   1232.20 ms, total:   1243 ms
96:   1259.57 ms, total:   1273 ms
97:   1298.95 ms, total:   1310 ms
98:   1156.01 ms, total:   1166 ms
99:   1025.45 ms, total:   1034 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  0.4725s / 44.6703s
Elapsed runtime:           269.323 seconds
QMpH:                      1336.683 query mixes per hour

The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%.

The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merged reads; the differences were small. The max I/O was 353 MB/s, and average 173 MB/s; average CPU 113%.

We see that the start latency is quite a bit longer than without speculation and the CPU % is lower due to higher latency of individual I/O. The I/O rate is fair. We would expect more throughput however.

We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive requests of 1 page, does not make a lot of difference. The peak I/O is a bit higher; overall throughput is a bit lower.

We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K 4KB random I/Os promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s sequential read with `cat file > /dev/null` and two drives busy. With 4 drives busy, this does not get better. The best 30 second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with the cat to /dev/null figure. We will later test with 8 SSDs with better controllers.

Note that the average I/O and CPU are averages over 30 second measurement windows; thus for short running tests, there is some error from the window during which the activity ended.

Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max speculation read-ahead filled the cache in 17 minutes, with an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead; one which in fact never gets triggered with the BSBM Explore in steady state. The vectored read-ahead is kept on since this by itself does not read pages that are not needed. However, the vectored read-ahead does not run either, because the data that is accessed in larger batches is already in memory. Thus there remains a trickle of an average 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving RDF literal strings, presumably on behalf of the DESCRIBE query in the mix. This query touches things no other query touches and does so one subject at a time, in a way that can neither be anticipated nor optimized.

The Virtuoso 7 column store will deal with this better because it is more space efficient overall. If we apply stream-compression to literals, these will go in under half the space, while quads will go in maybe one-quarter the space. Thus 3000 Mt all from memory should be possible with 72 GB RAM. 1000 Mt row-wise did fit in 72 GB RAM, except for the random literals accessed by the DESCRIBE. This alone drops throughput to under a third of the memory-only throughput if using HDDs. SSDs, on the other hand, can largely neutralize this effect.

Conclusions

We have looked at basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks without need for reconfiguration or complex I/O policies. We have been able to get a decent read rate under conditions of server warm-up or shift of working set even with HDDs.

More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns.

As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store just must eagerly pre-read. This is not hard to do and can be quite useful.

Benchmarks, Redux Series

03/07/2011 17:36 GMT Modified: 08/21/2014 10:45 GMT
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs

In the context of database benchmarks we cannot ignore I/O, as BSBM has pretty much done so far.

There are two approaches:

  1. run twice or otherwise make sure one runs from memory and forget about I/O, or

  2. make rules and metrics for warm-up.

We will see if the second is possible with BSBM.

From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on sets of each of HDDs (hard disk drives) and SSDs (solid-state drives). We will see that SSDs in this specific application can make a significant difference.

In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.

Storage Arrays
Type  Quantity  Maker    Size     Speed     Interface speed  Controller                           Drive Cache  RAID
====  ========  =======  =======  ========  ===============  ===================================  ===========  ====
SSD   4         Crucial  128 GB   N/A       6Gbit SATA       RocketRaid 640                       128 MB       None
HDD   4         Samsung  1000 GB  7200 RPM  3Gbit SATA       Intel ICH on Supermicro motherboard  16 MB        None

We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file > /dev/null`.

The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.

Two different read-ahead schemes are used:

  • With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.

  • With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.

In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.

There are a few different possibilities for the physical I/O:

  1. Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.

  2. A thread finds it needs a page and reads it.

  3. Using Unix asynchronous I/O, aio.h, with the aio_* and lio_listio functions.

  4. Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.

The two latter apply only to bulk I/O that is scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.

There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.

While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.

The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.

This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.

The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.

The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.

The database space allocation gives each index a number of 2MB extents, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store.

For the sake of simplicity we only run 7 Single with the 1000 Mt scale.

The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.

The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.

The result with HDDs and vectored read-ahead would be worse, since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.

There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.

We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so.

Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want.

So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see.

So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.

The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput.

Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more.

BSBM Note

We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum.

Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.

A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.

Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.

Benchmarks, Redux Series

03/07/2011 14:17 GMT Modified: 08/21/2014 10:45 GMT
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire

Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning.

  1. Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e.g., web server], any other thread-related settings)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.

    The following three settings are all in the [Parameters] section of the virtuoso.ini file.

    • AsyncQueueMaxThreads controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better.

    • ThreadsPerQuery is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better.

    • IndexTreeMaps is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial.

      A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.

    In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.

    Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count.

  2. File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.

    It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.

    In the [Parameters] section of the virtuoso.ini file, set FDsPerFile to be (the number of concurrent threads * 1.5) ÷ the number of distinct database files.

    There are no SSD specific settings.

  3. Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?

    Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.

    Use the built-in bulk load facility, i.e.,

    ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');

    For example,

    SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');

    Then do a rdf_loader_run () on enough connections. For example, you can use the shell command

    isql rdf_loader_run () &

    to start one in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see isql documentation for details).

    See the BSBM disclosure report for an example load script; a minimal sketch is also given after the last answer below.

  4. What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.

    Execute

    CHECKPOINT;

    through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol.

  5. What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.

    No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

    DefaultIsolation = 4

    Transaction isolation cannot be set over the SPARQL protocol.

    NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.

  6. What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.

    In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.

  7. What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache?

    In an isql session, execute

    STATUS ();

    The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format.

  8. What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.

    Execute on an isql session:

    CHECKPOINT;
    SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
    

    The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported, gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, RDF_OBJ are for dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for text index.

  9. If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph.

    The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute

    STATISTICS DB.DBA.RDF_QUAD;

  10. For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index?

    The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order.

  11. For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?

    In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of 10000 to 50000 or so if this is seen to be useful.

    To change this on the fly, the following can be issued through an isql session:

    cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );

    The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.

     STATUS ('cluster');      -- whole cluster
     STATUS ('cluster_d');    -- process-by-process

  12. Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?

    • For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini, with

      StopCompilerWhenXOverRun = 1

    • When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

      DefaultIsolation = 2
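
As a footnote to answers #3 and #4 above, here is a minimal sketch of a complete bulk-load script run from the shell. The port (1111), credentials (dba/dba), directory, file pattern, graph IRI, and number of loader processes are examples only; the directory must also be listed in DirsAllowed in virtuoso.ini for ld_dir to read it.

# Register the files to load (run once).
isql 1111 dba dba exec="ld_dir ('/data/bsbm', '*.ttl', 'http://bsbm.example.org');"

# Start one loader per core; each rdf_loader_run() takes files off the shared load list.
isql 1111 dba dba exec="rdf_loader_run ();" &
isql 1111 dba dba exec="rdf_loader_run ();" &
isql 1111 dba dba exec="rdf_loader_run ();" &
isql 1111 dba dba exec="rdf_loader_run ();" &
wait

# Make the loaded state durable.
isql 1111 dba dba exec="checkpoint;"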

Benchmarks, Redux Series

03/04/2011 15:28 GMT Modified: 08/21/2014 10:46 GMT
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire

Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning.

  1. Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e,.g. web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.

    The following three settings are all in the [Parameters] section of the virtuoso.ini file.

    • AsyncQueueMaxThreads controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better.

    • ThreadsPerQuery is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better.

    • IndexTreeMaps is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial.

      A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.

    In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.

    Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count.

  2. File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.

    It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.

    in the [Parameters] section of the virtuoso.ini file, set FDsPerFile to be (the number of concurrent threads * 1.5) ÷ the number of distinct database files.

    There are no SSD specific settings.

  3. Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?

    Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.

    Use the built-in bulk load facility, i.e.,

    ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');

    For example,

    SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');

    Then do a rdf_loader_run () on enough connections. For example, you can use the shell command

    isql rdf_loader_run () &

    to start one in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see isql documentation for details).

    See the BSBM disclosure report for an example load script.
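
    As a sketch, assuming a local server on port 1111, default dba credentials, an 8-core box, and data files under /data/dbpedia (all of these are placeholders), a shell-level parallel load could look like this:

      # register the files to load
      isql 1111 dba dba exec="ld_dir ('/data/dbpedia', '*.n3', 'http://dbpedia.org');"

      # start one loader per core, then wait for all of them to finish
      for i in 1 2 3 4 5 6 7 8 ; do
          isql 1111 dba dba exec="rdf_loader_run ();" &
      done
      wait

      # make the loaded state durable (see the next question)
      isql 1111 dba dba exec="checkpoint;"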

  4. What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.

    Execute

    CHECKPOINT;

    through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol.

  5. What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.

    No special settings are needed for trickle load; the defaults produce transactional behavior with a roll forward log. Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

    DefaultIsolation = 4

    Transaction isolation cannot be set over the SPARQL protocol.

    NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.

  6. What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.

    In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.
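
    For example, on a hypothetical machine with 72 GB of RAM dedicated mostly to Virtuoso, two-thirds of memory is about 48 GB; at roughly 10KB per cached page this gives on the order of 4,800,000 buffers. A sketch of the corresponding entries follows; both values are illustrative, and the MaxDirtyBuffers ratio of about three quarters is an assumption rather than a requirement.

      [Parameters]
      NumberOfBuffers = 4800000   ; ~48 GB of 8KB pages, counting ~10KB per page
      MaxDirtyBuffers = 3600000   ; roughly 3/4 of NumberOfBuffers (assumption)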

  7. What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache?

    In an isql session, execute

    STATUS ();

    The second paragraph of the result gives counts of total, used, and dirty buffers. If the used-buffer count stays steady and below the total, and the disk read count on the following line does not increase, the system is running from memory. The cached format is the same as the on-disk format.

  8. What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.

    Execute on an isql session:

    CHECKPOINT;
    SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
    

    The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, and RDF_OBJ are for dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for the text index.

  9. If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph.

    The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute

    STATISTICS DB.DBA.RDF_QUAD;

  10. For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index?

    The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order.

  11. For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?

    In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of 10000 to 50000 or so if this is seen to be useful.

    To change this on the fly, the following can be issued through an isql session:

    cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );

    The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.

    STATUS ('cluster')     ;; whole cluster
    STATUS ('cluster_d')   ;; process-by-process

  12. Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?

    • For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini, with

      StopCompilerWhenXOverRun = 1
    • When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

      DefaultIsolation = 2
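
      For instance, an isql session could switch itself to READ COMMITTED before running its query mix with something along these lines (using the SET ISOLATION option of Virtuoso SQL; the SPARQL query is an arbitrary placeholder):

      SET ISOLATION = 'committed';
      -- subsequent statements on this connection run at READ COMMITTED
      SPARQL SELECT (COUNT(*) AS ?c) FROM <http://dbpedia.org> WHERE { ?s ?p ?o };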

Benchmarks, Redux Series

# PermaLink Comments [0]
03/04/2011 15:28 GMT Modified: 08/21/2014 10:46 GMT
Benchmarks, Redux (part 2): A Benchmarking Story
Caeterum censeo, benchmarks are for vendors...

This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks.

We begin right after the publication of the recent Berlin report. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole BSBM matter and the general benchmarking question in forthcoming posts; for now, let's talk about specifics.

In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:

  • 6 Single is the generally available single-instance-server configuration of Virtuoso. Whether this is open source or not does not make a difference.

  • 6 Cluster is the generally available, commercial-only, cluster-capable Virtuoso.

  • 7 Single is the next-generation single-instance-server Virtuoso, about to be released as a preview.

We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM Explore mix at one scale got better throughput as we added more clients, approximately as one would expect based on CPU usage and number of cores, while at another scale this was not so.

At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU going from 200% with 1 client to 1400% with 16 clients but throughput increased by less than 20%.

When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload.

See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous.

Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Quite right, when building with the mutex meter enabled, counting how many times each mutex is acquired and how many times this results in a wait, we found a mutex which gets acquired 600M times in the run, of which an insane 450M result in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex.

Waiting for a mutex is a real train wreck. We have tried spinning a few times before it, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better.

Now, the mutex in question happens to serialize the buffer cache for one specific page of data, one level down from the root of the index for RDF PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties go their separate ways directly from the index root.

One might here ask why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only for the first level down. One might also ask why have a mutex in the first place. Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. Same for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have all in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing this with the readers.

Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time.
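
To make the access pattern concrete, here is a schematic fragment in the spirit of BSBM Q5; the prefix, property name, and product IRI are simplified stand-ins for the real benchmark vocabulary. The batch of ?other candidates supplied by vectoring fixes S, and the property fixes P, but the FILTER turns the O condition into a range, which is what kept the lookup restarting from the index root.

    SPARQL
    PREFIX bsbm: <http://example.org/bsbm-vocabulary#>
    SELECT ?other
    WHERE
      {
        <http://example.org/Product123>  bsbm:productPropertyNumeric1  ?origValue .
        ?other  a  bsbm:Product .
        ?other  bsbm:productPropertyNumeric1  ?value .
        FILTER ( ?value > ?origValue - 120  &&  ?value < ?origValue + 120 )
      };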

So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously "good" throughput of 40K QMpH at 100 Mt went to 66K QMpH.

Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. Each partition had a page getting the hits but since the partitioning was on S and S was about-evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue.

We see how detailed analysis of benchmarks can lead to almost an order of magnitude improvement in a short time. This analysis is however both difficult and tedious. It is not readily delegable; one needs real knowledge of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too, but application optimization tends to stop at the point where the application is usable; also, stored procedures and specially tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development, but then this falls under benchmarking and evaluation.

So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios.

Benchmarks, Redux Series

# PermaLink Comments [0]
02/28/2011 16:12 GMT Modified: 08/21/2014 10:46 GMT
Virtuoso Directions for 2011

At the start of 2010, I wrote that 2010 would be the year when RDF became performance- and cost-competitive with relational technology for data warehousing and analytics. More specifically, RDF would shine where data was heterogenous and/or where there was a high frequency of schema change.

I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.

At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogenous data type and possible sort order of the column. Vectored execution means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.

So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso's relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented HASH JOIN and GROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.

At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.

A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time-typing turns out to be small, since data in practice tends to be of homogenous type and with locality of reference in values. Virtuoso's column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogenous types.

The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side.

The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.

LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the "RDF tax," by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.

So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.

For now, our priority is to release the substantial gains that have already been accomplished.

After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper.
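
To illustrate what "trivially equivalent" means here, consider a toy aggregation in the general style of (but much simpler than) TPC-H Q1, first in SQL and then as SPARQL over a direct triple rendering of the same rows. The tpch: predicate names are invented for this sketch and do not come from any published mapping.

    SELECT   l_returnflag, SUM (l_extendedprice) AS revenue
      FROM   lineitem
     WHERE   l_shipdate <= '1998-09-02'
    GROUP BY l_returnflag;

    SPARQL
    PREFIX tpch: <http://example.org/tpch#>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
    SELECT ?flag (SUM (?price) AS ?revenue)
    WHERE
      {
        ?li  tpch:returnflag     ?flag ;
             tpch:extendedprice  ?price ;
             tpch:shipdate       ?date .
        FILTER ( ?date <= "1998-09-02"^^xsd:date )
      }
    GROUP BY ?flag;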

Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.

The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.

Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed.

The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?

Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of one's own and competitors' brands, and in-house emails for assessing organizational response to events on the market.

These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of Datalog is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start.

Stay tuned for more news later this month!

Related

# PermaLink Comments [0]
01/19/2011 11:29 GMT Modified: 01/20/2011 12:54 GMT
VLDB Semdata Workshop

I will begin by extending my thanks to the organizers, in particular Reto Krummenacher of STI and Atanas Kiryakov of Ontotext, for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontiffs (pontifex) amongst us, who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities, to the advantage of all concerned.

Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.

I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.

One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table?

The answer is that if one accesses a lot of these literals, the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there do in fact exist reports in bioinformatics with 100,000 strings. Now take the worst abuse of SPARQL: a regexp over all literals of a property of a given class. With a column store this is a scan of the column; with RDF, a three-table join. The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.
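
For concreteness, the kind of query in question might look like the sketch below; the class and property names are hypothetical. Every label of every instance of the class has to be fetched and matched, which is exactly where the join against the literal dictionary comes in.

    SPARQL
    PREFIX ex:   <http://example.org/schema#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?label
    WHERE
      {
        ?s  a           ex:Protein ;
            rdfs:label  ?label .
        FILTER ( regex (?label, "kinase", "i") )
      };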

Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code.

Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogenous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.

Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.

With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.

Of the academic papers, SpiderStore (the paper is not yet available at time of writing, but should be soon) and Webpie should be specially noted.

Let us talk about SpiderStore first.

SpiderStore

The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.

According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. With the per-IRI overhead amortized over the triples that reference each IRI, this works out to about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.

This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple, plus should one add a graph to the mix one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag.

We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.

But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.

SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, specially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.

We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.

If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.

Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.

Webpie

Webpie from VU Amsterdam and the LarKC EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.

Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.

The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT … SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.
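
A minimal sketch of one such step, here the rdfs:subClassOf closure over instance types written as SPARQL 1.1 INSERT against a hypothetical graph <urn:data>: a driver would re-run the statement, checking the reported insert count, until nothing new gets added.

    SPARQL
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    INSERT
      {
        GRAPH <urn:data> { ?s a ?super }
      }
    WHERE
      {
        GRAPH <urn:data>
          {
            ?s    a                ?sub .
            ?sub  rdfs:subClassOf  ?super .
            FILTER NOT EXISTS { ?s a ?super }
          }
      };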

We have suggested such an experiment to the LarKC people. We will see.

# PermaLink Comments [0]
09/21/2010 17:14 GMT Modified: 09/21/2010 16:22 GMT