We will here analyze what the BSBM Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales.

We will look at database execution statistics for BSBM at different scales. Finally, we look at CPU profiles.

But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have:

SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;

KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                        114105938  3302150         2        0    3171275
DB.DBA.RDF_QUAD    RDF_QUAD                       977426773  2041156         0        0    1970712
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        8250414   509239         6       15     491631
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                 3677233812   183860         0        0     175386
DB.DBA.RDF_IRI     RDF_IRI                               32    99710    302151        5      95353
DB.DBA.RDF_QUAD    RDF_QUAD_OP                        30597    51593       168        0      48941
DB.DBA.RDF_QUAD    RDF_QUAD_SP                       265474    47210        17        0      46078
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        6020      212         3        0        212
DB.DBA.RDF_PREFIX  RDF_PREFIX                             0      167     16700        0        157

The first column is the table, then the index, then the number of times a row was found (touches). The fourth number is the count of disk pages read; the fifth is reads as a percentage of touches; the sixth is the count of dirty pages in the buffer pool; and the last is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table data structure separate from the index. Most of the reads are for strings or RDF literals (RDF_OBJ). After this comes the PSOG index (RDF_QUAD), used for getting a property value given the subject. After this, but much lower, come lookups of IRI strings given the IRI ID. The index from object value to subject (RDF_QUAD_POGS) has by far the most touches, but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison.
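
As a quick check, the per-index cache footprint and miss rate can be computed from these same columns. This is a minimal sketch, assuming the sys_d_stat column names shown above and the 8K page size just noted:

SELECT TOP 10
       key_table, index_name,
       n_buffers * 8 / 1024 AS cache_mb,    -- 8K pages: n_buffers * 8 gives KB, / 1024 gives MB
       CASE WHEN touches > 0
            THEN 100.0 * reads / touches
            ELSE 0 END AS miss_pct          -- reads as a percentage of touches
  FROM sys_d_stat
 ORDER BY n_buffers DESC;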

Now let us reset the counts and see what the steady state I/O profile is.

SELECT key_stat (key_table, name_part (key_name, 2), 'reset') FROM sys_keys WHERE key_migrate_to IS NULL;

SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;

KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                         30155789    79659         0        0    3191391
DB.DBA.RDF_QUAD    RDF_QUAD                       259008064     8904         0        0    1948707
DB.DBA.RDF_QUAD    RDF_QUAD_SP                        68002     7730        11        0      53360
DB.DBA.RDF_IRI     RDF_IRI                               12     5415     41653        6      98804
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                  975147136     1597         0        0     173459
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        2213525     1286         0       17     485093
DB.DBA.RDF_QUAD    RDF_QUAD_OP                         7999      904        11        0      48568
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        1494        1         0        0        213

Literal strings (RDF_OBJ) dominate the reads. The SP index is used only in situations where the P is not specified, i.e., the DESCRIBE query, so its I/O seems attributable mostly to that query. The first RDF_IRI line represents translations from IRI string to IRI ID; the second (the UNQC_RI_ID index) represents translations from IRI ID to string. The touch count for the first RDF_IRI is not properly recorded, hence its miss % is out of line. We see SP missing the cache the most, since its use is infrequent in the mix.
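
For illustration, a DESCRIBE binds the subject but leaves the property open, which is exactly the lookup the SP index serves. A sketch as run through Virtuoso's SQL interface, with a hypothetical instance IRI standing in for a BSBM product:

-- hypothetical BSBM-style instance IRI, for illustration only
SPARQL DESCRIBE <http://example.com/instances/dataFromProducer1/Product1>;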

We will next look at query processing statistics. For this we introduce a new meter.

The db_activity SQL function provides session-by-session cumulative statistics of activity. The fields are:

  • rnd - Count of random index lookups. Each first row of a select or insert counts as one, regardless of whether something was found.
  • seq - Count of sequential rows. Every move to next row on a cursor counts as 1, regardless of whether conditions match.
  • same seg - For column store only; counts how many times the next row in a vectored join using an index falls in the same segment as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection.
  • same pg - Counts how many times a vectored index join finds the next match on the same page as the previous one.
  • same par - Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the same parent.
  • disk - Counts how many disk reads were made, including any speculative reads initiated.
  • spec disk - Counts speculative disk reads.
  • messages - Counts cluster interconnect messages.
  • B (KB, MB, GB) - Total length of the cluster interconnect messages.
  • fork - Counts how many times a thread was forked (started) for query parallelization.

The numbers are given with 4 significant digits and a scale suffix. G is 10^9 (1,000,000,000); M is 10^6 (1,000,000); K is 10^3 (1,000).
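
For example, in the readout below, 1.674G rnd means 1.674 × 10^9, i.e., roughly 1,674 million random index lookups, and 298.6K fork means about 298,600 thread forks.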

We run 2000 query mixes with 16 users at the 1000 Mt scale. The special http account keeps a cumulative account of all activity on web server threads.

SELECT db_activity (2, 'http');

1.674G rnd  3.223G seq      0 same seg  1.286G same pg  314.8M same par  6.186M disk  6.461M spec disk      0B /     0 messages  298.6K fork

We see that random access dominates. The seq number is about twice the rnd number, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, the same seg is 0; the same pg indicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one.
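
These ratios come straight from the db_activity line above:

  3.223G seq      / 1.674G rnd ≈ 1.9 rows per random lookup
  1.286G same pg  / 1.674G rnd ≈ 0.77, i.e., 77% on the same page
  314.8M same par / 1.674G rnd ≈ 0.19, i.e., most of the rest under the same parent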

There are more speculative reads than disk reads, which is an artifact of counting some concurrently speculated reads twice. It does indicate, though, that speculative reads dominate: a large part of the run was in the warm-up state, with aggressive speculative reading. We reset the counts and run another 2000 mixes.

Now let us look at the same readout after 2000 mixes with 16 users at 100 Mt.

234.3M rnd  420.5M seq      0 same seg   188.8M same pg  29.09M same par  808.9K disk  919.9K spec disk      0B /      0 messages  76K fork

We note that the ratios between the random, sequential, and same-page/same-parent counts are about the same; the sequential number is, if anything, a bit smaller in proportion. The count of random accesses for the 100 Mt run is 14% of the count for the 1000 Mt run (234.3M vs 1.674G). The count of query parallelization threads is also much lower, since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that, according to the cost model's guess, the thread must have at least 5 ms worth of work.

We note that the 100 Mt throughput is a little over three times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven times faster instead, given that it does that much less work.

We note that for a one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor per se. Besides, vectoring often makes the tree look only one or two levels deep, so the total row count matters even less there.
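
The comparison counts are just the depth of a binary search over the row count:

  log2 (100,000,000)   ≈ 26.6, i.e., about 27 comparisons
  log2 (1,000,000,000) ≈ 29.9, i.e., about 30 comparisons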

To elucidate this last question, we look at the CPU profiles. We take an oprofile of 100 single-user mixes at both scales.

For 100 Mt:
61161    10.1723  cmpf_iri64n_iri64n_anyn_gt_lt
31321     5.2093  box_equal
19027     3.1646  sqlo_parse_tree_has_node
15905     2.6453  dk_alloc
15647     2.6024  itc_next_set_neq
12702     2.1126  itc_vec_split_search
12487     2.0768  itc_dive_transit
11450     1.9044  itc_bm_vec_row_check
10646     1.7706  itc_page_rcf_search
 9223     1.5340  id_hash_get
 9215     1.5326  gen_qsort
 8867     1.4748  sqlo_key_part_best
 8807     1.4648  itc_param_cmp
 8062     1.3409  cmpf_iri64n_iri64n
 6820     1.1343  sqlo_in_list
 6005     0.9987  dc_iri_id_cmp
 5905     0.9821  dk_free_tree
 5801     0.9648  box_hash
 5509     0.9163  dks_esc_write
 5444     0.9054  sql_tree_hash_1

For 1000 Mt:
754331   31.4149  cmpf_iri64n_iri64n_anyn_gt_lt
146165    6.0872  itc_vec_split_search
144795    6.0301  itc_next_set_neq
131671    5.4836  itc_dive_transit
110870    4.6173  itc_page_rcf_search
 66780    2.7811  gen_qsort
 66434    2.7667  itc_param_cmp
 58450    2.4342  itc_bm_vec_row_check
 55213    2.2994  dk_alloc
 47793    1.9904  cmpf_iri64n_iri64n
 44277    1.8440  dc_iri_id_cmp
 39489    1.6446  cmpf_int64n
 36880    1.5359  dc_append_bytes
 36601    1.5243  dv_compare
 31286    1.3029  dc_any_value_prefetch
 25457    1.0602  itc_next_set
 20852    0.8684  box_equal
 19895    0.8285  dk_free_tree
 19698    0.8203  itc_page_insert_search
 19367    0.8066  dc_copy

The top function in both is the compare for an equality on two leading IRIs and a range on the trailing any (cmpf_iri64n_iri64n_anyn_gt_lt). This corresponds to the range check in Q5. At the larger scale its share is about three times greater. At the smaller scale, conversely, the share of query optimization is about 6.5 times greater: the top function in this category is box_equal, with 5.2% vs 0.87%, and the remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile.
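
The 14.3% figure can be checked by summing the compiler-side entries of the 100 Mt top 20: box_equal (5.21) + sqlo_parse_tree_has_node (3.16) + sqlo_key_part_best (1.47) + sqlo_in_list (1.13) + box_hash (0.96) + sql_tree_hash_1 (0.91), plus id_hash_get (1.53) if, as we assume here, it is charged to compilation, gives roughly 14.4%, in line with the quoted figure.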

From this sample, it appears that ten times the scale means about seven times the database operations per query mix. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. From these numbers, we could say that Virtuoso is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt.
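
The efficiency estimate is simple arithmetic: the 1000 Mt run does about 7 times the operations per mix (1.674G vs 234.3M random lookups) at a little under 1/3 of the mix rate, i.e., roughly 7 / 3 ≈ 2.3 times the database operations per unit of time.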

We may conclude that different BSBM scales measure different things. The TPC workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales.

This is not necessarily something that should be fixed in the BSBM Explore mix. We must, however, take these factors better into account in developing the BI mix.

Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks.
