Below is a list of possible extensions to the Berlin SPARQL Benchmark (BSBM). Our earlier critique of BSBM can be summarized as follows:

  1. The queries touch very little data, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of RDF.

  2. Most queries are logarithmic with respect to the scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.

  3. An update stream would make the workload more realistic.

We could rectify all of this with almost no changes to the data generator or test driver by adding one or two further metrics.

So I am publishing the following as a starting point for discussion.

BSBM Analytics Mix

Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation time is between linear and n * log(n) in the data size. The TPC-H rules can be used for a power metric (single user) and a throughput metric (multiple users, each submitting queries from the mix with different parameters and in a different order). The TPC-H score formula and executive summary format are directly applicable.
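
As a sketch of how the TPC-H formulas could carry over (assuming a mix of 13 analytics queries, a scale factor SF, S concurrent query streams, and a throughput-test elapsed time T_s; the exact constants and any refresh-function terms would be fixed when the mix is finalized):

    % Hedged adaptation of the TPC-H metrics to a 13-query analytics mix;
    % t_i is the elapsed time of query i in the single-user (power) run.
    \mathrm{Power@Size}      = \frac{3600 \cdot SF}{\bigl( \prod_{i=1}^{13} t_i \bigr)^{1/13}}
    \qquad
    \mathrm{Throughput@Size} = \frac{S \cdot 13 \cdot 3600}{T_s} \cdot SF
    \qquad
    \mathrm{Composite}       = \sqrt{\mathrm{Power@Size} \cdot \mathrm{Throughput@Size}}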

This can be a separate metric from the "restricted" BSBM score, where restricted means "without the full-scan-with-regexp query", which would otherwise come to dominate the whole metric at larger scales.

Vendor-specific variations in syntax will occur; these are allowed, but the exact query text used must be disclosed with the results. Hints for join order and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible.

The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after the backup taken after the initial load is allowed, resulting in a semi-warm cache. Steady-state rules will be defined later, once the characteristics of the actual workload are known.

  1. For each country, list the top 10 product categories, ordered by the count of reviews from the country.

  2. Product with the most reviews during its first month on the market

  3. 10 products most similar to X, with similarity score based on the count of features in common

  4. Top 10 reviewers of category X (a SPARQL sketch of this query follows the list)

  5. Product with the largest increase in reviews in month X compared to month X-1

  6. Product of category X with largest change in mean price in the last month

  7. Most active American reviewer of Japanese cameras last year

  8. Correlation of price and average review score

  9. Features with greatest impact on price — for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature

  10. Country with greatest popularity of products in category X — reviews of category X from country Y divided by total reviews

  11. Leading product of category X by country, mentioning mean price in each country and number of offers, sorted by number of offers

  12. Fans of a manufacturer — find the top reviewers who rate the manufacturer's products above their own mean score

  13. Products sold only in country X
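
To make the intended complexity concrete, here is a sketch of query 4 (top 10 reviewers of category X). It assumes SPARQL 1.1-style aggregates (or an equivalent vendor extension), the published BSBM vocabulary namespaces, and a %ProductType% substitution parameter filled in by the test driver; the exact query text would be fixed together with the parameter distributions.

    PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
    PREFIX rev:  <http://purl.org/stuff/rev#>

    SELECT ?reviewer (COUNT (?review) AS ?reviewCount)
    WHERE
      {
        ?review   bsbm:reviewFor  ?product ;
                  rev:reviewer    ?reviewer .
        ?product  a               %ProductType% .   # category X, substituted by the test driver
      }
    GROUP BY ?reviewer
    ORDER BY DESC (?reviewCount)
    LIMIT 10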

BSBM IR

Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online e-commerce portal, it makes sense to extend the benchmark with some full text queries.

For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.

  • Q6 from the original mix, now allowing use of the text index (see the sketch after this list).

  • Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg.

  • The same, but additionally specifying the review author. The intent is that the structured criteria are here more selective than the text.

  • Difference in the frequency of use of "awesome", "super", and "suck(s)" by American vs. European vs. Asian review authors.
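
As an illustration of the first bullet, here is how Q6 might look once a text index is allowed. The regex form is roughly the original; the text-index form uses Virtuoso's bif:contains purely as one example of the vendor-specific syntax the rules would permit; each vendor would substitute its own predicate and disclose the query text.

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Original form: full scan of product labels with a regex filter.
    SELECT ?product ?label
    WHERE
      {
        ?product  rdfs:label  ?label .
        FILTER regex (?label, "%word1%")
      }

    # Text-index form (vendor-specific; bif:contains is Virtuoso's text predicate).
    SELECT ?product ?label
    WHERE
      {
        ?product  rdfs:label    ?label .
        ?label    bif:contains  "%word1%" .
      }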

Changes to Test Driver

For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this.

The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.

The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.

Changes to Data Generation

To support the IR mix, reviews should, in addition to random text, contain the following:

  • For each feature in the product concerned, add the label of said feature to 60% of the reviews.

  • Add the names of review author, product, product category, and manufacturer.

  • The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list, correlated with the score in 80% of uses and random in the remaining 20%. In 90% of cases, pick the adjective from a list of idiomatic expressions corresponding to the reviewer's country; in the remaining 10%, use a random list of idioms.

  • Skew the review scores so that comparatively expensive products have a smaller chance of a bad review.

Update Stream

During the benchmark run:

  • 1% of products are added;

  • 3% of initial offers are deleted and 3% are added; and

  • 5% of reviews are added.

Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.

The initial bulk load does not have to be transactional in any way.

Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in READ COMMITTED isolation, so that half-inserted products or offers are not seen.
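
As a hypothetical illustration (the IRIs and property values below are placeholders, not generator output), a vendor loader built on SPARQL 1.1 Update could apply one product transaction as a single INSERT DATA request, so that a concurrent query at READ COMMITTED sees either the whole product or none of it:

    PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # One update-stream transaction: all triples describing a new product,
    # applied atomically by the (vendor-specific) load program.
    INSERT DATA
      {
        <http://example.com/dataFromProducer12/Product100001>
            a                             bsbm:Product ;
            rdfs:label                    "placeholder product added by the update stream" ;
            bsbm:producer                 <http://example.com/dataFromProducer12/Producer12> ;
            bsbm:productFeature           <http://example.com/ProductFeature345> ;
            bsbm:productPropertyNumeric1  503 .
      }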

Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.

The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.

The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.

The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.