This is the first in a series of blog posts analyzing the Interactive workload of the LDBC Social Network Benchmark. It is written from the dual perspective of participating in the benchmark design and of building the OpenLink Virtuoso implementation of it.

With two implementations of SNB Interactive at four different scales, we can take a first look at what the benchmark is really about. The hallmark of a benchmark implementation is that its performance characteristics are understood: even if these do not represent the maximum attainable, there are no glaring mistakes, and the implementation represents a reasonable best effort by those who ought to know, namely the system vendors.

The essence of a benchmark is a set of trick questions or "choke points," as LDBC calls them. A number of these were planned from the start. It is then the role of experience to tell whether addressing these is really the key to winning the race. Unforeseen ones will also surface.

So far, we see that SNB confronts the implementor with choices in the following areas:

  • Data model — Tabular relational (commonly known as SQL), graph relational (including RDF), property graph, etc.

  • Physical storage model — Row-wise vs. column-wise, for instance.

  • Ordering of materialized data — Sorted projections, composite keys, replicating columns in auxiliary data structures, etc.

  • Persistence of intermediate results — Materialized views, triggers, precomputed temporary tables, etc.

  • Query optimization — Join order and join type, interesting physical data orderings, late projection, top-k, etc. (late projection with top-k is sketched right after this list).

  • Parameters vs. literals — Sometimes different parameter values result in different optimal query plans.

  • Predictable, uniform latency — Measurement rules stipulate that the SUT (system under test) must not fall behind the simulated workload.

  • Durability — How to make data durable while maintaining steady throughput, e.g., logging, checkpointing, etc.
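
To make "late projection" and top-k concrete, here is a minimal, engine-agnostic sketch in Python; the data, key, and payload are invented for illustration. Instead of carrying wide columns through the whole sort, the plan sorts a narrow (key, id) projection, keeps the top k ids, and only then fetches the wide columns for those k rows. In SQL terms this corresponds to an ORDER BY ... LIMIT k subquery over the narrow columns, joined back to the table for the rest.

```python
import heapq

# Invented example data: row id -> (sort_key, wide_payload).  In a real engine
# the payload would be many columns, e.g. a post's full content.
rows = {i: ((i * 7919) % 1000, "wide payload %d" % i) for i in range(100000)}

def topk_early_projection(k):
    """Naive plan: the wide payload is dragged through the whole sort."""
    full = sorted((key, rid, payload) for rid, (key, payload) in rows.items())
    return [(rid, payload) for _key, rid, payload in full[:k]]

def topk_late_projection(k):
    """Late projection: sort only (key, id); fetch payloads for the k winners."""
    narrow = ((key, rid) for rid, (key, _payload) in rows.items())
    winners = heapq.nsmallest(k, narrow)   # top-k over the narrow projection
    return [(rid, rows[rid][1]) for _key, rid in winners]

assert topk_early_projection(10) == topk_late_projection(10)
```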

In the process of making a benchmark implementation, one naturally encounters questions about the validity, reasonability, and rationale of the benchmark definition itself. Additionally, even though the benchmark might not directly measure certain aspects of a system, making an implementation will take a system past its usual envelope and highlight some operational aspects.

  • Data generation — Generating a mid-size dataset takes time, e.g., 8 hours for 300 GB. In a cloud setting, keeping the generated dataset in S3 or similar is necessary; re-generating it every time is not an option.

  • Query mix — Are the relative frequencies of the operations reasonable? What bias does this introduce?

  • Uniformity of parameters — Due to non-uniform data distributions in the dataset, there is easily a 100x difference between "fast" and "slow" cases of a single query template. How long does one need to run to even out these fluctuations? (A way to quantify the spread is sketched after this list.)

  • Working set — Experience shows that there is a large difference between an almost-warm working set and a steady-state one; this can be a factor of 1.5 in throughput.

  • Reasonability of latency constraints — In the present case, a qualifying run must have no more than 5% of all query executions starting over 1 second late (a compliance check is sketched after this list). Each execution is scheduled beforehand and issued at the intended time. If the SUT does not keep up, it will have all available threads busy and must finish some work before accepting new work, so some queries will start late. Is this a good criterion for measuring consistency of response time? There are some obvious possibilities for abuse.

  • Ease of benchmark implementation/execution — Perfection is open-ended and optimization possibilities infinite, albeit with diminishing returns. Still, getting started should not be too hard. Since systems will be highly diverse, testing that these in fact do the same thing is important. The SNB validation suite is good for this and, given publicly available reference implementations, the effort of getting started is not unreasonable.

  • Ease of adjustment — Since a qualifying run must meet latency constraints while going as fast as possible, setting the performance target involves trial and error. Does the tooling make this easy?

  • Reasonability of durability rule — Right now, one is not required to take checkpoints but must report the time to roll forward from the last checkpoint or from the initial state. Inspiring vendors to build faster recovery is certainly good, but we have not yet worked through all the implications. What about redundant clusters?
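
As a rough way to quantify the parameter skew mentioned above, one can group per-execution latencies by query template and parameter value and compare the extremes. A minimal sketch, assuming the run output has already been parsed into (template, parameter, latency_ms) tuples; that format and the names here are assumptions for illustration, not part of the SNB tooling.

```python
from collections import defaultdict
from statistics import median

def latency_spread(records):
    """records: iterable of (template, parameter, latency_ms) tuples.

    Returns, per query template, the ratio between the slowest and the fastest
    median latency across the parameter values seen for that template.
    """
    by_param = defaultdict(list)
    for template, parameter, latency_ms in records:
        by_param[(template, parameter)].append(latency_ms)

    per_template = defaultdict(list)
    for (template, _parameter), latencies in by_param.items():
        per_template[template].append(median(latencies))

    return {t: max(m) / min(m) for t, m in per_template.items() if min(m) > 0}

# A ratio near 100 for a template confirms the skew discussed above and suggests
# that a run must be long enough to sample many parameter values per template.
```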

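The lateness rule itself is easy to check after the fact: compare each operation's scheduled start time with its actual start time and count how many start more than a second late. A minimal sketch, assuming the two per-operation timestamp lists (in seconds, in the same order) have already been extracted from the driver's output; this is a plausible reconstruction of the criterion, not the official driver's accounting.

```python
def late_start_fraction(scheduled, actual, threshold_s=1.0):
    """scheduled, actual: per-operation start times in seconds, same order.

    Returns the fraction of operations that started more than `threshold_s`
    after their scheduled time; a qualifying run needs this to stay <= 0.05.
    """
    if not scheduled:
        return 0.0
    late = sum(1 for s, a in zip(scheduled, actual) if a - s > threshold_s)
    return late / len(scheduled)

# Example: three operations, one of them 1.4 s late.
print(late_start_fraction([0.0, 0.5, 1.0], [0.1, 1.9, 1.2]))  # 0.333...
```
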
The following posts will look at the above in light of actual experience.
