SNB Interactive, Part 3: Choke Points and Initial Run on Virtuoso
Orri Erling, 2015-06-09

    In this post, we will look at running the LDBC SNB on Virtuoso.

    First, let's recap what the benchmark is about:

    1. fairly frequent short updates, with no update contention worth mentioning
    2. short random lookups
    3. medium complex queries centered around a person's social environment

    The updates exist so as to invalidate strategies that rely too heavily on precomputation. The short lookups exist for the sake of realism; after all, an online social application does lookups for the most part. The medium complex queries are to challenge the DBMS.

    The DBMS challenges have to do firstly with query optimization, and secondly with execution with a lot of non-local random access patterns. Query optimization is not a requirement, per se, since imperative implementations are allowed, but we will see that these are no more free of the laws of nature than the declarative ones.

    The workload is arbitrarily parallel, so intra-query parallelization is not particularly useful, though not harmful either. There are latency constraints on operations which strongly encourage implementations to stay within a predictable time envelope regardless of specific query parameters. The parameters are a combination of person and date range, and sometimes tags or countries. The hardest queries have the potential to access all content created by people within 2 steps of a central person, so possibly thousands of people, times 2,000 posts per person, times up to 4 tags per post. We are talking in the millions of key lookups, aiming for sub-second single-threaded execution.

    The test system is the same as used in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The software is the feature/analytics branch of v7fasttrack, available from www.github.com.

    The dataset is the SNB 300G set, with:

        1,136,127  persons
      125,249,604  knows edges
      847,886,644  posts, including replies
    1,145,893,841  tags of posts or replies
    1,140,226,235  likes of posts or replies

    As an initial step, we run the benchmark as fast as it will go. We use 32 threads on the driver side for 24 hardware threads.

    Below are the numerical quantities for a 400K operation run after 150K operations worth of warmup.

    Duration:   10:41.251 (min:sec.ms)
    Throughput: 623.71 operations per second

    The statistics that matter are detailed below, with operations ranked in order of descending client-side wait-time. All times are in milliseconds.

    % of total    total_wait   name                             count     mean        min     max
    20%            4,231,130   LdbcQuery5                         656   6,449.89      245  10,311
    11%            2,272,954   LdbcQuery8                      18,354     123.84       14   2,240
    10%            2,200,718   LdbcQuery3                         388   5,671.95      468  17,368
    7.3%           1,561,382   LdbcQuery14                      1,124   1,389.13        4   5,724
    6.7%           1,441,575   LdbcQuery12                      1,252   1,151.42       15   3,273
    6.5%           1,396,932   LdbcQuery10                      1,252   1,115.76       13   4,743
    5%             1,064,457   LdbcShortQuery3PersonFriends    46,285      22.9979      0   2,287
    4.9%           1,047,536   LdbcShortQuery2PersonPosts      46,285      22.6323      0   2,156
    4.1%             885,102   LdbcQuery6                       1,721     514.295       8   5,227
    3.3%             707,901   LdbcQuery1                       2,117     334.389      28   3,467
    2.4%             521,738   LdbcQuery4                       1,530     341.005      49   2,774
    2.1%             440,197   LdbcShortQuery4MessageContent   46,302       9.50708     0   2,015
    1.9%             407,450   LdbcUpdate5AddForumMembership   14,338      28.4175      0   2,008
    1.9%             405,243   LdbcShortQuery7MessageReplies   46,302       8.75217     0   2,112
    1.9%             404,002   LdbcShortQuery6MessageForum     46,302       8.72537     0   1,968
    1.8%             387,044   LdbcUpdate3AddCommentLike       12,659      30.5746      0   2,060
    1.7%             361,290   LdbcShortQuery1PersonProfile    46,285       7.80577     0   2,015
    1.6%             334,409   LdbcShortQuery5MessageCreator   46,302       7.22234     0   2,055
    1%               220,740   LdbcQuery2                       1,488     148.347       2   2,504
    0.96%            205,910   LdbcQuery7                       1,721     119.646      11   2,295
    0.93%            198,971   LdbcUpdate2AddPostLike           5,974      33.3062      0   1,987
    0.88%            189,871   LdbcQuery11                      2,294      82.7685      4   2,219
    0.85%            182,964   LdbcQuery13                      2,898      63.1346      1   2,201
    0.74%            158,188   LdbcQuery9                          78   2,028.05    1,108   4,183
    0.67%            143,457   LdbcUpdate7AddComment            3,986      35.9902      1   1,912
    0.26%             54,947   LdbcUpdate8AddFriendship           571      96.2294      1     988
    0.2%              43,451   LdbcUpdate6AddPost               1,386      31.3499      1   2,060
    0.0086%            1,848   LdbcUpdate4AddForum                103      17.9417      1      65
    0.0002%               44   LdbcUpdate1AddPerson                 2      22          10      34

    At this point we have in-depth knowledge of the choke points the benchmark stresses, and we can give a first assessment of whether the design meets its objectives for setting an agenda for the coming years of graph database development.

    The implementation is well optimized in general but still has maybe 30% room for improvement. We note that this is based on a compressed column store. One could think that alternative data representations, like in-memory graphs of structs and pointers between them, are better for the task. This is not necessarily so; at the least, a compressed column store is much more space efficient. Space efficiency is the root of cost efficiency, since as soon as the working set is not in memory, a random access workload is badly hit.

    The set of choke points (technical challenges) that the benchmark has revealed so far is as follows:

    • Cardinality estimation under heavy data skew — Many queries take a tag or a country as a parameter. The cardinalities associated with tags vary from 29M posts for the most common to 1 for the least common. Q6 has a common tag (in the top few hundred) half the time and a random, most often very infrequent, one the rest of the time. A declarative implementation must recognize the cardinality implications from the literal and plan accordingly; an imperative one would have to count. Missing this makes Q6 take about 40% of the total time instead of the 4.1% reported above when the plan adapts. A sketch of this kind of cardinality-driven plan choice is given after this list.

    • Covering indices — Being able to make multi-column indices that duplicate some columns from the table often saves an entire table lookup. For example, an index on post by author can also contain the post's creation date.

    • Multi-hop graph traversal — Most queries access a two-hop environment starting at a person. Two queries look for shortest paths of unbounded length. For the two-hop case, it makes almost no difference whether this is done as a union or as a special graph traversal operator. For shortest paths, this simply must be built into the engine; doing it client-side incurs prohibitive overheads. A bidirectional shortest-path operation is a requirement for the benchmark; a sketch of the bidirectional search is given after this list.

    • Top K — Most queries returning posts order results by descending date. Once there are at least k results, anything older than the k-th can be dropped, adding a date selection as early as possible in the query. This interacts with vectored execution: starting with a short vector size produces an initial top k more rapidly. A sketch of the cutoff is given after this list.

    • Late projection — Many queries access several columns and touch millions of rows but only return a few. The columns that are not used in sorting or selection need to be retrieved only for the rows that are actually returned. This is especially useful with a column store, as it removes many large columns (e.g., the text of a post) from the working set. A sketch of the two-phase access pattern is given after this list.

    • Materialization — Q14 accesses an expensive-to-compute edge weight, the number of post-reply pairs between two people. Keeping this precomputed drops Q14 from the top place. Other materialization would be possible, for example Q2 (top 20 posts by friends), but since Q2 is just 1% of the load, there is no need. One could of course argue that this should be 20x more frequent, in which case there could be a point to this.

    • Concurrency control — Read-write contention is rare, as updates are randomly spread over the database. However, some pages get read very frequently, e.g., some middle-level index pages in the post table. Keeping a count of reading threads requires a mutex, and there is significant contention on this. Since the hot set can be a single page, adding more mutexes does not always help. However, hash partitioning the index into many independent trees (as in the case of a cluster) does help here. There is also contention on a mutex for assigning threads to client requests, as there are large numbers of short operations.
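
    To make the cardinality point concrete, here is a minimal Python sketch of the kind of decision a cardinality-aware plan (or a careful imperative implementation) has to make for a Q6-like query. All names, numbers, and tags below are illustrative placeholders, not Virtuoso internals or SNB data.

        # Hypothetical illustration of cardinality-driven plan choice for a Q6-like query:
        # pick the more selective side (the tag, or the friends-of-friends' posts) as the
        # driving index. All identifiers here are placeholders, not Virtuoso internals.

        def plan_q6(tag, person, tag_post_counts, estimated_fof_posts=4_000_000):
            """Choose a join order based on the estimated cardinality of the tag literal."""
            tag_posts = tag_post_counts.get(tag, 1)  # skewed: 29M for the hottest tag, 1 for the rarest
            if tag_posts < estimated_fof_posts:
                # Rare tag: scan posts by tag first, then filter by "author within 2 hops of person".
                return ("drive_by_tag", tag, person)
            else:
                # Hot tag: enumerate friends-of-friends' posts first, then probe the tag index.
                return ("drive_by_friends", person, tag)

        # Example: a top-few-hundred tag and a one-off tag lead to opposite join orders.
        counts = {"Hot_Tag": 29_000_000, "Obscure_Tag": 12}
        print(plan_q6("Hot_Tag", 933, counts))      # ('drive_by_friends', 933, 'Hot_Tag')
        print(plan_q6("Obscure_Tag", 933, counts))  # ('drive_by_tag', 'Obscure_Tag', 933)

    A declarative engine makes this choice from index statistics at optimization time; an imperative implementation would have to look up the tag's count itself before committing to a loop order.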
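
    For the graph-traversal point, the following is a generic bidirectional breadth-first search sketch in Python over an in-memory adjacency dict. It shows why the bidirectional form is cheap (only the smaller frontier is expanded at each step); it is not the engine's actual operator, which would run over the knows index in both directions rather than over a dict.

        from collections import deque

        def bidirectional_shortest_path_length(adj, src, dst):
            """Length of the shortest path between src and dst in an undirected graph
            given as an adjacency dict, or -1 if unreachable. Expands the smaller
            frontier at each step, which is what makes the bidirectional form cheap."""
            if src == dst:
                return 0
            dist_src, dist_dst = {src: 0}, {dst: 0}
            frontier_src, frontier_dst = {src}, {dst}
            while frontier_src and frontier_dst:
                # Always grow the smaller side; swap roles if needed.
                if len(frontier_src) > len(frontier_dst):
                    frontier_src, frontier_dst = frontier_dst, frontier_src
                    dist_src, dist_dst = dist_dst, dist_src
                next_frontier = set()
                for node in frontier_src:
                    for nbr in adj.get(node, ()):
                        if nbr in dist_dst:                    # the two frontiers meet
                            return dist_src[node] + 1 + dist_dst[nbr]
                        if nbr not in dist_src:
                            dist_src[nbr] = dist_src[node] + 1
                            next_frontier.add(nbr)
                frontier_src = next_frontier
            return -1

        # Tiny example graph: 1-2-3-4 plus the shortcut 1-5-4
        adj = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [1, 4]}
        print(bidirectional_shortest_path_length(adj, 1, 4))  # 2 (via 5)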
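
    The top-k cutoff can be shown with a small Python sketch: keep a size-k min-heap keyed on date and, once it is full, drop anything older than the current k-th date before doing any further work for that row. In a real engine the cutoff date would be pushed down into the index scan; this is only the idea in miniature.

        import heapq

        def top_k_newest(posts, k):
            """Return the k newest posts from an iterable of (creation_date, post_id).
            Once k results are in hand, any post older than the current k-th newest
            date is dropped immediately."""
            heap = []  # min-heap keyed on date, so heap[0] is the oldest of the current top k
            for date, post_id in posts:
                if len(heap) < k:
                    heapq.heappush(heap, (date, post_id))
                elif date > heap[0][0]:
                    # Newer than the current k-th newest: replace the oldest entry.
                    heapq.heapreplace(heap, (date, post_id))
                # else: older than the cutoff, skip without any further column access
            return sorted(heap, reverse=True)

        # Example, with dates as integers for brevity
        posts = [(20120101, 1), (20150101, 2), (20130101, 3), (20140101, 4), (20110101, 5)]
        print(top_k_newest(posts, 2))  # [(20150101, 2), (20140101, 4)]

    The interaction with vectored execution mentioned above corresponds to feeding this loop short batches first, so that the cutoff date becomes tight early.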
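
    Late projection, reduced to a toy Python example with hypothetical column dicts: the narrow author and date columns are scanned for all candidate rows, while the wide content column is read only for the rows that actually make it into the result.

        def newest_posts_late_projection(post_dates, post_content, author_ids, authors, k=20):
            """Hypothetical late projection: post_dates, post_content, and author_ids stand in
            for per-column storage (dicts keyed by post id). The wide 'content' column is only
            read for the k rows that survive the sort, not for every candidate row."""
            # Phase 1: touch only the narrow columns (author, date) for all candidates.
            candidate_ids = [pid for pid, author in author_ids.items() if author in authors]
            candidate_ids.sort(key=lambda pid: post_dates[pid], reverse=True)
            winners = candidate_ids[:k]
            # Phase 2: fetch the wide column only for the final result rows.
            return [(pid, post_dates[pid], post_content[pid]) for pid in winners]

        # Tiny example
        author_ids   = {1: 10, 2: 11, 3: 10, 4: 12}
        post_dates   = {1: 20140101, 2: 20150101, 3: 20130101, 4: 20120101}
        post_content = {1: "long text ...", 2: "even longer text ...", 3: "...", 4: "..."}
        print(newest_posts_late_projection(post_dates, post_content, author_ids, {10, 11}, k=2))
        # [(2, 20150101, 'even longer text ...'), (1, 20140101, 'long text ...')]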

    In subsequent posts, we will look at specific queries, what they in fact do, and what their theoretical performance limits would be. In this way we will have a precise understanding of which way SNB can steer the graph DB community.

    SNB Interactive Series
