<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>

<title>In Hoc Signo Vinces (part 1 of n) -- Virtuoso meets TPC-H</title><link>http://www.openlinksw.com:443/weblog/oerling/?id=1739</link><description>
To date, TPC-H is the definitive data warehousing benchmark.  Here I will cover the Virtuoso implementation of it in detail.  The primary audience is database experts, but this should also be educational for DBAs and advanced application developers: life becomes much more predictable if one can tell a good query plan from a bad one.  Alongside a commentary on database science, you will also find here a guided tour of Virtuoso performance tuning and diagnostics.  To follow along, it is useful to have the official TPC-H spec at hand (download links are on the far right of this page).

By now, TPC-H is an old game and it is safe to say that pretty much any player in the analytics database domain has had a go at it, even though some have never published a result.  So, the bar for new entrants is very high.

VectorWise and EXASolution in particular have taken performance on this workload close to the limits of the achievable.  A challenger has to do everything right in order to win.  One wrong move will lose the whole race.

This presentation has several objectives:

  To show that Virtuoso is an excellent SQL analytics engine

  To provide an in-depth discussion of the science of query optimization and execution

  To outline avenues of future development, specifically concerning analytics with schema-less data


At the TPC TC workshop at VLDB 2013, there was a paper, "TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark," by Peter Boncz, Thomas Neumann, and myself, on what the database world has learned from this very tough exercise.  Peter Boncz is the original architect of Actian VectorWise, the current champion in TPC-H performance per core.  Thomas Neumann is the author of HyPer, most likely the best entry in DBMS research for simultaneously supporting analytics and OLTP.  Peter and Thomas are among the most renowned figures in database science.  I am the Program Manager of the Virtuoso column store, overseeing core engineering tasks such as SQL query optimization, execution, storage, and scale-out.

In this series, I will go over the Virtuoso implementation of TPC-H and elaborate further on the points discussed in the paper.  The subject is broader than any single paper can cover in detail, although there are plenty of papers that address only one or two of the 22 queries.

Virtuoso is mostly known for RDF.  Here we will cover the whole benchmark in SQL first, with both single-server and cluster implementations, and a discussion of where these differ.  A state-of-the-art SQL implementation is the necessary basis for discussing how the same can be accomplished in RDF.  Comparing good RDF to bad SQL is not interesting.

The earlier articles on the Star Schema Benchmark (SSB) (PDF) -- Annuit Coeptis, or, Star Schema and The Cost of Freedom and E Pluribus Unum, or, Star Schema Meets Cluster -- demonstrated how the most basic analytical database operations perform in Virtuoso.  All the techniques used there apply directly to TPC-H, but TPC-H adds a good 20 more tricks one needs to see through.

Future installments will discuss TPC-H query by query.  We will conclude with a full run of OSDL-DBT-3.  DBT-3™ is an unofficial TPC-H, with the same workload but without auditing.



In Hoc Signo Vinces Series

In Hoc Signo Vinces (part 1): Virtuoso meets TPC-H (this post)
In Hoc Signo Vinces (part 2): TPC-H Schema Choices
In Hoc Signo Vinces (part 3): Benchmark Configuration Settings
In Hoc Signo Vinces (part 4): Bulk Load and Refresh
In Hoc Signo Vinces (part 5): The Return of SQL Federation
In Hoc Signo Vinces (part 6): TPC-H Q1 and Q3: An Introduction to Query Plans
In Hoc Signo Vinces (part 7): TPC-H Q13: The Good and the Bad Plans
In Hoc Signo Vinces (part 8): TPC-H: INs, Expressions, ORs
In Hoc Signo Vinces (part 9): TPC-H: TPC-H Q18, Ordered Aggregation, and Top K
In Hoc Signo Vinces (part 10): TPC-H: TPC-H Q9, Q17, Q20 - Predicate Games
In Hoc Signo Vinces (part 11): TPC-H Q2, Q10 - Late Projection
In Hoc Signo Vinces (part 12): TPC-H: Result Preview
In Hoc Signo Vinces (part 13): Virtuoso TPC-H Kit Now on V7 Fast Track
In Hoc Signo Vinces (part 14): Virtuoso TPC-H Implementation Analysis
In Hoc Signo Vinces (part 15): TPC-H and the Science of Hash
In Hoc Signo Vinces (part 16): Introduction to Scale-Out
In Hoc Signo Vinces (part 17): 100G and 300G Runs on Dual Xeon E5 2650v2
In Hoc Signo Vinces (part 18): Cluster Dynamics
In Hoc Signo Vinces (part 19): Scalability, 1000G, and 3000G
In Hoc Signo Vinces (part 20): 100G and 1000G With Cluster; When is Cluster Worthwhile; Effects of I/O
In Hoc Signo Vinces (part 21): Running TPC-H on Virtuoso Cluster on Amazon EC2

</description><pubDate>Wed, 13 Nov 2013 22:56:22 GMT</pubDate><generator>Virtuoso Universal Server 08.03.3334</generator><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Orri Erling</dc:creator><image><title>In Hoc Signo Vinces (part 1 of n) -- Virtuoso meets TPC-H</title><url>http://www.openlinksw.com:443/weblog/public/images/vbloglogo.gif</url><link>http://www.openlinksw.com:443/weblog/oerling/?id=1739</link><description /><width>88</width><height>31</height></image>

</channel>
</rss>
