Big Data, Part 1: Virtuoso Meets Hive

Details

In this series, we will look at Virtuoso and some of the big data technologies out there. SQL on Hadoop is of interest, as well as NoSQL technologies.

We begin at the beginning, with Hive, the grand-daddy of SQL on Hadoop.

The test platform is two Amazon R3.8 AMI instances. We compared Hive with the Virtuoso 100G TPC-H experiment on the same platform, published earlier on this blog. The runs follow a bulk load in both cases, with all data served from memory. The platform has 2x244GB RAM with only 40GB or so of working set.

The Virtuoso version and settings are as in the Virtuoso Cluster test AMI.

The Hive version is 0.14 from the Hortonworks HDP 2.2 distribution>. The Hive schema and query formulations are the ones from hive-testbench on GitHub. The Hive configuration parameters are as set by Ambari 2.0.1. These are different from the ones in hive-testbench, but the Ambari choices offer higher performance on the platform. We did run statistics with Hive and did not specify any settings not in the hive-testbench. Thus we suppose the query plans were as good as Hive will make them. Platform utilization was even across both machines, and varied between 30% and 100% of the 2 x 32 hardware threads.

Load time with Hive was 742 seconds against 232 seconds with Virtuoso. In both cases, this was a copy from 32 CSV files into native database format; for Hive, this is ORC (Optimized Row Columnar). In Virtuoso, there is one index, (o_custkey); in Hive, there are no indices.

Query	Virtuoso	Hive	Notes
—	`332 s`	`742 s`	Data Load
Q1	`1.098 s`	`296.636 s`
Q2	`0.187 s`	`>3600 s`	Hive Timeout
Q3	`0.761 s`	`98.652 s`
Q4	`0.205 s`	`147.867 s`
Q5	`0.808 s`	`114.782 s`
Q6	`2.403 s`	`71.789 s`
Q7	`0.59 s`	`394.201 s`
Q8	`0.775 s`	`>3600 s`	Hive Timeout
Q9	`1.836 s`	`>3600 s`	Hive Timeout
Q10	`3.165 s`	`179.646 s`
Q11	`1.37 s`	`43.094 s`
Q12	`0.356 s`	`101.193 s`
Q13	`2.233 s`	`208.476 s`
Q14	`0.488 s`	`89.047 s`
Q15	`0.72 s`	`136.431 s`
Q16	`0.814 s`	`105.652 s`
Q17	`0.681 s`	`255.848 s`
Q18	`1.324 s`	`337.921 s`
Q19	`0.417 s`	`>3600 s`	Hive Timeout
Q20	`0.792 s`	`193.965 s`
Q21	`0.720 s`	`670.718 s`
Q22	`0.155 s`	`68.462 s`

Hive does relatively best on bulk load. This is understandable since this is a sequential read of many files in parallel with just compression to do.

Hive's query times are obviously affected by not having a persistent memory image of the data, as this is always streamed from the storage files into other files as MapReduce intermediate results. This seems to be an operator-at-a-time business as opposed to Virtuoso's vectorized streaming.

The queries that would do partitioned hash joins (e.g., Q9) did not finish under an hour in Hive, so we do not have a good metric of a cross-partition hash join.

One could argue that one should benchmark Hive only in disk-bound circumstances. We may yet get to this.

Our next stop will probably be Impala, which ought to do much better than Hive, as it dose not have the MapReduce overheads.

If you are a Hive expert and believe that Hive should have done much better, please let us know how to improve the Hive scores, and we will retry.

Orri Erling's Weblog

Details

Subscribe

Tag Cloud

Post Categories

Recent Articles

Comments

Post Comment

Orri Erling's Weblog

Details

Subscribe

Tag Cloud

Post Categories

Recent Articles

Related

Comments

Post Comment