In this article we will look at Virtuoso vs. Impala with 100G TPC-H on two R3.8xlarge EC2 instances. We get a single-user win for Virtuoso by a factor of 136, and a five-user win by a factor of 55. The details and analysis follow.

The load setup is the same as ever: the CSV files are attached as external tables and copied into Parquet tables. We get lineitem split over 88 Parquet files, which should provide enough parallelism for the platform. The Impala documentation states that there can be up to one thread per file, and here we wish to see maximum parallelism for a single query stream. We use the schema from the Impala GitHub checkout, with STRING for string and date columns and DECIMAL for numbers. We suppose the authors know what works best.
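
As a rough sketch of what this load looks like for lineitem (the file location and DECIMAL precision are illustrative placeholders, not our exact scripts; the types follow the string/decimal convention just mentioned):

    -- CSV files from dbgen, attached as an external text table
    CREATE EXTERNAL TABLE lineitem_csv (
      l_orderkey      BIGINT,
      l_partkey       BIGINT,
      l_suppkey       BIGINT,
      l_linenumber    INT,
      l_quantity      DECIMAL(12,2),
      l_extendedprice DECIMAL(12,2),
      l_discount      DECIMAL(12,2),
      l_tax           DECIMAL(12,2),
      l_returnflag    STRING,
      l_linestatus    STRING,
      l_shipdate      STRING,
      l_commitdate    STRING,
      l_receiptdate   STRING,
      l_shipinstruct  STRING,
      l_shipmode      STRING,
      l_comment       STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION '/tpch/lineitem_csv';

    -- copy into a Parquet table; Impala splits the result over many Parquet files
    CREATE TABLE lineitem STORED AS PARQUET AS SELECT * FROM lineitem_csv;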

The execution behavior is surprising. Sometimes we get full platform utilization, but quite often only 200% CPU per box. The query plan for Q1, for example, says 2 cores per box. This makes no sense, as the plan knows the table cardinality perfectly well. The settings for scanner threads and cores to use can be changed in impala-shell, but changing them does not seem to affect the behavior.
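
For reference, the kind of adjustment we tried looks like this in impala-shell; the value 32 is arbitrary, and NUM_SCANNER_THREADS is the documented query option for capping per-node scan threads (0 means Impala decides):

    -- query options set in impala-shell apply to the current session
    SET NUM_SCANNER_THREADS=32;    -- 0 = let Impala decide; nonzero caps scan threads per node
    -- re-run a scan-heavy query and watch CPU utilization in top
    SELECT COUNT(*) FROM lineitem;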

Following are the run times for one query stream.

Query    Virtuoso     Impala       Notes
          332     s    841     s   Data Load
Q1          1.098 s    164.61  s
Q2          0.187 s     24.19  s
Q3          0.761 s    105.70  s
Q4          0.205 s    179.67  s
Q5          0.808 s     84.51  s
Q6          2.403 s      4.43  s
Q7          0.59  s    270.88  s
Q8          0.775 s     51.89  s
Q9          1.836 s    177.72  s
Q10         3.165 s     39.85  s
Q11         1.37  s     22.56  s
Q12         0.356 s     17.03  s
Q13         2.233 s    103.67  s
Q14         0.488 s     10.86  s
Q15         0.72  s     11.49  s
Q16         0.814 s     23.93  s
Q17         0.681 s    276.06  s
Q18         1.324 s    267.13  s
Q19         0.417 s    368.80  s
Q20         0.792 s     60.45  s
Q21         0.720 s    418.09  s
Q22         0.155 s     40.59  s
Total      20     s   2724     s

Because the platform utilization was often low, we made a second experiment running the same queries in five parallel sessions. As per the TPC-H spec, the query stream used in the first test is permuted into 5 different orders, one per session. We show the average execution time for each query, compared with the average times from the Virtuoso throughput run. The results are not entirely comparable, because Virtuoso also executes the refresh functions in parallel with the query streams; according to the Impala documentation, there is no random delete operation, so the refreshes cannot be implemented there.

Just to establish a baseline, we do SELECT COUNT(*) FROM lineitem. This takes 20s when run by itself. When run in five parallel sessions, the fastest terminates in 64s and the slowest in 69s. Looking at top, the platform utilization is indeed about 5x higher in CPU%, but the concurrency adds little throughput: five counts finish in 69s where running them one after the other would take about 100s. This is odd, considering that there is no synchronization requirement worth mentioning between the operations.

Following are the average times for each query in the 5 stream experiment.

Query                      Virtuoso    Impala      Notes
Q1                           1.95 s    191.81 s
Q2                           0.70 s     40.40 s
Q3                           2.01 s     95.67 s
Q4                           0.71 s    345.11 s
Q5                           2.93 s    112.29 s
Q6                           4.76 s     14.41 s
Q7                           2.08 s    329.25 s
Q8                           3.00 s     98.91 s
Q9                           5.58 s    250.88 s
Q10                          8.23 s     55.23 s
Q11                          4.26 s     27.84 s
Q12                          1.74 s     37.66 s
Q13                          6.07 s    147.69 s
Q14                          1.73 s     23.91 s
Q15                          2.27 s     23.79 s
Q16                          2.41 s     34.76 s
Q17                          3.92 s    362.43 s
Q18                          3.02 s    348.08 s
Q19                          2.27 s    443.94 s
Q20                          3.05 s     92.50 s
Q21                          2.00 s    623.69 s
Q22                          0.37 s     61.36 s
Total for Slowest Stream    67    s   3740    s

There are 4 query executions in Impala that terminated with an error (memory limit exceeded): two of Q21, one of Q19, and one of Q4. Only one of the five streams executed without errors, and that is the stream reported above as the slowest. Q21 will, in the absence of indexed access, build a hash table over roughly half of lineitem, which explains running out of memory. Virtuoso does Q21 mostly by index.
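
For reference, this is the shape of Q21 as given in the TPC-H specification ('SAUDI ARABIA' stands in for the nation substitution parameter; the Impala test kit may phrase the subqueries as explicit joins instead). Both the EXISTS and the NOT EXISTS range over lineitem by l_orderkey, which is where the large hash build side comes from when there is no index:

    SELECT s_name, COUNT(*) AS numwait
    FROM supplier, lineitem l1, orders, nation
    WHERE s_suppkey = l1.l_suppkey
      AND o_orderkey = l1.l_orderkey
      AND o_orderstatus = 'F'
      AND l1.l_receiptdate > l1.l_commitdate
      AND EXISTS (SELECT * FROM lineitem l2
                  WHERE l2.l_orderkey = l1.l_orderkey
                    AND l2.l_suppkey <> l1.l_suppkey)
      AND NOT EXISTS (SELECT * FROM lineitem l3
                      WHERE l3.l_orderkey = l1.l_orderkey
                        AND l3.l_suppkey <> l1.l_suppkey
                        AND l3.l_receiptdate > l3.l_commitdate)
      AND s_nationkey = n_nationkey
      AND n_name = 'SAUDI ARABIA'
    GROUP BY s_name
    ORDER BY numwait DESC, s_name
    LIMIT 100;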

Looking at the 5 streams, we see CPU between 1000% and 2000% on either box. This looks about 5x more than the 250% per box that we were seeing with, for instance, Q1. The process sizes for impalad are over 160G, certainly enough to have the working set in memory. iostat also does not show any I/O, so we seem to be running from memory, as intended.

We observe that Impala does not store tables in any specific order. Therefore a merge join of orders and lineitem is not possible. Thus we always get a hash join with a potentially large build side, e.g., half of orders and half of lineitem in Q21, and all orders in Q9. This explains in part why these take so long. TPC-DS does not pose this particular problem though, as there are no tables in the DS schema where the primary key of one would be the prefix of that of another.
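
Q9 illustrates the build-side point: nothing restricts orders, which is joined only to get the order year, so a hash-join-only plan must build over the whole table. The sketch below follows the spec text, with '%green%' standing in for the color substitution parameter, and with the year taken as a substring because dates are strings in this schema:

    SELECT nation, o_year, SUM(amount) AS sum_profit
    FROM (SELECT n_name AS nation,
                 SUBSTR(o_orderdate, 1, 4) AS o_year,   -- spec: EXTRACT(YEAR FROM o_orderdate)
                 l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
          FROM part, supplier, lineitem, partsupp, orders, nation
          WHERE s_suppkey = l_suppkey
            AND ps_suppkey = l_suppkey
            AND ps_partkey = l_partkey
            AND p_partkey = l_partkey
            AND o_orderkey = l_orderkey
            AND s_nationkey = n_nationkey
            AND p_name LIKE '%green%') AS profit
    GROUP BY nation, o_year
    ORDER BY nation, o_year DESC;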

However, the lineitem/orders join does not explain the scores on Q1, Q20, or Q19. A simple hash join of lineitem and part was about 90s, with a replicated part hash table. In the profile, the hash probe was 74s, which seems excessive. One would have to single-step through the hash probe to find out what actually happens. Maybe there are prohibitive numbers of collisions, which would throw off the results across the board. We would have to ask the Impala community about this.
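
The test in question was on the order of the following (a sketch; the exact projection we timed may have differed):

    -- part's hash table is built once and replicated to each node; lineitem is the probe side
    SELECT COUNT(*)
    FROM lineitem, part
    WHERE l_partkey = p_partkey;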

Anyway, Impala experts out there are invited to set the record straight. We have attached the results and the output of the Impala profile statement for each query: impala_stream0.zip contains the evidence for the single-stream run; impala-stream1-5.zip holds the 5-stream run.

To be more Big Data-like, we should probably run with significantly more data than memory; for example, 3T of data in 0.5T of RAM. At EC2, we could do this with 2 I2.8xlarge instances (6.4T SSD each). With Virtuoso, we'd be done in 8 hours or so, counting 2x for the I/O and 30x for the greater scale (the 100G experiment goes in 8 minutes or so, all included). With Impala, we could be running for weeks, so at the very least we'd want to do this with an Impala expert, to make sure things are done right and will not have to be retried. Some of the hash joins would have to be done in multiple passes and with partitioning.

In subsequent articles, we will look at other players in this space, and possibly some other benchmarks, like the TPC-DS subset that Actian uses to beat Impala.