On the evening of day 8, we have kernel settings in the cluster changed to allow more mmaps. At this point, we notice that the dataset is missing the implied types of products; i.e., the most specific type is given but its superclasses are not directly associated with the product. We have always run this with this inference materialized, which is also how the data generator makes the data when given the right switch. But the switch was not used. So a further 10 Gt (Giga-triples) are added, by running a SQL script to make the superclasses explicit.
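To make the idea of materializing implied types concrete, here is a minimal sketch in Python. It is not the SQL script that was actually run; the type hierarchy, the names `PARENT`, `superclasses`, and `materialize_types`, and the sample product are all hypothetical and serve only to illustrate the principle.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy hierarchy: each type maps to its direct superclass
# (None at the root). The real BSBM hierarchy is much larger.
PARENT: Dict[str, Optional[str]] = {
    "ProductType3": "ProductType2",
    "ProductType2": "ProductType1",
    "ProductType1": None,
}


def superclasses(t: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return all ancestors of type t, nearest first."""
    out: List[str] = []
    cur = parent.get(t)
    while cur is not None:
        out.append(cur)
        cur = parent.get(cur)
    return out


def materialize_types(
    typed: List[Tuple[str, str]], parent: Dict[str, Optional[str]]
) -> Set[Tuple[str, str]]:
    """Return the given (product, type) pairs plus one pair per implied superclass."""
    result: Set[Tuple[str, str]] = set(typed)
    for product, t in typed:
        for sup in superclasses(t, parent):
            result.add((product, sup))
    return result


# Only the most specific type is given, as in the dataset before the fix.
explicit = materialize_types([("Product1", "ProductType3")], PARENT)
```

In this toy case one type assertion becomes three, which gives a feel for why making the superclasses explicit adds a substantial number of triples to the dataset.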
At this point, we run BSBM explore for the first time. To what degree does the 37.5 Gt predict the 500 Gt behavior? First, there is an overflow that causes a query plan cost to come out negative if the default graph is specified. This is a bona fide software bug you don't get unless a sample is quite large. Also, we note that starting the databases takes a few minutes due to disk. Further, the first query takes a long time to compile, again because of sampling the database for overall statistics.
The statistics are therefore gathered by running a few queries, and then saved. Subsequent runs will reload the stats from the file system, saving some minutes of start time. There are functions for this, stat_import and stat_export. These are used for a similar purpose by some users.
On day 10, Wednesday August 20, we have some results of BSBM explore.
Then, we get into BSBM updates. The BSBM generator makes an update dataset, but it cannot be made large enough. The BSBM test driver suite is by now hated and feared in equal measure. Is it bad in and of itself? Depends. It was certainly not made for large data. Anyway, no fix will be attempted this time. Instead, a couple of SQL procedures are made to drive a random update workload. These can run long enough to get a steady state with warm cache, which is what any OLTP measurement needs.
On day 12, some updates are measured, with a one hour ramp-up to steady-state, but these are not quite the right mix, since these are products only and the mix needs to contain offers and reviews also. The first steady-state rate was 109 Kt/s, a full 50x less than the bulk load, but then this was very badly bound by latency. So, the updates are adjusted to have more variety. The final measurement was on day 17. Now the steady-state rate is 263 Kt/s, which is better but still quite bound by network. By adding diversity to the dataset, we get slammed by a sharp rise in warm-up time (now 2 hours to be at 230 Kt/s), at which point we launch the explore mix to be timed during update. Time is short and we do not want to find out exactly how long it takes to reach the plateau in insert rate. As it happens, the explore mix is hardly slowed down by the updates, but the updates get hit worse, so that the rate goes to about 1/3 of what it was, then comes back up when the explore is finished. Finally, half an hour after this, there is a steady state of 263 Kt/s update rate.
Of course, the main object of the festivities is still the business intelligence (BI) mix. This is our (specifically, Orri's) own invention from years back, subsequently formulated in SPARQL by FU Berlin (Andreas Schultz). Well, it is already something to do big joins with 150 Gt, all on index and vectored random access, as was done in January 2013, the last time results were published on the CWI cluster. You may remember that there was an aborted attempt in January 2014. So now, with the LOD2 end date under two weeks away, we will take the BI racer out for a spin with 500 Gt. This is now a very different proposition from Jan 2013, as we have by now done the whole TPC-H work documented on this blog. This serves to show, inter alia, that we can run with the best in the much bigger and harder mainstream database sports. The full benefits of this will be realized for the semantic data public still this year, so this is more than personal vanity.
So we will see. The BI mix is not exactly TPC-H, but what is good for one is good for the other. Checking that the plans are good on the 37 Gt scale model is done around day 12. On day 13, we try this on the larger cluster. You never know: pushing the envelope, even when you know what you are doing and have written the whole thing, is still a dive in the fog. Claiming otherwise would not be credible. The iceberg which first emerges is overflow and partition skew. Well, there can be a lot of messages if all messages go via the same path. So we make the data structure different and retry and now die from out of memory. On the scale model, this looks like a little imbalance you don't bother to notice; at 13x scale, this kills. So, as is the case with most database problems, the query plan is bad. Instead of using a PSOG index, it uses a POSG index, and there is a constant for O. Partitioning is on either S or O, whichever is first, so with a constant O everything lands on the same partition. Not hard to fix, but still needs a cost-model adjustment to penalize low-cardinality partition columns. This is something you don't get with TPC-H, where there are hardly any indices. Once this is fixed there are other problems, such as Q5, which we ended up leaving out. On the scale model it works; the large one does not produce a plan, because some search-space corner is visited that is not visited in the scale model, due to different ratios of things in the cost model. Could be a couple of days to track; this is complex stuff. So we dropped it. It is not a big part of the metric, and its omission is immaterial to the broader claim of handling 500 Gt in all safety and comfort. The moral is: never get stuck; only do what is predictable, insofar as anything in this shadowy frontier is such.
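The partition skew from the POSG plan can be illustrated with a small Python sketch. This is not Virtuoso's partitioning code: the hash function (`zlib.crc32`), the partition count, and the names `partition_of` and `rows_per_partition` are assumptions made for illustration. The only thing taken from the text is the rule that the partition column is whichever of S or O comes first in the index.

```python
import zlib
from collections import Counter
from typing import Dict, List

N_PARTITIONS = 8  # hypothetical partition count


def partition_of(key: str, n: int = N_PARTITIONS) -> int:
    """Map a key to a partition with a stand-in hash (an assumption, not Virtuoso's)."""
    return zlib.crc32(key.encode("utf-8")) % n


def rows_per_partition(quads: List[Dict[str, str]], index_order: str) -> Counter:
    """Count rows per partition for an index such as 'PSOG' or 'POSG'.

    The partition column is whichever of S or O appears first in the index.
    """
    col = next(c for c in index_order if c in "SO")
    return Counter(partition_of(q[col]) for q in quads)


# 10,000 quads sharing one predicate and one constant object, as in a
# lookup where O is fixed by the query.
quads = [
    {"P": "rdf:type", "S": f"product{i}", "O": "ProductType1", "G": "g"}
    for i in range(10000)
]

psog = rows_per_partition(quads, "PSOG")  # partitioned on S: rows spread out
posg = rows_per_partition(quads, "POSG")  # partitioned on O: one partition gets all
```

With the object constant, every row of the POSG access hashes to the same partition, so a single node receives all the work and all the messages. Partitioning on S spreads the same rows over several partitions, which is why a cost model that penalizes low-cardinality partition columns steers the plan away from this case.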
So, on days 15 and 16, the BI mix that is reported was run. The multiuser score was negatively impacted by memory skew, which caused some swapping on one of the nodes, but the run finished in about 2 hours anyway. The peak of transient memory consumption is another thing that you cannot forecast with exact precision. There is no model for that; the query streams are in random order, and you just have to try. And it is a few hours per iteration, so you don't want to be stuck doing that either. A rerun would get a higher multiuser BI score; maybe one will be made but not before all the rest is wrapped up.
Now we are talking 2 hours, versus 9 hours with the 150 Gt set back in January 2013. So 3.3x the data, 4.5x less time, 1.5x the gear. This comes out at one order of magnitude. With a better score from better memory balance and some other fixes, a 15x improvement for BSBM BI is in the cards.
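The order-of-magnitude figure can be checked with a few lines of Python. This is one reading of the arithmetic, assuming the improvement is taken as data volume processed per unit time, normalized by the amount of hardware; the variable names are only for illustration, and the 1.5x hardware factor is used as stated in the text.

```python
# Figures from the text: 500 Gt in about 2 hours now, versus 150 Gt in 9 hours
# in January 2013, on 1.5x the hardware.
data_ratio = 500 / 150   # about 3.3x the data
time_ratio = 9 / 2       # 4.5x less time
gear_ratio = 1.5         # 1.5x the gear, as stated

# Assumed reading: throughput gain divided by the hardware gain.
improvement = data_ratio * time_ratio / gear_ratio

print(f"data {data_ratio:.1f}x, time {time_ratio:.1f}x, per-gear improvement {improvement:.1f}x")
```

The data and time factors alone multiply to 15x in raw throughput; dividing by the 1.5x hardware gives the factor of ten per unit of gear.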
The final explore runs were made on day 18, while writing the report to be published in the LOD2 deliverables repository. The report contains in-depth discussion of the query plans and diverse database tricks and their effectiveness.
The overall moral of this trip into these uncharted spaces is this: Expect things to break. You have to be the designer and author of the system to take it past its limits. You will cut it or you won't, and nobody can do anything about it, not with the best intentions, nor even with the best expertise, both of which were present. This is true of last-minute daredevil stuff like this; if you have a year full time instead of the last 20 days of a project, all is quite different, and these things are more leisurely. This might then become a committee affair, though, which has different problems. In the end, the Virtuoso DBMS has never thrown anything at us we could not handle. The uncertainty in trips of this sort is with the hardware platform, of which we had to replace 2 units to get on the way, and with how fast you can locate and fix a software problem. So you pick the quickest ones and leave the uncertain aside.
There is another category of rare events, like network failures that in theory cannot happen. Yet they do. So, to program a cluster, you have to have some recovery mechanisms for these. We saw a couple of these along the way. Reproducing these can take days, and whether this correlates with specific links or is a bona fide software thing is time-consuming to prove, and getting into this is a sure way to lose the race. These seem to occur at load peaks outside of steady-state; steady-state is in fact very steady once it is there. Except at the start, network glitches were not a big factor in these experiments. The bulk of these went away after replacing a machine. After this we twice witnessed something that cannot exist but knew better than to get stuck with that. Neither incident happened again. This is over days of running at a cross-sectional 1 GB/s of traffic. These are the truly unpredictable, and, in a crash course like this, can sink the whole gig no matter how good you are.
Thanks are due to CWI and especially Peter Boncz for providing the race track as well as advice and support.
In the next installments of this series, we will look at how schema and characteristic sets will deliver the promise of RDF without its cost. All the experiments so far were done with a quads table, as always before. So we could say that the present level is close to the limit of the achievable within this physical model. The future lies beyond the misconception of triples/quads as primary physical model.
To be continued...
About this entry:
Author: Virtuso Data Space Bot
Published: 08/29/2014 12:50 GMT-0500