Orri Erling


Post Categories

Recent Articles

Display Settings

articles per page.
SEMANTiCS 2014 (part 3 of 3): Conversations

I was asked for an oracular statement about the future of relational database (RDBMS) at the conference. The answer, without doubt or hesitation, is that this is forever. But this does not mean that the RDBMS world would be immutable, quite the opposite.

The specializations converge. The RDBMS becomes more adaptable and less schema-first. Of course the RDBMS also take new data models beside the relational. RDF and other property graph models, for instance.

The schema-last-ness is now well in evidence. For example, PostgreSQL has an hstore column type which is a list of key-value pairs. Vertica has a feature called flex tables where a column can be added on a row-by-row basis.

Specialized indexing for text and geometries is a well established practice. However, dedicated IR systems, often Lucene derivatives, can offer more transparency in the IR domain for things like vector-space-models and hit-scoring. There is specialized faceted search support which is quite good. I do not know of an RDBMS that would do the exact same trick as Lucene for facets, but, of course, in the forever expanding scope of RDB, this is added easily enough.

JSON is all the rage in the web developer world. Phil Archer even said in his keynote, as a parody of the web developer: " I will never touch that crap of RDF or the semantic web; this is a pipe dream of reality ignoring academics and I will not have it. I will only use JSON-LD."

XML and JSON are much the same thing. While most databases have had XML support for over a decade, there is a crop of specialized JSON systems like MongoDB. PostgreSQL also has a JSON datatype. Unsurprisingly, MarkLogic too has JSON, as this is pretty much the same thing as their core competence of XML.

Virtuoso, too, naturally has a JSON parser, and mapping this to the native XML data type is a non-issue. This should probably be done.

Stefano Bertolo of the EC, also LOD2 project officer, used the word Cambrian explosion when talking about the proliferation of new database approaches in recent years.

Hadoop is a big factor in some environments. Actian Vector (née VectorWise), for example, can use this as its file system. HDFS is singularly cumbersome for this but still not impossible and riding the Hadoop bandwagon makes this adaptation likely worthwhile.

Graphs are popular in database research. We have a good deal of exposure to this via LDBC. Going back to an API for database access, as is often done in graph database, can have its point, especially as a reaction to the opaque and sometimes hard to predict query optimization of declarative languages. This just keeps getting more complex, so a counter-reaction is understandable. APIs are good if crossed infrequently and bad otherwise. So, graph database APIs will develop vectoring, is my prediction and even recommendation in LDBC deliverables.

So, there are diverse responses to the same evolutionary pressures. These are of initial necessity one-off special-purpose systems, since the time to solution is manageable. Doing these things inside an RDBMS usually takes longer. The geek also likes to start from scratch. Well, not always, as there have been some cases of grafting some entirely non-MySQL-like functionality, e.g. Infobright and Kickfire, onto MySQL.

From the Virtuoso angle, adding new data and control structures has been done many times. There is no reason why this cannot continue. The next instances will consist of some graph processing (BSP, or Bulk Synchronous Processing) in the query languages. Another recent example is an interface for pluggable specialized content indices. One can make chemical structure indices, use alternate full text indices, etc., with this.

Most of this diversification has to do with physical design. The common logical side is a demand for more flexibility in schema and sometimes in scaling, e.g., various forms of elasticity in growing scale-out clusters, especially with the big web players.

The diversification is a fact, but the results tend to migrate into the RDBMS given enough time.

On the other hand, when a new species like the RDF store emerges, with products that do this and no other thing and are numerous enough to form a market, the RDBMS functionality seeps in. Bigdata has a sort of multicolumn table feature, if I am not mistaken. We just heard about the wish for strict schema, views, and triggers. By all means.

From the Virtuoso angle, with structure awareness, the difference of SQL and RDF gradually fades, and any advance can be exploited to equal effect on either side.

Right now, I would say we have convergence when all the experimental streams feel many of the same necessities.

Of course you cannot have a semantic tech conference without the matter of the public SPARQL end point coming up. The answer is very simple: If you have operational need for SPARQL accessible data, you must have your own infrastructure. No public end points. Public end points are for lookups and discovery; sort of a dataset demo. If operational data is in all other instances the responsibility of the one running the operation, why should it be otherwise here? Outsourcing is of course possible, either for platform (cloud) or software (SaaS). To outsource something with a service level, the service level must be specifiable. A service level cannot be specified in terms of throughput with arbitrary queries but in terms of well defined transactions; hence the services world runs via APIs, as in the case of Open PHACTS. For arbitrary queries (i.e., analytics on demand), with the huge variation in performance dependent on query plans and configuration of schema, the best is to try these things with platform on demand in a cloud. Like this, there can be a clear understanding of performance, which cannot be had with an entirely uncontrolled concurrent utilization. For systems in constant operation, having one's own equipment is cheaper, but still might be impossible to procure due to governance.

Having clarified this, the incentives for operators also become clearer. A public end point is a free evaluation; a SaaS deal or product sale is the commercial offering.

Anyway, common datasets like DBpedia are available preconfigured on AWS with a Virtuoso server. For larger data, there is a point to making ready-to-run cluster configurations available for evaluation, now that AWS has suitable equipment (e.g., dual E5 2670 with 240 GB RAM and SSD for USD 2.8 an hour). According to Amazon, up to five of these are available at a time without special request. We will try this during the fall and make the images available.

SEMANTiCS 2014 Series

# PermaLink Comments [0]
09/08/2014 16:11 GMT
SEMANTiCS 2014 (part 2 of 3): RDF Data Shapes

The first keynote of Semantics 2014 was by Phil Archer of the W3C, entitled "10 Years of Achievement." After my talk, in the questions, Phil brought up the matter of the upcoming W3C work group charter on RDF Data Shapes. We had discussed this already at the reception the night before and I will here give some ideas about this.

After the talk, my answer was that naturally the existence of something that expressed the same sort of thing as SQL DDL, with W3C backing, can only be a good thing and will give the structure awareness work by OpenLink in Virtuoso and probably others a more official seal of approval. Quite importantly, this will be a facilitator of interoperability and will raise this from a product specific optimization trick to a respectable, generally-approved piece of functionality.

This is the general gist of the matter and can hardly be otherwise. But underneath is a whole world of details, which we discussed at the reception.

Phil noted that there was controversy around whether a lightweight OWL-style representation or SPIN should function as the basis for data shapes.

Phil stated in the keynote that the W3C considered the RDF series of standards as good and complete, but would still have working groups for filling in gaps as these came up. This is what I had understood from my previous talks with him at the Linking Geospatial Data workshop in London earlier this year.

So, against this backdrop, as well as what I had discussed with Ralph Hodgson of Top Quadrant at a previous LDBC TUC meeting in Amsterdam, SPIN seems to me a good fit.

Now, it turns out that we are talking about two different use cases. Phil said that the RDF Data Shapes use case was about making explicit what applications required of data. For example, all products should have a unit price, and this should have one value that is a number.

The SPIN proposition on the other hand, as Ralph himself put it in the LDBC meeting, is providing to the linked data space functionality that roughly corresponds to SQL views. Well, this is one major point, but SPIN involves more than this.

So, is it DDL or views? These are quite different. I proposed to Phil that there was in fact little point in fighting over this; best to just have two profiles.

To be quite exact, even SQL DDL equivalence is tricky, since enforcing this requires a DBMS; consider, for instance, foreign key and check constraints. At the reception, Phil stressed that SPIN was certainly good but since it could not be conceived without a SPARQL implementation, it was too heavy to use as a filter for an application that, for example, just processed a stream of triples.

The point, as I see it, is that there is a wish to have data shape enforcement, at least to a level, in a form that can apply to a stream without random access capability or general purpose query language. This can make sense for some big data style applications, like an ETL-stage pre-cooking of data before the application. Applications mostly run against a DBMS, but in some cases, this could be a specialized map-reduce or graph analytics job also, so no low cost random access.

My own take is that views are quite necessary, especially for complex query; this is why Virtuoso has the SPARQL macro extension. This will do, by query expansion, a large part of what general purpose inference will do, except for complex recursive cases. Simple recursive cases come down to transitivity and still fit the profile. SPIN is a more generic thing, but has a large intersection with SPARQL macro functionality.

My other take is that structure awareness needs a way of talking about structure. This is a use case that is clearly distinct from views.

A favorite example of mine is the business rule that a good customer is one that has ordered more than 5 times in the last year, for a total of more than so much, and has no returns or complaints. This can be stated as a macro or SPIN rule with some aggregates and existences. This cannot be stated in any of the OWL profiles. When presented with this, Phil said that this was not the use case. Fair enough. I would not want to describe what amounts to SQL DDL in these terms either.

A related topic that has come up in other conversations is the equivalent of the trigger. One use case of this is enforcement of business rules and complex access rights for updates. So, we see that the whole RDBMS repertoire is getting recreated.

Now, talking from the viewpoint of the structure-aware RDF store, or the triple-stream application for that matter, I will outline some of what data shapes should do. The triggers and views matter is left out, here.

The commonality of bulk-load, ETL, and stream processing, is that they should not rely on arbitrary database access. This would slow them down. Still, they must check the following sorts of things:

  • Data types
  • Presence of some required attributes
  • Cardinality — e.g., a person has no more than one date of birth
  • Ranges — e.g., a product's price is a positive number; gender is male/female; etc.
  • Limited referential integrity — e.g., a product has one product type, and this is a subject of the RDF type product type.
  • Limited intra-subject checks — e.g.. delivery date is greater-than-or-equal-to ship date.

All these checks depend on previous triples about the subject; for example, these checks may be conditional on the subject having a certain RDF type. In a data model with a join per attribute, some joining cannot be excluded. Checking conditions that can be resolved one triple at a time is probably not enough, at least not for the structure-aware RDF store case.

But, to avoid arbitrary joins which would require a DBMS, we have to introduce a processing window. The triples in the window must be cross-checkable within the window. With RDF set semantics, some reference data may be replicated among processing windows (e.g., files) with no ill effect.

A version of foreign key declarations is useful. To fit within a processing window, complete enforcement may not be possible but the declaration should still be possible, a little like in SQL where one can turn off checking.

In SQL, it is conventional to name columns by prefixing them with an abbreviation of the table name. All the TPC schemas are like that, for example. Generally in coding, it is good to prefix names with data type or subsystem abbreviation. In RDF, this is not the practice. For reuse of vocabularies, where a property may occur in anything, the namespace or other prefix denotes where the property comes from, not where it occurs.

So, in TPC-H, l_partkey and ps_partkey are both foreign keys that refer to part, plus that l_partkey is also a part of a composite foreign key to partsupp. By RDF practices, these would be called rdfh:hasPart. So, depending on which subject type we have, rdfh:hasPart is 30:1 or 4:1. (distinct subjects:distinct objects) Due to this usage, the property's features are not dependent only on the property, but on the property plus the subject/object where it occurs.

In the relational model, when there is a parent and a child item (one to many), the child item usually has a composite key prefixed with the parent's key, with a distinguishing column appended, e.g., l_orderkey, l_linenumber. In RDF, this is rdfh:hasOrder as a property of the lineitem subject. In SQL, there is no single part lineitem subject at all, but in RDF, one must be made since everything must be referenceable with a single value. This does not have to matter very much, as long as it is possible to declare that lineitems will be primarily accessed via their order. It is either this or a scan of all lineitems. Sometimes a group of lineitems are accessed by the composite foreign key of l_partkey, l_suppkey. There could be a composite index on these. Furthermore, for each l_partkey, l_suppkey in lineitem there exists a partsupp. In an RDF translation, the rdfh:hasPart and rdfh:hasSupplier, when they occur in a lineitem subject, specify exactly one subject of type partsupp. When they occur in a partsupp subject, they are unique as a pair. Again, because names are not explicit as to where they occur and what role they play, the referential properties do not depend only on the name, but on the name plus included data shape. Declaring and checking all this is conventional in the mainstream and actually useful for query optimization also.

Take the other example of a social network where the foaf:knows edge is qualified by a date when this edge was created. This may be by reification, or more usually by an "entitized" relationship where the foaf:knows is made into a subject with the persons who know each other and the date of acquaintance as properties. In a SQL schema, this is a key person1, person2 -> date. In RDF, there are two join steps to go from person1 to person2; in SQL, 1. This is eliminated by saying that the foaf:knows entity is usually referenced by the person1 Object or person2 Object, not the Subject identifier of the foaf:knows.

This allows making the physical storage by O, S, G -> O2, O3, …. A secondary index with S, G, O still allows access by the mandatory subject identifier. In SQL, a structure like this is called a clustered table. In other words, the row is arranged contiguous with a key that is not necessarily the primary key.

So, identifying a clustering key in RDF can be important.

Identifying whether there are value-based accesses on a given Object without making the Object a clustering key is also important. This is equivalent to creating a secondary index in SQL. In the tradition of homogenous access by anything, such indexing may be on by default, except if the property is explicitly declared of low cardinality. For example, an index on gender makes no sense. The same is most often true of rdfs:type. Some properties may have many distinct values (e.g., price), but are still not good for indexing, as this makes for the extreme difference in load time between SQL and the all-indexing RDF.

Identifying whether a column will be frequently updated is another useful thing. This will turn off indexing and use an easy-to-update physical representation. Plus, properties which are frequently updated are best put physically together. This may, for example, guide the choice between row-wise and column-wise representation. A customer's account balance and orders year-to-date would be an example of such properties.

Some short string valued properties may be frequently returned or used as sorting keys. This requires accessing the literal via an ID in the dictionary table. Non-string literals, numbers, dates, etc., are always inlined (at least in most implementations), but strings are a special question. Bigdata and early versions of Virtuoso would inline short ones; later versions of Virtuoso would not. So specifying, per property/class combination, a length limit for an inlined string is very high gain and trivial to do. The BSBM explore score at large scales can get a factor of 2 gain just from inlining one label. BSBM is out of its league here, but this is still really true and yields benefits across the board. The simpler the application, the greater the win.

If there are foreign keys, then data should be loaded with the referenced entities first. This makes dimensional clustering possible at load time. If the foreign key is frequently used for accessing the referencing item (for example, if customers are often accessed by country), then loading customers so that customers of the same country end up next to each other can result in great gains. The same applies to a time dimension, which in SQL is often done as a dimension table, but rarely so in linked data. Anyhow, if date is a frequent selection criterion, physically putting items in certain date ranges together can give great gains.

The trick here is not necessarily to index on date, but rather to use zone maps (aka min/max index). If nearby values are together, then just storing a min-max value for thousands of consecutive column values is very compact and fast to check, provided that the rows have nearby values. Actian Vector's (VectorWise) prowess in TPC-H is in part from smart use of date order in this style.

To recap, the data shapes desiderata from the viewpoint of guiding physical storage is as follows:

(I will use "data shape" to mean "characteristic set," or "set of Subjects subject to the same set of constraints." A Subject belonging to a data shape may be determined either by its rdfs:type or by the fact of it having, within the processing window, all or some of a set of properties.)

  • All normal range, domain, cardinality, optionality, etc. Specifically, declaring something as single valued (as with SQL's UNIQUE constraint) and mandatory (as with SQL's NOT NULL constraint) is good.
  • Primary access path — The Properties whose Objects are dominant access criteria is important
  • No-index — Declare that no index will be made on the Object of a Property within a data shape.
  • Inlined string — String values of up to so many characters in this data shape are inlined
  • Clustering key — The Subject identifiers will be picked to be correlated with the Object of this Property in this data shape. This can be qualified by a number of buckets (e.g., if dates are from 2000 to 2020, then this interval may be 100 buckets), with an exception bucket for out of range values.
  • No full text index — A string value will not need to be full text indexed in this Property even if full text indexing is generally on.
  • Full text index desired — This means that if the value of the property is a string, then the row must be locatable via this string. The string may or may not be inlined, but an index will exist on the literal ID of the string, e.g., POSG.
  • Co-location — This is akin to clustering but specifies, for a high cardinality Object, that the Subject identifier should be picked to fall in the same partition as the Object. The Object is typically a parent of the Subject being loaded; for example, the containing assembly of a sub-assembly. Traversing the assembly created in this way will be local on a scale-out system. This can also apply to geometries or text values: If primary access is by text or geo index, then the metadata represented as triples should be in the same partition as the entry in the full text/geo index.
  • Update group — A set of properties that will often change together. Implies no index and some form of co-location, plus update-friendly physical representation. Many update groups may exist, in which case they may or may not be collocated.
  • Composite foreign/primary key. A data shape can have a multicolumn foreign key, e.g., l_partkey, l_suppkey in lineitem with the matching primary key of ps_partkey, ps_suppkey in partsupp. This can be used for checking and for query optimization: Looking at l_partkey and l_suppkey as independent properties, the guess would be that there hardly ever exists a partsupp, whereas one does always exist. The XML standards stack also has a notion of a composite key for random access on multiple attributes.

These things have the semantic of "hint for physical storage" and may all be ignored without effect on semantics, at least if the data is constraint-compliant to start with.

These things will have some degree of reference implementation through the evolution of Virtuoso structure awareness, though not necessarily immediately. These are, to the semanticist, surely dirty low-level disgraceful un-abstractions, some of the very abominations the early semanticists abhorred or were blissfully ignorant of when they first raised their revolutionary standard.

Still, these are well-established principles of the broader science of database. SQL does not standardize some of these, nor does it have much need to, as the use of these features is system-specific. The support varies widely and the performance impacts are diverse. However, since RDF excels as a reference model and as a data interchange format, giving these indications as hints to back-end systems cannot hurt, and can make a difference of night and day in load and query time.

As Phil Archer said, the idea of RDF Data Shapes is for an application to say that "it will barf if it gets data that is not like this." An extension is for the data to say what the intended usage pattern is so that the system may optimize for this.

All these things may be learned from static analysis and workload traces. The danger of this is over-fitting a particular profile. This enters a gray area in benchmarking. For big data, if RDF is to be used as the logical model and the race is about highest absolute performance, never mind what the physical model ends up being, all this and more is necessary. And if one is stretching the envelope for scale, the race is always about highest absolute performance. For this reason, these things will figure at the leading edge with or without standardization. I would say that the build-up of experience in the RDBMS world is sufficient for these things to be included as hints in a profile of data shapes. The compliance cost will be nil if these are ignored, so for the W3C, these will not make the implementation effort for compliance with an eventual data shapes recommendation prohibitive.

The use case is primarily the data warehouse to go. If many departments or organizations publish data for eventual use by their peers, users within the organization may compose different combinations of extractions for different purposes. Exhaustive indexing of everything by default makes the process slow and needlessly expensive, as we have seen. Much of such exploration is bounded by load time. Federated approaches for analytics are just not good, even though they may work for infrequent lookups. If datasets are a commodity to be plugged in and out, the load and query investment must be minimized without the user/DBA having to run workload analysis and manual schema optimization. Therefore, bundling guidelines such as these with data shapes in a dataset manifest can do no harm and can in cases provide 10-50x gains in load speeds and 2-4x in space consumption, not to mention unbounded gains in query time, as good and bad plans easily differ by 10-100x, especially in analytics.

So, here is the pitch:

  • Dramatic gains in ad hoc user experience
  • Minimal effort by data publishers, as much of the physical guidelines can be made from workload trace and dataset; the point is that the ad hoc user does not have to do this.
  • Great optimization potential for system vendors; low cost for initial compliance
  • Better understanding of the science of performance by the semantic community

To be continued...

SEMANTiCS 2014 Series

# PermaLink Comments [0]
09/08/2014 15:22 GMT Modified: 09/08/2014 16:12 GMT
SEMANTiCS 2014 (part 1 of 3): Keynote

I was invited to give a keynote at SEMANTiCS 2014 in Leipzig, Germany last Thursday. I will here recap some of the main points, and comment on some of the ensuing controversy. The talk was initially titled Virtuoso, the Prometheus of RDF. Well, mythical Prometheus did perform a service but ended up paying for it. Still, the mythical reference is sometimes used when talking of major breakthroughs and big-gain ambitions. In the first slide, I changed it to Linked Data at Dawn, which is less product specific and more a reflection on the state of the linked data enterprise at large.

The first part of the talk was under the heading of the promise and the practice. The promise we know well and find no fault with: Schema-last-ness, persistent unique identifiers, self-describing data, some but not too much inference. The applications usually involve some form of integration and often have a mix of strictly structured content with semi-structured or textual content.

These values are by now uncontroversial and embraced by many; however, most instances of this embracing do not occur in the context of RDF as such. For example, the big online systems on the web: all have some schema-last (key-value) functionality. Applications involving long-term data retention have diverse means of having persistent IDs and self description, from UUIDs to having the table name in a column so that one can tell where a CSV dump came from.

The practice involves competing with diverse alternative technologies: SQL, key-value, information retrieval (often Lucene-derived). In some instances, graph databases occur as alternatives: Young semanticist, do or die.

In this race, linked data is often the prettiest and most flexible, but gets a hit on different aspects of performance and scalability. This is a database gig, and database is a performance game; make no mistake.

After these preliminaries we come to the "RDF tax," or the more or less intrinsic overheads of describing all as triples. The word "triple" is used by habit. In fact, we nearly always talk about quads, i.e., subject-predicate-object-graph (SPOG). The next slide is provocatively titled the Bane of the Triple, and is about why having all as triples is, on the surface, much like relational, except it makes life hard, where tables make it at least manageable, if still not altogether trivial.

The very first statement on the tax slide reads "90% of bad performance comes from non-optimal query plans." If one does triples in the customary way (i.e., a table of quads plus dictionary tables to map URIs and literal strings to internal IDs), one incurs certain fixed costs.

These costs are deemed acceptable by users who deploy linked data. If these costs were not acceptable, the proof of concept would have already disqualified linked data.

The support cases that come my way are nearly always about things taking too much time. Much less frequently, are these about something unambiguously not working. Database has well defined semantics, so whether something works or not is clear cut.

So, support cases are overwhelmingly about query optimization. The problems fall in two categories:

  • The plan is good in the end, but it takes much longer to make the plan than to execute it.
  • The plan either does the wrong things or does things in the wrong order, but produces a correct result.

Getting no plan at all or getting a clearly wrong result is much less frequent.

If the RDF overheads incurred with a good query plan were show stoppers, the show would have already stopped.

So, let's look at this in more detail; then we will talk about the fixed overheads.

The join selectivity of triple patterns is correlated. Some properties occur together all the time; some occur rarely; some not at all. Some property values can be correlated, i.e., order number and order date. Capturing these by sampling in a multicolumn table is easy; capturing this in triples would require doing the join in the cost model, which is not done since it would further extend compilation times. When everything is a join, selectivity estimation errors build up fast. When everything is a join, the space of possible graph query plans explodes as opposed to tables; thus, while the full plan space can be covered with 7 tables, it cannot be covered with 18 triple patterns. This is not factorial (number of permutations). For different join types (index/hash) and the different compositions of the hash build side, this is much worse, in some nameless outer space fringe of non-polynomiality.

TPC-H can be run with success because the cost model hits the right plan every time. The primary reason for this is the fact that the schema and queries unambiguously suggest the structure, even without foreign key declarations. The other reason is that with a handful of tables, all plans can be reviewed, and the cost model reliably tells how many rows will result from each sequence of operations.

Try this with triples; you will know what I mean.

Now, some people have suggested purely rule-based models of SPARQL query compilation. These are arguably faster to run and more predictable. But the thing that must be done, yet will not be done with these, is the right trade-off between index and hash. This is the crux of the matter, and without this, one can forget about anything but lookups. The choice depends on reliable estimation of cardinality (number of rows, number of distinct keys) on either side of the join. Quantity, not pattern matching.

Well, many linked data applications are lookups. The graph database API world is sometimes attractive because it gives manual control. Map reduce in the analytical space is sometimes attractive for the same reason.

On the other hand, query languages also give manual control, but then this depends on system specific hints and cheats. People are often black and white: Either all declarative or all imperative. We stand for declarative, but still allow physical control of plan, like most DBMS.

To round off, I will give a concrete example:

{  ?thing  rdfs:label    ?lbl         . 
   ?thing  dc:title      ?title       . 
   ?lbl    bif:contains  "gizmo"      . 
   ?title  bif:contains  "widget"     . 
   ?thing  a             xx:Document  . 
   ?thing  dc:date       ?dt          . 
   FILTER  ( ?dt  > "2014-01-01"^^xsd:date ) 

There are two full text conditions, one date, and one class, all on the same subject. How do you do this? Most selective text first, then get the data and check, then check the second full text given the literal and the condition, then check the class? Wrong. If widgets and gizmos are both frequent and most documents new, this is very bad because using a text index to check for a specific ID having a specific string is not easily vectorable. So, the right plan is: Take the more selective text expression, then check the date and class for the results, put the ?things in a hash table. Then do the less selective text condition, and drop the ones that are not in the hash table. Easily 10x better. Simple? In the end yes, but you do not know this unless you know the quantities.

This gives the general flavor of the problem. Doing this with TPC-H in RDF is way harder, but you catch my drift.

Each individual instance is do-able. Having closer and closer alignment between reality and prediction will improve the situation indefinitely, but since the space is as good as infinite there cannot be a guarantee of optimality except for toy cases.

The Gordian Knot shall not be defeated with pincers but by the sword.

We will come to this in a bit.

Now, let us talk of the fixed overheads. The embarrassments are in the query optimization domain; the daily grind, relative cost, and provisioning are in this one.

The overheads come from:

  • Indexing everything
  • Having literals and URI strings via dictionary
  • Having a join for every attribute

These all fall under the category of having little to no physical design room.

In the indexing everything department, we load 100 GB TPC-H in 15 minutes in SQL with ordering only on primary keys and almost no other indexing. The equivalent with triples is around 12 hours. This data can be found on this blog (TPC-H series and Meeting the Challenges of Linked Data in the Enterprise). This is on the order of confusing a screwdriver with a hammer. If the nail is not too big, the wood not too hard, and you hit it just right — the nail might still go in. The RDF bulk load is close to the fastest possible given the general constraints of what it does. The same logic is used for the record-breaking 15 minutes of TPC-H bulk load, so the code is good. But indexing everything is just silly.

The second, namely the dictionary of URIs and literals, is a dual edge. I talked to Bryan Thompson of SYSTAP (Bigdata RDF store) in D.C. at the ICDE there. He said that they do short strings inline and long ones via dictionary. I said we used to do the same but stopped in the interest of better compression. What is best depends on workload and working-set-to-memory ratio. But if you must make the choice once and for all, or at least as a database-wide global setting, you are between a rock and a hard place. Physical vs. logical design, again.

The other aspect of this is the applications that do regexps on URI strings or literals. Doing this is like driving a Formula 1 race in reverse gear. Use a text index. Always. This is why most implementations have one even though SPARQL itself makes no provisions for this. If you really need regexps, and on supposedly opaque URIs at that, tokenize them and put them in a text index as a text literal. Or if an inverted-file-word index is really not what you need, use a trigram one. So far, nobody has wanted one hard enough for us to offer this, even though this is easy enough. But special indices for special data types (e.g., chemical structure) are sometimes wanted, and we have a generic solution for all this, to be introduced shortly on this blog. Again, physical design.

I deliberately name the self-join-per-attribute point last, even though this is often the first and only intrinsic overhead that is named. True, if the physical model is triples, each attribute is a join against the triple table. Vectored execution and right use of hash-join help, though. The Star Schema Benchmark SQL to SPARQL gap is only 2.5x, as documented last year on this blog. This makes SPARQL win by 100+x against MySQL and lose by only 0.8x against column store pioneer MonetDB. Let it be said that this is so far the best case and that the gap is wider in pretty much all other cases. This gap is well and truly due to the self-join matter, even after the self-joins are done vectored, local, ordered; in one word, right. The literal and URI translation matter plays no role here. The needless indexing hurts at load but has no effect at query time, since none of the bloat participates in the running. Again, physical design.

Triples are done right, so?

In the summer of 2013, after the Star Schema results, it became clear that maybe further gains could be had and query optimization made smoother and more predictable, but that these would be paths of certain progress but with diminishing returns per effort. No, not the pincers; give me the sword. So, between fall 2013 and spring 2014, aside from doing diverse maintenance, I did the TPC-H series. This is the proficiency run for big league databases; the America's Cup, not a regatta on the semantic lake.

Even if the audience is principally Linked Data, the baseline must be that of the senior science of SQL.

It stands to reason and has been demonstrated by extensive experimentation at CWI that RDF data, by and large, has structure. This structure will carry linked data through the last mile to being a real runner against the alternative technologies (SQL, IR, key value) mentioned earlier.

The operative principles have been mentioned earlier and are set forth on the slides. In forthcoming articles I will display some results.

One important proposal for structure awareness was by Thomas Neumann in an RDF3X paper introducing characteristic sets. There, the application was creation of more predictable cost estimates. Neumann correctly saw this as possibly the greatest barrier to predictable RDF performance. Peter Boncz and I discussed the use of this for physical optimization once when driving back to Amsterdam from a LOD2 review in Luxembourg. Pham Minh Duc of CWI did much of the schema discovery research, documented in the now published LOD2 book (Linked Open Data -- Creating Knowledge Out of Interlinked Data). The initial Virtuoso implementation had to wait for the TPC-H and general squeezing of the quads model to be near complete. It will likely turn out that the greatest gain of all with structure awareness will be bringing optimization predictability to SQL levels. This will open the whole bag of tricks known to data warehousing to safe deployment for linked data. Of course, much of this has to do with exploiting physical layout; hence it also needs the physical model to be adapted. Many of these techniques have high negative impact if used in the wrong place; hence the cost model must guess right. But they work in SQL and, as per Thomas Neumann's initial vision, there is no reason why these would not do so in a schema-less model if adapted in a smart enough manner.

All this gives rise to some sociological or psychological observations. Jens Lehmann asked me why now, why not earlier; after all, over the years many people have suggested property tables and other structured representations. This is now because there is no further breakthroughs within an undifferentiated physical model.

For completeness, we must here mention other approaches to alternative, if still undifferentiated, physical models. A number of research papers mention memory-only, pointer-based (i.e., no index, no hash-join) implementations of triples or quads. Some of these are on graph processing frameworks, some stand-alone. Yarc Data is a commercial implementation that falls in this category. These may have higher top speeds than column stores, even after all vectoring and related optimizations. However the space utilization is perforce larger than with optimum column compression and this plus the requirement of 100% in memory makes these more expensive to scale. The linked data proposition is usually about integration, and this implies initially large data even if not all ends up being used.

The graph analytics, pointer-based item will be specially good for a per-application extraction, as suggested by Oracle in their paper at GRADES 13. No doubt this will come under discussion at LDBC, where Oracle Labs is now a participant.

But back to physical model. What we have in mind is relational column store — multicolumn-ordered column-wise compressed tables — a bit like Vertica and Virtuoso in SQL mode for the regular parts and quads for the rest. What is big is regular, since a big thing perforce comes from something that happens a lot, like click streams, commercial transactions, instrument readings. For the 8-lane-motorway of regular data, you get the F1 racer with the hardcore best in column store tech. When the autobahn ends and turns into the mountain trail, the engine morphs into a dirt bike.

This is complex enough, and until all the easy gains have been extracted from quads, there is little incentive. Plus this has the prerequisite of quads done right, plus the need for top of the line relational capability for not falling on your face once the speedway begins.

Steve Buxton of MarkLogic gave a talk right before mine. Coming from a document-centric world, it stands to reason that MarkLogic would have a whole continuum of different mixes between SPARQL and document oriented queries. Steve correctly observed that some users found this great; others found this a near blasphemy, an unholy heterodoxy of confusing distinct principles.

This is our experience as well, since usage of XML fragments in SPARQL with XPath and such things in Virtuoso is possible but very seldom practiced. This is not the same as MarkLogic, though, as MarkLogic is about triples-in-documents, and the Virtuoso take is more like documents-in-triples. Not to mention that use of SQL and stored procedures in Virtuoso is rare among the SPARQL users.

The whole thing about the absence of physical design in RDF is a related, but broader instance of such purism.

In my talk, I had a slide titled The Cycle of Adventure, generally philosophizing on the dynamics of innovation. All progress begins with an irritation with the status quo; to mention a few examples: the No-SQL rebellion; the rejection of parallel SQL database in favor of key-value and map-reduce; the admission that central schema authority at web scale is impossible; the anti-ACID stance when having wide-area geographies to deal with. The stage of radicalism tends to discard the baby with the bathwater. But when the purists have their own enclave, free of the noxious corruption of the rejected world, they find that life is hard and defects of human character persist, even when all subscribe to the same religion. Of course, here we may have further splinter groups. After this, the dogma adapts to reality: the truly valuable insights of the original rebellion gain in appreciation, and the extremism becomes more moderate. Finally there is integration with mainstream, which becomes enriched by new content.

By the time the term Linked Data came to broad use, the RDF enterprise had its break-away colonies that started to shed some of the initial zeal. By now, we have the last phase of reconciliation in its early stages.

This process is in principle complete when linked data is no longer a radical bet, but a technology to be routinely applied to data when the nature of the data fits the profile. The structure awareness and other technology discussed here will mostly eliminate the differential in deployment cost.

The spreading perception of an expertise gap in this domain will even-out the cost in terms of personnel. The flexibility gains that were the initial drive for the movement will be more widely enjoyed when these factors fuel broader adoption.

To help this along, we have LDBC, the Linked Data Benchmark Council, with the agenda of creating industry consensus on measuring progress across the linked data and graph DB frontiers. I duly invited MarkLogic to join.

There were many other interesting conversations at the conference, I will later comment on these.

To be continued...

SEMANTiCS 2014 Series

# PermaLink Comments [0]
09/08/2014 13:17 GMT Modified: 09/08/2014 16:12 GMT
LOD2 Finale (part 3 of n): The 500 Giga-triple Runs

In the evening of day 8, we have kernel settings in the cluster changed to allow more mmaps. At this point, we notice that the dataset is missing the implied types of products; i.e., the most specific type is given but its superclasses are not directly associated with the product. We have always run this with this unique inference materialized, which is also how the data generator makes the data, with the right switch. But the switch was not used. So a further 10 Gt (Giga-triples) are added, by running a SQL script to make the superclasses explicit.

At this point, we run BSBM explore for the first time. To what degree does the 37.5 Gt predict the 500 Gt behavior? First, there is an overflow that causes a query plan cost to come out negative if the default graph is specified. This is a bona fide software bug you don't get unless a sample is quite large. Also, we note that starting the databases takes a few minutes due to disk. Further, the first query takes a long time to compile, again because of sampling the database for overall statistics.

The statistics are therefore gathered by running a few queries, and then saved. Subsequent runs will reload the stats from the file system, saving some minutes of start time. There is a function for this, stat_import and stat_export. These are used for a similar purpose by some users.

On day 10, Wednesday August 20, we have some results of BSBM explore.

Then, we get into BSBM updates. The BSBM generator makes an update dataset, but it cannot be made large enough. The BSBM test driver suite is by now hated and feared in equal measure. Is it bad in and of itself? Depends. It was certainly not made for large data. Anyway, no fix will be attempted this time. Instead, a couple of SQL procedures are made to drive a random update workload. These can run long enough to get a steady state with warm cache, which is what any OLTP measurement needs.

On day 12, some updates are measured, with a one hour ramp-up to steady-state, but these are not quite the right mix, since these are products only and the mix needs to contain offers and reviews also. The first steady-state rate was 109 Kt/s, a full 50x less than the bulk load, but then this was very badly bound by latency. So, the updates are adjusted to have more variety. The final measurement was on day 17. Now the steady-state rate is 2563 Kt/s, which is better but still quite bound by network. By adding diversity to the dataset, we get slammed by a sharp rise in warm-up time (now 2 hours to be at 230 Kt/s), at which point we launch the explore mix to be timed during update. Time is short and we do not want to find out exactly how long it takes to get the plateau in insert rate. As it happens, the explore mix is hardly slowed down by the updates, but the updates get hit worse, so that the rate goes to about 1/3 of what it was, then comes back up when the explore is finished. Finally, half an hour after this, there is a steady state of 263 Kt/s update rate.

Of course, the main object of the festivities is still the business intelligence (BI) mix. This is our (specifically, Orri's) own invention from years back, subsequently formulated in SPARQL by FU Berlin (Andreas Schultz). Well, it is already something to do big joins with 150 Gt, all on index and vectored random access, as was done in January 2013, the last time results were published on the CWI cluster. You may remember that there was an aborted attempt in January 2014. So now, with the LOD2 end date under two weeks away, we will take the BI racer out for a spin with 500 Gt. This is now a very different proposition from Jan 2013, as we have by now done the whole TPC-H work documented on this blog. This serves to show, inter alia, that we can run with the best in the much bigger and harder mainstream database sports. The full benefits of this will be realized for the semantic data public still this year, so this is more than personal vanity.

So we will see. The BI mix is not exactly TPC-H, but what is good for one is good for the other. Checking that the plans are good on the 37 Gt scale model is done around day 12. On day 13, we try this on the larger cluster. You never know — pushing the envelope, even when you know what you are doing and have written the whole thing, is still a dive in the fog. Claiming otherwise would be a lie lacking credibility. The iceberg which first emerges is overflow and partition skew. Well, there can be a lot of messages if all messages go via the same path. So we make the data structure different and retry and now die from out of memory. On the scale model, this looks like a little imbalance you don't bother to notice; at 13x scale, this kills. So, as is the case with most database problems, the query plan is bad. Instead of using a PSOG index, it uses a POSG index, and there is a constant for O. Partitioning is on either S or O, whichever is first. Not hard to fix, but still needs a cost-model adjustment to penalize low-cardinality partition columns. This is something you don't get with TPC-H, where there are hardly any indices. Once this is fixed there are other problems, such as Q5, which we ended up leaving out. The scale model is good; the large one does not produce a plan, because some search-space corner is visited that is not visited in the scale model, due to different ratios of things in the cost model. Could be a couple of days to track; this is complex stuff. So we dropped it. It is not a big part of the metric, and its omission is immaterial to the broader claim of handling 500 Gt in all safety and comfort. The moral is: never get stuck; only do what is predictable, insofar as anything in this shadowy frontier is such.

So, on days 15 and 16, the BI mix that is reported was run. The multiuser score was negatively impacted by memory skew, so some swapping on one of the nodes, but the run finished in about 2 hours anyway. The peak of transient memory consumption is another thing that you cannot forecast with exact precision. There is no model for that; the query streams are in random order, and you just have to try. And it is a few hours per iteration, so you don't want to be stuck doing that either. A rerun would get a higher multiuser BI score; maybe one will be made but not before all the rest is wrapped up.

Now we are talking 2 hours, versus 9 hours with the 150 Gt set back in January 2013. So 3.3x the data, 4.5x less time, 1.5x the gear. This comes out at one order of magnitude. With a better score from better memory balance and some other fixes, a 15x improvement for BSBM BI is in the cards.

The final explore runs were made on day 18, while writing the report to be published at the LOD2 deliverables repository. The report contains in depth discussion on the query plans and diverse database tricks and their effectiveness.

The overall moral of this trip into these uncharted spaces is this: Expect things to break. You have to be the designer and author of the system to take it past its limits. You will cut it or you won't, and nobody can do anything about it, not with the best intentions, nor even with the best expertise, which both were present. This is true of the last minute daredevil stuff like this; if you have a year full time instead of the last 20 days of a project, all is quite different, and these things are more leisurely. This might then become a committee affair, though, which has different problems. In the end, the Virtuoso DBMS has never thrown anything at us we could not handle. The uncertainty in trips of this sort is with the hardware platform, of which we had to replace 2 units to get on the way, and with how fast you can locate and fix a software problem. So you pick the quickest ones and leave the uncertain aside. There is another category of rare events like network failures that in theory cannot happen. Yet they do. So, to program a cluster, you have to have some recovery things for these. We saw a couple of these along the way. Duplication of these can take days, and whether this correlates with specific links or is a bona fide software thing is time consuming to prove, and getting into this is a sure way to lose the race. These seem to be load peaks outside of steady-state; steady-state is in fact very steady once it is there. Except at the start, network glitches were not a big factor in these experiments. The bulk of these went away after replacing a machine. After this we twice witnessed something that cannot exist but knew better than to get stuck with that. Neither incident happened again. This is days of running at a cross sectional 1 GB/s of traffic. These are the truly unpredictable, and, in a crash course like this, can sink the whole gig no matter how good you are.

Thanks are due to CWI and especially Peter Boncz for providing the race track as well as advice and support.

In the next installments of this series, we will look at how schema and characteristic sets will deliver the promise of RDF without its cost. All the experiments so far were done with a quads table, as always before. So we could say that the present level is close to the limit of the achievable within this physical model. The future lies beyond the misconception of triples/quads as primary physical model.

To be continued...

LOD2 Finale Series

# PermaLink Comments [0]
08/29/2014 12:49 GMT
LOD2 Finale (part 2 of n): The 500 Giga-triples

No epic is complete without a descent into hell. Enter the historia calamitatum of the 500 Giga-triples (Gt) at CWI's Scilens cluster.

Now, from last time, we know to generate the data without 10 GB of namespace prefixes per file and with many short files. So we have 1.5 TB of gzipped data in 40,000 files, spread over 12 machines. The data generator has again been modified. Now the generation was about 4 days. Also from last time, we know to treat small integers specially when they occur as partition keys: 1 and 2 are very common values and skew becomes severe if they all go to the same partition; hence consecutive small INTs each go to a different partition, but for larger ones the low 8 bits are ignored, which is good for compression: Consecutive values must fall in consecutive places, but not for small INTs. Another uniquely brain-dead feature of the BSBM generator has also been rectified: When generating multiple files, the program would put things in files in a round-robin manner, instead of putting consecutive numbers in consecutive places, which is how every other data generator or exporter does it. This impacts bulk load locality and as you, dear reader, ought to know by now, performance comes from (1) locality and (2) parallelism.

The machines are similar to last time: each a dual E5 2650 v2 with 256 GB RAM and QDR InfiniBand (IB). No SSD this time, but a slightly higher clock than last time; anyway, a different set of machines.

The first experiment is with triples, so no characteristic sets, no schema.

So, first day (Monday), we notice that one cannot allocate more than 9 GB of memory. Then we figure out that it cannot be done with malloc, whether in small or large pieces, but it can with mmap. Ain't seen that before. One day shot. Then, towards the end of day 2, load begins. But it does not run for more than 15 minutes before a network error causes the whole thing to abort. All subsequent tries die within 15 minutes. Then, in the morning of day 3, we switch from IB to Gigabit Ethernet (GigE). For loading this is all the same; the maximal aggregate throughput is 800 MB/s, which is around 40% of the nominal bidirectional capacity of 12 GigE's. So, it works better, for 30 minutes, and one can even stop the load and do a checkpoint. But after resuming, one box just dies; does not even respond to ping. We change this to another. After this, still running on GigE, there are no more network errors. So, at the end of day 3, maybe 10% of the data are in. But now it takes 2h21min to make a checkpoint, i.e., make the loaded data durable on disk. One of the boxes manages to write 2 MB/s to a RAID-0 of three 2 TB drives. Bad disk, seen such before. The data can however be read back once the write is finally done.

Well, this is a non-starter. So, by mid-day of day 4, another machine has been replaced. Now writing to disk is possible within expected delays.

In the afternoon of day 4, the load rate is about 4.3 Mega-triples (Mt) per second, all going in RAM.

In the evening of day 4, adding more files to load in parallel increases the load rate to between 4.9 and 5.2 Mt/s. This is about as fast as this will go, since the load is not exactly even. This comes from the RDF stupidity of keeping an index on everything, so even object values where an index is useless get indexed, leading to some load peaks. For example, there is an index on POSG for triples were the predicate is rdf:type and the object is a common type. Use of characteristic sets will stop this nonsense.

But let us not get ahead of the facts: At 9:10 PM of day 4, the whole cluster goes unreachable. No, this is not a software crash or swapping; this also affects boxes on which nothing of the experiment was running. A whole night of running is shot.

A previous scale model experiment of loading 37.5 Gt in 192 GB of RAM, paging to a pair of 2 TB disks, has been done a week before. This finishes in time, keeping a load rate of above 400 Kt/s on a 12-core box.

At 10AM on day 5 (Friday), the cluster is rebooted; a whole night's run missed. The cluster starts and takes about 30 minutes to get to its former 5 Mt/s load rate. We now try switching the network back to InfiniBand. The whole ethernet network seemed to have crashed at 9PM on day 4. This is of course unexplained but the experiment had been driving the ethernet at about half its cross-sectional throughput, so maybe a switch crashed. We will never know. We will now try IB rather than risk this happening again, especially since if it did repeat, the whole weekend would be shot, as we would have to wait for the admin to reboot the lot on Monday (day 8).

So, at noon on day 5, the cluster is restarted with IB. The cruising speed is now 6.2 Mt/s, thanks to the faster network. The cross sectional throughput is about 960 MB/s, up from 720 MB/s, which accounts for the difference. CPU load is correspondingly up. This is still not full platform since there is load unbalance as noted above.

At 9PM on day 5, the rate is around 5.7 Mt/s with the peak node at 1500% CPU out of a possible 1600%. The next one is under 800%, which is just to show what it means to index everything. In specific, the node that has the highest CPU is the one in whose partition the bsbm:offer class falls, so that there is a local peak since one of every 9 or so triples says that something is an offer. The stupidity of the triple store is to index garbage like this to begin with. The reason why the performance is still good is that a POSG index where P and O are fixed and the S is densely ascending is very good, with everything but the S represented as run lengths and the S as bitmaps. Still, no representation at all is better for performance than even the most efficient representation.

The journey consists of 3 different parts. At 10PM, the 3rd and last part is started. The triples have more literals, but the load is more even. The cruising speed is 4.3 Mt/s down from 6.2, but the data has a different shape, including more literals.

The last stretch of the data is about reviews. This stretch of the data has less skew. So we increase parallelism, running 8 x 24 files at a time. The load rate goes above 6.3 Mt/s.

At 6:45 in the morning of day 6, the data is all loaded. The count of triples is 490.0 billion. If the load were done in a single stretch without stops and reconfiguration, it would likely go in under 24h. The average rate for a 4 hour sample between midnight and 4AM of day 6 is 6.8 MT/s. The resulting database files add up to 10.9 TB, with about 20% of the volume in unallocated pages.

At this time, noon of day 6, we find that some cross-partition joins need more distinct pieces of memory than the default kernel settings allow per process. A large number of partitions makes a large number of sometimes long messages which makes many mmaps. So we will wait until morning of day 8 (Monday) for the administrator to set these. In the meantime, we analyze the behavior of the workload on the 37 Gt scale model cluster on my desktop.

To be continued...

LOD2 Finale Series

# PermaLink Comments [0]
08/18/2014 16:54 GMT Modified: 08/29/2014 12:50 GMT
LOD2 Finale (part 1 of n): RDF Before The Dawn

The LOD2 FP7 ends at the end of August, 2014. This post begins a series that will crown the project with a grand finale, another decisive step towards the project’s chief goal of giving RDF and linked data performance parity with SQL systems.

In a nutshell, LOD2 went like this:

  1. Triples were done right, taking the best of the column store world and adapting it to RDF. This is now in widespread use.

  2. SQL was done right, as I have described in detail in the TPC-H series. This is generally available as open source in v7fasttrack. SQL is the senior science and a runner-up like sem-tech will not carry the day without mastering this.

  3. RDF is now breaking free of the triple store. RDF is a very general, minimalistic way of talking about things. It is not a prescription on how to do database. Confusing these two things has given rise to RDF’s relative cost against alternatives. To cap off LOD2, we will have the flexibility of triples with the speed of the best SQL.

In this post we will look at accomplishments so far and outline what is to follow during August. We will also look at what in fact constitutes the RDF overhead, why this is presently so, and why this does not have to stay thus.

This series will be of special interest to anybody concerned with RDF efficiency and scalability.

At the beginning of LOD2, I wrote a blog post discussing the RDF technology and its planned revolution in terms of the legend of Perseus. The classics give us exemplars and archetypes, but actual histories seldom follow them one-to-one; rather, events may have a fractal nature where subplots reproduce the overall scheme of the containing story.

So it is also with LOD2: The Promethean pattern of fetching the fire (state of the art of the column store) from the gods (the DB world) and bringing it to fuel the campfires of the primitive semantic tribes is one phase, but it is not the totality. This is successfully concluded, and Virtuoso 7 is widely used at present. Space efficiency gains are about 3x over the previous, with performance gains anywhere from 3 to 100x. As pointed out in the Star Schema Benchmark series (part 1 and part 2), in the good case one can run circles in SPARQL around anything but the best SQL analytics databases.

In the larger scheme of things, this is just preparation. In the classical pattern, there is the call or the crisis: Presently this is that having done triples about as right as they can be done, the mediocre in SQL can be vanquished, but the best cannot. Then there is the actual preparation: Perseus talking to Athena and receiving the shield of polished brass and the winged sandals. In the present case, this is my second pilgrimage to Mount Database, consisting of the TPC-H series. Now, the incense has been burned and libations offered at each of the 22 stations. This is not reading papers, but personally making one of the best-ever implementations of this foundational workload. This establishes Virtuoso as one of the top-of-the-line SQL analytics engines. The RDF public, which is anyway the principal Virtuoso constituency today, may ask what this does for them.

Well, without this step, the LOD2 goal of performance parity with SQL would be both meaningless and unattainable. The goal of parity is worth something only if you compare the RDF contestant to the very best SQL. And the comparison cannot possibly be successful unless it incorporates the very same hard core of down-to-the-metal competence the SQL world has been pursuing now for over forty years.

It is now time to cut the Gorgon’s head. The knowledge and prerequisite conditions exist.

The epic story is mostly about principles. If it is about personal combat, the persons stand for values and principles rather than for individuals. Here the enemy is actually an illusion, an error of perception, that has kept RDF in chains all this time. Yes, RDF is defined as a data model with triples in named graphs, i.e., quads. If nothing else is said, an RDF Store is a thing that can take arbitrary triples and retrieve them with SPARQL. The naïve implementation is to store things as rows in a quad table, indexed in any number of ways. There have been other approaches suggested, such as property tables or materialized views of some joins, but these tend to flush the baby with the bathwater: If RDF is used in the first place, it is used for its schema-less-ness and for having global identifiers. In some cases, there is also some inference, but the matter of schema-less-ness and identifiers predominates.

We need to go beyond a triple table and a dictionary of URI names while maintaining the present semantics and flexibility. Nobody said that physical structure needs to follow this. Everybody just implements things this way because this is the minimum that will in any case be required. Combining this with a SQL database for some other part of the data/workload hits basically insoluble problems of impedance mismatch between the SQL and SPARQL type systems, maybe using multiple servers for different parts of a query, etc. But if you own one of the hottest SQL racers in DB city and can make it do anything you want, most of these problems fall away.

The idea is simple: Put the de facto rectangular part of RDF data into tables; do not naively index everything in places where an index gives no benefit; keep the irregular or sparse part of the data as quads. Optimize queries according to the table-like structure, as that is where the volume is and where getting the best plan is a make or break matter, as we saw in the TPC-H series. Then, execute in a way where the details of the physical plan track the data; i.e., sometimes the operator is on a table, sometimes on triples, for the long tail of exceptions.

In the next articles we will look at how this works and what the gains are.

These experiments will for the first time showcase the adaptive schema features of the Virtuoso RDF store. Some of these features will be commercial only, but the interested will be able to reproduce the single server experiments themselves using the v7fasttrack open source preview. This will be updated around the second week of September to give a preview of this with BSBM and possibly some other datasets, e.g., Uniprot. Performance gains for regular datasets will be very large.

To be continued...

LOD2 Finale Series

# PermaLink Comments [0]
08/18/2014 16:54 GMT Modified: 08/29/2014 12:50 GMT
In Hoc Signo Vinces (part 15 of n): TPC-H and the Science of Hash

This piece is dedicated to Peter Boncz, architect of Actian Vector and MonetDB.

Query optimization is hard. It is a set of mutually interacting tricks and special cases. Execution is also hard, but there the tricks do not interact quite as much or as unpredictably. So, if there is a few percent of score to be had from optimization of either execution or query, I will take execution first. It is less likely to break things and will probably benefit a larger set of use cases.

As we see from the profile in the previous article, hash join is the main piece of execution in TPC-H. So between the article on late projection and the first result preview, I changed the hash table used in HASH JOIN and GROUP BY from cuckoo to linear.

Let's see how the hash tables work: Cuckoo hash is a scheme where an entry can be in one of two possible places in the table. If a new entry is inserted and either of the possible places is unoccupied, it goes there. If both are occupied, it could be that one contains an entry whose other possible location is free -- and then that entry may be relocated. Thus an insert may push the previous occupant of the place somewhere else, which in turn may push another, and so on. It may happen that an insert is still not possible, in which case the entry to insert goes into an exceptions list.

To look up an entry, you get a hash number, and use different fields of it to pick the two places. Look in one, then the other, then the exceptions. If there is no match and the table is reasonably close to capacity, you will have looked in at least 3 widely-separated places to determine the absence of a match. In practice, the hash table consists on a prime number of distinct arrays of a fixed size (partitions), and each partition has its own exception list. A modulo of the hash number picks the array, then two further modulos of different parts of the number pick the places in the array.

In most cases in TPC-H, the hash joins are selective; i.e., most items on the probe side find no match in the hash table.

So, quite often you have 3 cache misses to show that there is no hit. This is, at least in theory, quite bad.

There are Bloom filters before the hash table. The Bloom filter will prune most of the probes that would miss. A Bloom filter is an array of bits. Given a hash number, the Bloom filter will very efficiently tell you whether the entry is sure not to be in the hash table. If the Bloom filter says it can be in the hash table, you must look.

In the Virtuoso case, for each entry in the hash table, the Bloom filter has 8 bits. The Bloom check uses a field of the hash number to pick a 64-bit word from the Bloom filter. Then different fields of the hash number are used to set up-to 4 bits in a 64-bit bit-mask. When building the hash table, the masks are OR-ed into the Bloom filter. When probing, before looking in the hash table, the system checks to see if the bits corresponding to the hash number are all on in the appropriate word. If they are not, the hash lookup is sure to miss.

Most expositions of Bloom filters talk about setting two bits for every value. With two bits set, we found 8 bits-per-value to work best. More bits makes a larger filter and misses the cache more; fewer bits makes too many collisions, and the Bloom filter produces too many false positives. A significant finding is that with 8 bits-per-value, setting 4 bits instead of 2 causes the filter to be twice as selective. The simple trick of setting 4 bits cuts the number of hash lookups for items that passed the Bloom filter to half in many selective hash joins. Examples are the many joins of lineitem and part or supplier where there is a condition on the smaller table.

Still, even with Bloom filters, a cuckoo hash will make too many cache misses.

So, enter linear hash. The idea is simple: The hash number picks a place in an array. Either the entry being sought is in the vicinity, or it is not in the hash table. If the vicinity is full of other entries, the entry can still be in an exception list.

With this and cuckoo alike, there are 3 different variants of hash table:

  1. A set of single unique integers

  2. A single-integer key with 0 or more dependent values, possibly with a next link if the key is not unique

  3. A key of n arbitrary values, 0 or more dependent values, optional next link if the key is not unique

In the first case, the hash table is an array of values; in the two other cases, it is an array of pointers. But since a pointer is 64 bits, of which the high 16 are not in the address space of x86_64, these high bits can be used to keep a part of the hash number. It will be necessary to dereference the pointer only if the high bits match the hash number. This means that nearly all lookups that do not find a match are handled with a single cache miss.

Each cache miss brings in a cache line of 8 words. The lookup starts at a point given by the hash number and wraps around at the end of the cache line. Only in the case that all 8 words are occupied but do not match does one need to look at the exceptions. There is one exception list for each partition of the hash table, like in the cuckoo scheme.

A hash lookup is always done on a vector of keys; the loop that takes most of the time is in fact the Bloom filter check. It goes as follows:

#define CHB_VAR(n)                           \
   uint64 h##n, w##n, mask##n;

#define CHB_INIT(n, i)                       \
   MHASH_STEP_1 (h##n, i);                   \
   w##n = bf[BF_WORD (h##n, sz)];            \
   mask##n = BF_MASK (h##n);

#define CHB_CK(n)                            \
   { matches[mfill] = inx + n;               \
     mfill += (w##n & mask##n) == mask##n; }

       for (inx = inx; inx < last; inx ++)
           CHB_VAR (0);
           CHB_INIT (0, REF ((ce_first + sizeof (ELT_T) * inx)));
           CHB_CK (0);

This is the perfect loop for out-of-order execution. Now, I have tried every variation you can imagine, and this does not get better. The loop calculates a hash number, fetches the corresponding word from the Bloom filter, calculates a mask, stores the index of the key in a results array, and increments the results counter if all the bits were set. There is no control dependency anywhere, just a data dependency between successive iterations; i.e., to know where the result must go, you must know if the previous was a hit.

You can unroll this loop very easily, so, for example, take 4 keys, do the numbers, fetch the words, and then check them one after the other. One would think this would have more misses in flight at any one time, which it does. But it does not run any faster.

Maybe the loop is too long. Circumstantial evidence suggests that short loops are better for instruction prefetching. So, one can also make a loop that gets any number of words of the Bloom filter and puts them in one local array and the hash numbers in another array. A subsequent loop then reads the hash number, calculates the mask, and checks if there is a hit. In this way one can generate as many misses as one wants and check them as late as one wants. It so happens that doing 8 misses and then checking them is better than either 4 or 16. But 8 is still marginally worse than the loop first mentioned.

One can also vary the test. Instead of adding a truth value to the result counter, one can have

if ((word & mask) == mask) result[fill++] = inx;

There is no clear difference between predication (incrementing the fill by truth value) and a conditional jump. The theory of out-of-order execution would predict predication to be better, but the difference is lost in measurement noise. This is true on both Intel Nehalem (Xeon 55xx) and Sandy Bridge (E5 26xx), but could be different on other architectures.

The multicore scalability of the test will give some information about platform utilization.

This is the ultimately simplified selective hash join:

  FROM  lineitem, 
 WHERE  l_partkey = p_partkey 
   AND     p_size < 15

This is the second simplest hash join but misses the cache much more; since this now has a key and a dependent part in the hash table, there is an extra pointer to follow, and the hash entry is two words plus the pointer to these in the hash table array.

SELECT  SUM (p_retailprice) 
  FROM  lineitem, 
 WHERE  l_partkey = p_partkey 
   AND     p_size < 15

By adjusting the number of parts selected, we can vary the Bloom filter selectivity and the size of the hash table. Below, we show the times for the two queries with single-thread and 24-thread execution, with different percentages of the part table on the build side of the hash join. All runs are against warm 100G TPC-H on the same test system as in the rest of the TPC-H series (dual Xeon E5-2630).

This table compares the performance of the linear and cuckoo implementations on the above queries (count vs. sum) on either 24 threads or 1 thread. Four data points are given for different sizes of hash table, given as percentage of the part table (having 400K - 20M entries) in the hash table. The rightmost column, which represents the case where the entire part table is on the build side does not have a Bloom filter; the other cases do. The Bloom bits are 8/4 for linear and 8/2 for cuckoo. The times are all in milliseconds, and the thousands separator is a comma.

Hash type Query type Threads 2%
Linear COUNT 24 1,204 1,683 3,100 6,214
Linear SUM 24 1,261 2,447 5,059 13,086
Linear COUNT 1 15,286 22,451 38,863 66,722
Linear SUM 1 17,575 33,664 81,927 179,013
Cuckoo COUNT 24 1,849 2,840 4,105 6,203
Cuckoo SUM 24 2,833 4,903 9,446 19,652
Cuckoo COUNT 1 25,146 39,064 57,383 85,105
Cuckoo SUM 1 33,647 67,089 121,989 240,941

We clearly see cache effects on the two first lines, where the SUM and COUNT run in almost the same time on a small hash table but have a 2x difference on the larger hash table. The instruction path length is not very different for SUM and COUNT, but the memory footprint has a 3x difference.

We note that the SMP scalability of linear is slightly better, contrasting the ratio of 24-thread SUM to single-thread SUM. Both numbers are over 12x, indicating net benefit from core multithreading. (The test system has 12 physical cores.) The linear hash systematically outperforms cuckoo, understandably, since it makes a smaller net number of cache misses. The overall effect on the TPC-H score is noticeable, at around 15-20K units of composite score at 100G.

In conclusion, the Virtuoso hash join implementation is certainly on the level, with only small gains to be expected from further vectoring and prefetching. These results may be reproduced using the v7fasttrack Virtuoso Open Source releases from GitHub; develop/7 for cuckoo and feature/analytics for linear hash.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

# PermaLink Comments [0]
05/28/2014 17:11 GMT
In Hoc Signo Vinces (part 14 of n): Virtuoso TPC-H Implementation Analysis

In the previous article we saw an unofficial result of running the full workload. Here we will look more closely at the performance profile.

In this article we look at what the server actually does. The execution profiles for all the queries are available for download. To experiment with parallelism, you may download the software and run it locally. An Amazon image may be provided later.

Execution Profile

Below is the top of the oprofile output for a run of the 22 queries with qualification parameters against the 100G database. The operation in TPC-H terms is given under each heading.

CPU: Intel Sandy Bridge microarchitecture, speed 2299.98 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1406009   9.5117  ce_vec_int_range_hash
-- Bloom filter before selective hash join where selective hash join is best or only condition on a scan, e.g., lineitem scan where l_partkey checked against a Bloom filter
730935    4.9448  ce_vec_int_sets_hash
-- Bloom filter check where another condition is applied first, e.g., lineitem scan with condition on l_shipdate, then Bloom filter check on l_partkey
617091    4.1746  hash_source_chash_input_1i_n
-- Q13 right outer hash join with probe from orders, build from customer, NOT EXISTS test in Q16
586938    3.9706  cs_decode
-- Generic reading of a column, all queries
536273    3.6279  ce_intd_range_ltgt
-- Date range comparison, most queries, e.g., Q1, 3, 4, 5, 6, 7, 8, 20
479898    3.2465  cha_cmp_1i
-- Q13 GROUP BY on c_custkey, Q15 GROUP BY on S_suppkey. Indicates missing the cache on high cardinality GROUP BY single INT grouping key
473721    3.2047  cha_inline_1i_n_int
-- Selective hash join after prefiltering with Bloom filter. Check only that key in hash table, no dependent part. For example Q8, Q9, Q17, Q20, with lineitem filtered by part
463723    3.1371  ce_dict_generic_range_filter
-- Range condition on low cardinality column (dictionary encoded), e.g., l_quantity, l_discount
425149    2.8761  cha_inline_1i_int
-- Hash join check that a single INT key with a dependent part is in a hash table, fetch the dependent part for hits
365040    2.4695  setp_chash_run
-- GROUP BY, e.g., GROUP BY on c_custkey in Q13
359645    2.4330  clrg_partition_dc
-- Partitioning a vector of values, occurs in all high cardinality GROUP BYs, e.g., Q13, Q15
349473    2.3642  gb_aggregate
-- Updating aggregates after the grouping keys are resolved, e.g., Q1
331926    2.2455  ce_dict_any_sets_decode
-- Fetching non-contiguous dictionary encoded strings, e.g., l_returnflag, l_linestatus in Q1
316731    2.1427  cha_insert_1i
-- Building a hash join hash table from a 1 INT key to a dependent part, e.g., Q14 from l_partkey to lineitems in a time window with a given l_partkey
286865    1.9406  ce_search_rld
-- Index lookup for run length delta compressed keys, e.g., l_orderkey, ps_partkey
231390    1.5654  cha_insert
-- Build of a hash join hash table for multipart keys, e.g., hash join with partsupp (Q9) or lineitem (Q17, Q20) on build side
224070    1.5158  ce_dict_int64_sets_decode
-- Fetching a non-contiguous set of double column values from dictionary encoding, e.g., l_discount, l_quantity
218506    1.4782  ce_intd_any_sets_decode
-- Fetching a non-contiguous set of date column values, e.g., l_shipdate in Q9
200686    1.3576  page_wait_access
-- Translating page numbers into buffers in buffer pool for column access
198854    1.3452  itc_col_seg
-- Generic part of table scan/index access
197645    1.3371  cha_insert_1i_n
-- Hash join build for hash tables with a 1 INT key and no dependent, e.g., lineitem to part join in Q9, Q17, Q20
195300    1.3212  cha_bloom_unroll_a
-- Bloom filter for hash based in predicate
192309    1.3010  hash_source_chash_input
-- Hash join probe with multi-part key, e.g., Q9 against partsupp
191313    1.2942  strstr_sse42
-- SSE 4.2 substring match, e.g. NOT LIKE condition in Q13
186325    1.2605  itc_fetch_col_vec
-- Translating page numbers into buffers in buffer pool for column access
159198    1.0770  ce_vec_int64_sets_decode
-- Fetching non-contiguous 64-bit values from array represented column, e.g., l_extendedprice

So what does TPC-H do? It is all selective hash joins. Then it is table scans with a condition on a DATE column. The DATE column part is because this implementation does not order the big tables (lineitem, orders) on their DATE columns. Third, TPC-H does big GROUP BYs, with Q13 representing most of this; all other GROUP BYs have at least 10x fewer groups. Then it just extracts column values, most often after selecting the rows on some condition; quite often the condition is a foreign key column of the row finding a hit in a hash table. This last is called invisible hash join.

Then there are some index lookups, but there the join is usually a merge pattern, like reading orders in order of o_orderkey and then getting l_orderkey matches from lineitem (e.g., Q21). Then there are quite often partitioning operations, i.e., many threads produce tuples and these tuples go to a consumer thread selected based on a partitioning column on the tuple. This is also called exchange. This is in all the high cardinality GROUP BYs and in the RIGHT OUTER JOIN of Q13.

Whether the implementation is RAM-only or paging from disk makes next to no difference. Virtuoso and Actian Vector both have a buffer pool, with Virtuoso using substantially smaller pages (8K vs. 256K or 512K). The page-number-to-buffer translation and the latching that goes with it is under 3% (itc_fetch_col_vec, page_wait_access). Of course, actually accessing secondary storage will kill the score, but checking that something is in memory is safe as long as this is always a hit.

So, once the query plans are right, the problem resolves into a few different loops. The bulk of the effort in making a TPC-H implementation is in query optimization so that the right loops run in the right order. I will further explain what the loops should contain in the next article.

Many of the functions seen in the profile are instantiations of a template for a specific data type. There are also different templates for variants of a data structure, like a hash table with different keys.

Compilation is a way of generating exactly the right loop for any set of data types. We do not find much need for this here, though. The type-specific operations that are in fact needed are anticipatable and can be predefined based on templates.

Future Gains

There are possible gains in the following domains:

  • Query optimization - Q15, Q17 do some work in duplicate, so reuse will roughly cut the time in half. In addition to reuse, there is a possibility to convert a scalar subquery into a derived table with a GROUP BY (decorrelation). Decorrelation also applies in Q20. Reuse will give some 10-12K more score; decorrelation is probably below measurement noise.

  • Group join - Merging the GROUP BY and the hash probe in Q13 will save 2% of time on throughput, maybe 5K more score.

  • Better parallelism in power test - A good power test takes 42s, and 5 times the work takes 165s, so the platform utilization of the power test is roughly 4/5. This could be slightly better.

  • NUMA - Up to 20% performance gains have been seen on scale-out configurations from binding a process to one CPU socket. This gain will be readily seen in a scale-out setting when running one process per physical CPU. Gains in the NUMA department are rather fickle and fragile and are worth less in real life than in benchmarks. So I would say that work in single-server NUMA optimization is not immediately needed since scale-out configurations will get these gains as soon as one sets affinities for processes. But whether binding processes to CPUs makes sense depends on how even the workload is. TPC-H is even, but reality is less so.

In principle, a composite score of a good run could go from 250K to 280K by diverse incremental improvements. There is some noise in run times, so two consecutive runs can deviate by a few percent, which is already significant when talking of small improvements.

Larger Scales

For a 100 GB run with 5 throughput streams, the peak query memory consumption is 3.5 GB. This includes all the GROUP BY and hash join tables that are made. Q13 is the most expensive in terms of space and allocates a peak of 1 GB for a single execution. This is trivial in comparison with the working set. But at 100x greater scale (10,000 GB or 10 TB) this becomes about 630 GB, as there will be a minimum of 9 concurrent streams instead of 5.

Now 10 TB is clearly a scale-out size, so the 630 GB transient memory is not on a single machine.

Still going to large scale-out will change the nature of some queries and introduce significant data transfer. We will know soon enough.

Problems of the Metric

A small-scale TPC-H, e.g., 100 GB, starts to have features of a lookup workload. This means that there is high variability between consecutive executions, and that the pre-execution state of the system does have large effect on a measurement.

The rules say that power test follows bulk load with maybe some checking of correctness of load in between. The bulk load is basically unregulated and usually will include statistics gathering.

The first power test shows significant variation, anything from 220 K to 240 K, while the second power test is steadily around 255 K. Since the reported score is the lower of the two, the biggest return to the implementor is in making sure the first power test is good.

The second throughput test is usually 5-10 K higher than the first; the throughput test is less sensitive. The difference does not come from I/O, but from system time for memory allocation. The memory and the same quantities and block sizes is reused by the second run.

A good power run is 42s from 100% warm RAM. A power run that gets the data from the OS disk buffers is 70s or so. A power run that gets data from SSD is worse, maybe 120s.

To cite an example, increasing the buffer pool size from 64 GB to 72 GB gets the first post-load power test from 120-150K to 230-240K while having no appreciable effect on subsequent tests. The effect is exacerbated by the fact that the power score is based on a geometric mean of run times. Very short queries (e.g., Q2) vary between consecutive in-memory executions from 120ms to 220ms. A similar variation occurs in Q13 which is on either side of 6s. Due to geometric mean, the same variability has very different impact depending on which query it hits. A Q2 that reads data from out of process can take 2s instead of the expected under 200ms. This kills the score even if a delay of 1.8s as such did not. So increasing the buffer pool in the example just serves to make sure the small supplier table is in memory. Fetching it from the OS is simply not an option in the first Q2 even if it were an option in a longer running query. Remember, the the lower of the two scores is reported, and the first power test will be bad unless it is somehow primed by some trick like bulk load order.

When differences between implementations are small, variation between consecutive runs becomes important. This is why OLTP benchmarks are required to run for a relatively long time and only measure the steady state portion. This would also be appropriate for small TPC-H runs.


Query performance reduces to a few loops when all the conditions are right. Getting the conditions to be right depends on query optimization and the right architecture choices. These results would be unthinkable without vectored execution and a compressed column store design. These in and of themselves guarantee nothing unless all plans are right. A previous Virtuoso 7 takes over 10x longer because of bad plans and missing some key execution techniques like the RIGHT OUTER JOIN in Q13 and partitioned GROUP BY.

Last summer, we did runs of the much simpler Star Schema Benchmark. The results were very good, because the engine had the much smaller set of tricks and the right plans. Repeating these tests now would show some gains from a still better hash join but nothing dramatic.

In the next article we will look at the finer points of hash join. After this we move to larger scales and clusters.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

# PermaLink Comments [0]
05/01/2014 13:03 GMT Modified: 05/28/2014 17:12 GMT
In Hoc Signo Vinces (part 13 of n): Virtuoso TPC-H Kit Now on V7 Fast Track

This article shows how you can reproduce the TPC-H experiments from the previous posts in this series.

All the code is in the feature/analytics branch of the v7fasttrack git repository on GitHub.


Start by checking out and compiling Virtuoso Open Source (VOS).

git clone
cd virtuoso-opensource 
git checkout feature/analytics
export CFLAGS="-msse4.2 -DSSE42"
make -j 24
make install

The system should be an x86_64, Intel Core i7 or later, with SSE 4.2 support. (Running without SSE 4.2 is possible, but for better score you need to define it before doing configure.) The gcc may be any that supports SSE 4.2.

To have a good result, the system should have at least 96 GB of RAM and SSD.

To get a good load time, both the database files and the CSV files made by dbgen should be on SSD.

Running TPC-H

Set up

Copy the binsrc/tests/tpc-h directory from the check-out to an SSD, if it is not already on one. Set $HOME to point to the root directory of the check-out. Rename the virtuoso.ini-100G in the tpc-h directory to virtuoso.ini. Edit the database file paths in the virtuoso.ini. Where it says --

Segment1 = 1024, /1s1/dbs/tpch100cp-1.db = q1, /1s2/dbs/tpch100cp-2.db = q2

-- change the file names, and add or remove files as appropriate, each file with a different = qn, until you have one file per independent device. If this is a RAID, one file per distinct device in the RAID usually brings improvement. Edit the TransactionFile entry, and replace /1s2/dbs/ with a suitable path.

Edit ThreadsPerQuery to be the number of threads on the machine. For i7, this is double the number of cores; your environment may vary. AsyncQueueMaxThreads should be set to double ThreadsPerQuery.

The memory settings (NumberOfBuffers and MaxDirtyBuffers) are OK for 100G scale. For larger scale, make the memory settings correspondingly larger, not to exceed 75% of system memory. Count 8.5 KB per buffer. If you have less memory, you can decrease these. If so, the first power test will be hit the worst, so the scores will not be as good.

The default BIOS settings are usually OK. Disabling prefetch of adjacent cache line does not help, and turning off core threads does not help either.

For 100G, the data files for loading will take 100 GB, and the database files will take 88 GB divided among however many files. Be sure there is enough space before you start.

Generate the data

In the tpc-h directory (copy of binsrc/tests/tpc-h) --

./ 100 5 2

The first parameter is the scale factor; the second is the number of streams to use in the throughput test; the last is the number of consecutive test runs. The minimum number of streams is 5 for 100G; each successive scale adds one more. A larger number of streams is allowed, but will not make a better result in this case. A test always consists of 2 runs; you could specify more, but the extra tests will not influence the score.

Making the data files takes the longest time. You may run dbgen multithreaded to make the dataset in parallel, but then the load scripts will have to be changed to match.

Run the load

Start Virtuoso.

./ 100

Looking at iostat, you will see a read rate of about 140 MB/s from the source files.

Run the test

./ 100 5 2 

The parameters have the same meaning as in, and the same values must be specified.


The run produces two main files, report1.txt and report2.txt. These are the numerical quantity summaries for the first and second run. Additionally, there are output files for each query stream. The script can be used to collect the supporting files archive for a TPC-H full disclosure report.

The database is left running. To reuse the loaded data for another experiment, kill the Virtuoso process, delete the transaction log, and restart. This will have the data in the post-load state. To get warm cache, use the warm.sql script in the tpc-h directory.

On the test system used in this series, 12 core E5 at 2.3GHz, we expect 240K for the first run and 250K for the second. With a top-of-the-line E5, we expect around 400K. For an 8-core 2.26GHz Nehalem, we expect 150K.

If you get scores that are significantly different, something is broken; we would like to know about this.

If you have this audited according to the TPC rules, you will be allowed to call this a TPC-H result. Without such audit, the result should be labeled Virt-H.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

# PermaLink Comments [0]
04/28/2014 12:09 GMT Modified: 05/28/2014 17:12 GMT
In Hoc Signo Vinces (part 12 of n): TPC-H: Result Preview

In this article, we look at the 100 GB single-server results for the whole workload. We will call this Virt-H instead of TPC-H in order to comply with the TPC rules: Use of the TPC-H label requires an audit.

The test consists of a bulk load followed by two runs. Each run consists of a single user power test and a multi-user throughput test. The number of users in the throughput test is up to the test sponsor but must be at least 5 for the 100 GB scale. The reported score is the lower of the two scores.

Result Summary

Scale Factor 100 GB
dbgen version 2.15
Lload time 0:15:02
Composite qph 241,482.3
System Availability Date 2014-04-22

The price/performance is left open. The hardware costs about 5000 euros and the software is open source so the cost per performance would be a minimum of 0.02 euros per qph at 100G. This is not compliant with the TPC pricing rules though. These require 3 year maintenance contracts for all parts.

The software configuration did not use RAID. Otherwise the software would be auditable to the best of my knowledge. The hardware would have to be the same from Dell, HP, or other large brand to satisfy the TPC pricing rule.

Executive Summaries of Each Run

Run 1

Report Date 2014-04-21
Database Scale Factor 100
Total Data Storage/Database Size 1 TB / 87,496 MB
Start of Database Load 2014-04-21 21:02:43
End of Database Load 2014-04-21 21:17:45
Database Load Time 0:15:02
Query Streams for Throughput Test 5
Virt-H Power 239,785.1
Virt-H Throughput 243,191.4
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 241,482.3
Measurement Interval in Throughput Test (Ts) 162.935000 seconds

Duration of stream execution

  Start Date/Time End Date/Time Duration
Stream 0 2014-04-21 21:17:46 2014-04-21 21:18:33 0:00:47
Stream 1 2014-04-21 21:18:33 2014-04-21 21:21:13 0:02:40
Stream 2 2014-04-21 21:18:33 2014-04-21 21:21:13 0:02:40
Stream 3 2014-04-21 21:18:33 2014-04-21 21:21:06 0:02:33
Stream 4 2014-04-21 21:18:33 2014-04-21 21:21:10 0:02:37
Stream 5 2014-04-21 21:18:33 2014-04-21 21:21:16 0:02:43
Refresh 0 2014-04-21 21:17:46 2014-04-21 21:17:49 0:00:03
  2014-04-21 21:17:50 2014-04-21 21:17:51 0:00:01
Refresh 1 2014-04-21 21:19:25 2014-04-21 21:19:38 0:00:13
Refresh 2 2014-04-21 21:18:33 2014-04-21 21:18:48 0:00:15
Refresh 3 2014-04-21 21:18:49 2014-04-21 21:19:01 0:00:12
Refresh 4 2014-04-21 21:19:01 2014-04-21 21:19:13 0:00:12
Refresh 5 2014-04-21 21:19:13 2014-04-21 21:19:25 0:00:12

Numerical Quantities Summary -- Timing Intervals in Seconds

  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.311882 0.383459 1.143286 0.439926 1.594027 0.736482 1.440826 1.198925
Stream 1 5.192341 0.952574 6.184940 1.194804 6.998207 5.122059 5.962717 6.773401
Stream 2 7.354001 1.191604 4.238262 1.770639 5.782669 1.357578 4.034697 6.354747
Stream 3 6.489788 1.585291 4.645022 3.358926 7.904636 3.220767 5.694622 7.431067
Stream 4 5.609555 1.066582 6.740518 2.503038 9.439980 3.424101 4.404849 4.256317
Stream 5 10.346825 1.787459 4.391000 3.151059 4.974037 2.932079 6.191782 3.619255
Min Qi 5.192341 0.952574 4.238262 1.194804 4.974037 1.357578 4.034697 3.619255
Max Qi 10.346825 1.787459 6.740518 3.358926 9.439980 5.122059 6.191782 7.431067
Avg Qi 6.998502 1.316702 5.239948 2.395693 7.019906 3.211317 5.257733 5.686957
  Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 4.476940 2.004782 2.070967 1.015134 7.995799 2.142581 1.989357 1.581758
Stream 1 11.351299 6.657059 7.719765 5.157236 25.156379 8.566067 7.028898 8.146883
Stream 2 13.954105 8.341359 10.265949 3.289724 25.249435 6.370577 11.262650 7.684574
Stream 3 13.597277 5.783821 5.944240 5.214661 24.253991 8.742896 7.701709 5.801641
Stream 4 15.612070 6.126494 4.533748 5.733828 23.021583 6.423207 8.358223 6.866477
Stream 5 8.421209 9.040726 7.799425 3.908758 23.342975 9.934672 11.455598 8.258504
Min Qi 8.421209 5.783821 4.533748 3.289724 23.021583 6.370577 7.028898 5.801641
Max Qi 15.612070 9.040726 10.265949 5.733828 25.249435 9.934672 11.455598 8.258504
Avg Qi 12.587192 7.189892 7.252625 4.660841 24.204873 8.007484 9.161416 7.351616
  Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 2.258070 0.981896 1.161602 1.933124 2.203497 1.042949 3.349407 1.296630
Stream 1 8.213340 4.070175 5.662723 12.260503 7.792825 3.323136 9.296430 3.939927
Stream 2 16.754827 3.895688 4.413773 7.529466 6.288539 2.717479 11.222082 4.135510
Stream 3 8.486809 2.615640 7.426936 7.274289 6.706145 3.402654 8.278881 4.260483
Stream 4 12.604905 7.735042 5.627039 6.343302 7.242370 3.492640 6.503095 3.698821
Stream 5 8.221733 2.670036 5.866626 13.108081 9.428098 4.282014 8.213320 4.088321
Min Qi 8.213340 2.615640 4.413773 6.343302 6.288539 2.717479 6.503095 3.698821
Max Qi 16.754827 7.735042 7.426936 13.108081 9.428098 4.282014 11.222082 4.260483
Avg Qi 10.856323 4.197316 5.799419 9.303128 7.491595 3.443585 8.702762 4.024612

Run 2

Report Date 2014-04-21
Database Scale Factor 100
Total Data Storage/Database Size 1 TB / 87,496 MB
Start of Database Load 2014-04-21 21:02:43
End of Database Load 2014-04-21 21:17:45
Database Load Time 0:15:02
Query Streams for Throughput Test 5
Virt-H Power 257,944.7
Virt-H Throughput 240,998.0
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 249,327.4
Measurement Interval in Throughput Test (Ts) 164.417000 seconds

Duration of stream execution

  Start Date/Time End Date/Time Duration
Stream 0 2014-04-21 21:21:20 2014-04-21 21:22:01 0:00:41
Stream 1 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 2 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 3 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 4 2014-04-21 21:22:02 2014-04-21 21:24:44 0:02:42
Stream 5 2014-04-21 21:22:02 2014-04-21 21:24:46 0:02:44
Refresh 0 2014-04-21 21:21:20 2014-04-21 21:21:22 0:00:02
&$160; 2014-04-21 21:21:22 2014-04-21 21:21:23 0:00:01
Refresh 1 2014-04-21 21:22:49 2014-04-21 21:23:04 0:00:15
Refresh 2 2014-04-21 21:22:01 2014-04-21 21:22:14 0:00:13
Refresh 3 2014-04-21 21:22:14 2014-04-21 21:22:27 0:00:13
Refresh 4 2014-04-21 21:22:26 2014-04-21 21:22:39 0:00:13
Refresh 5 2014-04-21 21:22:39 2014-04-21 21:22:49 0:00:10

Numerical Quantities Summary -- Timing Intervals in Seconds

  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.437262 0.227516 1.172620 0.541201 1.542084 0.743255 1.459368 1.183166
Stream 1 5.205225 0.499833 4.854558 4.818087 5.920773 3.347414 5.446411 3.723247
Stream 2 5.833803 0.659051 6.023266 3.123523 4.358200 3.371315 6.772453 4.978415
Stream 3 6.308935 0.662744 7.573807 5.000859 5.282467 4.391930 5.280472 7.852718
Stream 4 5.791856 0.421592 5.953592 4.688037 9.949038 3.098282 4.153124 4.824209
Stream 5 13.537098 1.760386 3.308982 2.299178 4.882695 2.652497 5.383128 10.178447
Min Qi 5.205225 0.421592 3.308982 2.299178 4.358200 2.652497 4.153124 3.723247
Max Qi 13.537098 1.760386 7.573807 5.000859 9.949038 4.391930 6.772453 10.178447
Avg Qi 7.335383 0.800721 5.542841 3.985937 6.078635 3.372288 5.407118 6.311407
  Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 4.441940 1.948770 2.154384 1.148494 6.014453 1.647725 1.437587 1.585284
Stream 1 14.127674 7.824844 7.100679 3.586457 28.216115 7.587547 9.859152 5.829869
Stream 2 16.102880 7.676986 5.887327 2.796729 24.847035 7.146757 11.408922 7.641239
Stream 3 15.678701 5.786427 9.221883 2.692321 28.434916 6.657457 8.219745 7.706585
Stream 4 11.985421 10.182807 5.667618 6.875264 27.547492 7.438075 9.065924 8.895070
Stream 5 6.913707 7.662703 8.657333 3.282895 24.126612 10.963691 12.138564 7.962654
Min Qi 6.913707 5.786427 5.667618 2.692321 24.126612 6.657457 8.219745 5.829869
Max Qi 16.102880 10.182807 9.221883 6.875264 28.434916 10.963691 12.138564 8.895070
Avg Qi 12.961677 7.826753 7.306968 3.846733 26.634434 7.958705 10.138461 7.607083
  Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 2.275267 1.139390 1.165591 2.073658 2.261869 0.703055 2.327755 1.146501
Stream 1 13.720792 4.428528 3.651645 9.841610 6.710473 2.595879 9.783844 3.800103
Stream 2 12.532257 2.312755 6.182661 8.666967 9.383983 1.414853 7.570509 4.539598
Stream 3 7.578779 3.342352 8.155356 4.925493 6.590047 2.612912 8.497542 4.638512
Stream 4 10.967178 2.173935 6.382803 5.082562 8.744671 3.074768 7.577794 4.435140
Stream 5 9.438581 2.551124 8.375607 8.339441 8.201650 1.982935 7.334306 3.404017
Min Qi 7.578779 2.173935 3.651645 4.925493 6.590047 1.414853 7.334306 3.404017
Max Qi 13.720792 4.428528 8.375607 9.841610 9.383983 3.074768 9.783844 4.638512
Avg Qi 10.847517 2.961739 6.549614 7.371215 7.926165 2.336269 8.152799 4.163474

Details of System Under Test (SUT)


Chassis Supermicro 2U
Motherboard Supermicro X9DR3-LN4F+
CPU 2 x Intel Xeon E5-2630 @ 2.3 GHz
(6 cores, 12 threads each;
total 12 cores, 24 threads)
RAM 192 GB DDR3 (24 x 8 GB, 1066MHz)
Storage 2 x Crucial 512 GB SSD


DBMS Virtuoso Open Source 7.11.3209
(feature/analytics on v7fasttrack on GitHub)
OS CentOS 6.2


This experiment places Virtuoso in the ballpark with Actian Vector (formerly branded Vectorwise), which has dominated the TPC-H score board in recent years. The published Vector results are on more cores and/or faster clock; one would have to run on the exact same platform to make precise comparisons.

Virtuoso ups the ante by providing this level of performance in open source. For a comparison with EXASolution and Actian Matrix (formerly ParAccel), we will have to go to the Virtuoso scale-out configuration, to follow shortly.

The next articles will provide a detailed analysis of performance and instructions for reproducing the results. The run outputs and scripts are available for download.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

# PermaLink Comments [0]
04/22/2014 11:30 GMT Modified: 05/28/2014 17:12 GMT
 <<     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |     >>
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform