This post complements the previous post on a social web oriented RDF store benchmark. What follows is a draft outline of a benchmark for mapping relational data to RDF. The benchmark targets mapping technologies and relies on external relational databases for all storage.

The scenario is meant to capture the case of taking an existing database infrastructure and translating it into RDF for ad hoc query and integration. This may serve to publish some of the data, such as product catalog information and order status, as an externally accessible SPARQL end point. The RDF modeling of the data may also facilitate in-house analytics. The tentative workload is intended to represent both aspects.

The below is very tentative; all imaginable caveats apply.

To avoid reinventing the wheel, we start with the TPC-C and TPC-H databases. The C database is an online order processing system and the H database is a data warehouse of orders and shipments. Both databases have orders, products and customers but are clearly not meant to be joined together. The keys for the basic entities consist of different numbers of parts, for example.

Basing this benchmark on the TPC work saves the design of the schemas and use cases.

Given this, what could we measure? From the RDF angle, we are not concerned with update transactions.

The RDF mapping benchmark must expose all information that could be had from the relational sources by themselves, plus demonstrate a layer of unification that facilitates querying without significant performance overhead.

First, let us see about preserving existing functionality:

The TPC-H queries cannot be represented in SPARQL, since they all involve GROUP BY. They can, however, be made into parameterized views of some sort, and these views could be mapped into RDF. For example, Q1 is a report on order lines grouped by shipped and returned status over a period of time. The period of time does not figure in the result set, so one cannot parameterize the view by just specifying a range for the time, as in

SELECT * FROM q1_view WHERE dt BETWEEN ? AND ?
There simply is no time in the output. If the minimum and maximum of the times were added to the output, then one could have a condition on these, and the query processor could infer that only order lines from between these times need be considered. In practice, we expect a parametric view of some vendor-dependent type to be used.

How would one invoke a parameterized view from SPARQL? Something like:

{ ?r1 a q1_report . ?r1 start_time ?dt1 . ?r1 end_time ?dt2 . ?r1 returnflag ?rf . ?r1 extended_price ?price . }

So ?r1 would be instantiated to each row of the report view, and the start and end times would be treated specially, requiring ?dt1 and ?dt2 to be bound.

In this way, RDF could be used for accessing the TPC-H queries and for specifying values for the substitution parameters. How each implementation achieves this is left to its discretion.
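
To make this concrete, here is a minimal sketch of such an invocation. The : namespace, the predicate names and the date values are made up for the example; the mapping layer would translate the bound start and end times into the view's substitution parameters:

PREFIX : <http://example.org/bench#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?rf ?price
WHERE {
  # ?r1 ranges over the rows of the Q1 report;
  # start_time and end_time supply the view's parameters
  ?r1 a :q1_report .
  ?r1 :start_time "1998-01-01"^^xsd:date .
  ?r1 :end_time "1998-09-02"^^xsd:date .
  ?r1 :returnflag ?rf .
  ?r1 :extended_price ?price .
}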

Now, let us see what more is enabled by RDF.

As said, the C and H schemas were designed separately and are not intended for joining. A mapping between C and H is conceptually meaningful for customer, geographic area and product. Orders and order lines exist in both. For the purposes of our benchmark, we consider the C database to be the recent history and the H database the longer-term history, with possible overlap between the two.

To make things more interesting, we could have multiple separately populated C and H databases, for example split according to product type or geographical area. We may consider this later; for now, let us just consider one C database and one H database. We will assume that these are hosted by different DBMS instances and that the DBMS managing one cannot join with the other within a single SQL query. Thus, distributed joins are to be performed by the RDF layer.

The queries to be answered include the following:

  • Browsing - Any primary key is to be mapped to an IRI. Any foreign key relation is to be navigable through a predicate representing it. Take

    ?customer has_order ?order
    This is to be mapped to both customer-to-order relations, from the C as well as the H database. Thus the query
    ?c has_name "ACME" . ?c has_order ?o . ?o has_line ?l . ?l has_product ?p . ?p has_name ?n .
    should get all names of products ever ordered by ACME in either database.

    The has_order and other unified predicates will have sub-predicates referring specifically to the customer-to-order relationship in one of the databases (sketch below the list).

  • Periodically, any of the TPC-H queries. These are evaluated by the RDBMS, and the RDF layer is not expected to affect their performance. They are included simply to demonstrate that this can be done.

  • Advertising - Given the customer's purchase history, the query selects advertising to show when the user connects to the e-commerce web site running off the C database.

  • Product family - The products in both databases are classified by a SKOS vocabulary. Both databases can list products based on queries using the SKOS terminology. How the SKOS taxonomy maps onto the different fields of the source product tables is to be specified (sketch below the list).

  • Product recommendations - Find customers with a similar purchasing history and recommend products they have bought, excluding ones the user has already purchased (sketch below the list).

  • Ad-hoc describe of a product or customer, fetching information from both databases. For example, get the shipping delay of orders of products of category c now and at this time last year.

  • Faceted browsing of products and recent orders. Start with a category and drill down into individual orders and customers. This would represent a customer browsing the catalog and personal order information through a web user interface.

  • Extraction - Given a time interval, retrieve all orders and order lines placed within the interval, as well as the customers, products and suppliers reachable from these orders, and export them as RDF text. The rationale is to load the data into an RDF warehouse, presumably for off-line joining with other data sets (sketch below the list).
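
A few sketches of how some of the above might look in SPARQL. These are illustrations only: the namespace prefixes, the predicate and class names (:has_order, :in_category, :office_supplies and so on) and the literal values are made up for the example, not a fixed vocabulary, and the queries stay within SPARQL of the day, i.e. no aggregates.

For browsing, the unified has_order predicate and the per-database variants would be related by rdfs:subPropertyOf, say tpcc:has_order rdfs:subPropertyOf :has_order, so that a query can restrict itself to one source by naming the sub-predicate:

PREFIX : <http://example.org/bench#>
PREFIX tpcc: <http://example.org/tpcc#>

# orders of ACME as recorded in the C database only
SELECT ?o
WHERE {
  ?c :has_name "ACME" .
  ?c tpcc:has_order ?o .
}

For the product family task, listing products one level under a given SKOS concept; deeper levels would repeat the skos:broader step or rely on backend support for transitive closure:

PREFIX : <http://example.org/bench#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?product ?name
WHERE {
  ?concept skos:broader :office_supplies .  # :office_supplies is a made-up concept
  ?product :in_category ?concept .
  ?product :has_name ?name .
}

For product recommendations, a naive version: find customers who bought some product in common with the user and return their other products, minus what the user already has. Ranking the candidates by degree of overlap would need aggregation and is left to the application or to vendor extensions:

PREFIX : <http://example.org/bench#>

SELECT DISTINCT ?rec
WHERE {
  ?me :has_name "ACME" .
  # ?other shares at least one purchased product with ?me
  ?me :has_order ?o1 . ?o1 :has_line ?l1 . ?l1 :has_product ?shared .
  ?other :has_order ?o2 . ?o2 :has_line ?l2 . ?l2 :has_product ?shared .
  FILTER (?other != ?me)
  # candidate recommendations are ?other's products ...
  ?other :has_order ?o3 . ?o3 :has_line ?l3 . ?l3 :has_product ?rec .
  # ... excluding those ?me already bought
  OPTIONAL { ?me :has_order ?o4 . ?o4 :has_line ?l4 . ?l4 :has_product ?rec . }
  FILTER (!bound(?l4))
}

For extraction, a CONSTRUCT over a time interval; the full task would add corresponding patterns for the customers, products and suppliers reachable from the orders:

PREFIX : <http://example.org/bench#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT { ?o ?op ?ov . ?l ?lp ?lv . }
WHERE {
  ?o a :order .
  ?o :order_date ?dt .
  FILTER (?dt >= "2008-01-01"^^xsd:date && ?dt < "2008-02-01"^^xsd:date)
  ?o ?op ?ov .
  ?o :has_line ?l .
  ?l ?lp ?lv .
}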

Defining the Workload

We cannot go into the details of the workload here; ad hoc experimentation will be needed to see what interesting queries can be made against the data and what other data structures are needed to support them, for example a cube of pre-computed totals for faceted drill-down.

Scaling
  • The databases should be scaled so that the H database has 1.5x as many products as the C database, since there may be discontinued products or versions that still figure in the history but are no longer stocked.

  • The H database should have the same number of customers as the C database.

  • The H database should have 18x the number of orders: if the C database covers the last two months, roughly the longest delay between order and shipping, then the H database's 3 years of history amount to 18 such periods.

Joining
  • Since the primary keys in the databases are different, we use customer names and phone numbers to establish identity (sketch below the list).

  • Products are joined based on product name.

  • The data must be tweaked so that these will match for a sufficient number of rows.
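
As a minimal sketch of the customer identity join, with made-up tpcc: and tpch: namespaces for the per-database predicates:

PREFIX tpcc: <http://example.org/tpcc#>
PREFIX tpch: <http://example.org/tpch#>

SELECT ?cc ?ch
WHERE {
  # a C-side and an H-side customer are taken to denote the same
  # customer when the name and phone number match
  ?cc a tpcc:customer . ?cc tpcc:has_name ?name . ?cc tpcc:has_phone ?phone .
  ?ch a tpch:customer . ?ch tpch:has_name ?name . ?ch tpch:has_phone ?phone .
}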

Implementation
  • We expect custom code to be required for most of the above tasks. All such code should be disclosed.

Metric
  • The metric should not recapture the TPC-H metric. Hence the H queries per se are not measured; rather, they are something that must be possible through the RDF interface.

  • It is reasonable to publish the run time of each of the 22 queries when submitted as SQL and when submitted through the RDF view interface. We do not expect great divergence.

  • Two actual metrics can be thought of: one for the extraction and the other for short ad hoc lookups.

  • We expect that persistent structures not present in the C and H databases will be needed for supporting the queries. Implementing these should not be predicated on having write access to the relational databases past the stage of setting up the benchmark data sets.

  • The costing should not include the C and H databases and their supporting systems; these will already be paid for in the usage scenario. Any extra hardware and software should be included. The major part of the cost is expected to be custom coding and configuration, which should be reported as lines of code, broken down by language. The monetary cost of these will depend on the parties' proficiency with the technologies and cannot be accurately measured.