A few blogs back, I was talking about the comeback of SQL (i.e., relational) federation. True, Virtuoso has always had that, but in the general perception, it is primarily known for RDF and Linked Data. Now these two areas of functionality are combined in a way that makes sense and can provide real performance, often surpassing the relational installed base while running SPARQL with no predefined schema.

So, we take our standard TPC-H 100 GB dataset and turn it into RDF. We adjust the schema a little, so that each customer, its orders, and the lineitems of those orders go into a per-customer graph. The part, partsupp, and other tables all go into a public graph. The per-customer graph can serve as a security label, for example when customers have self-service access to the warehouse, or when access is compartmentalized by area of responsibility (e.g., customer country or market segment).
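
To make the layout concrete, here is a minimal sketch of a query scoped to one such graph. The graph IRI pattern is an illustrative assumption; the actual IRIs come from the mapping discussed below.

SQL> sparql 
  PREFIX  rdfh:  <http://lod2.eu/schemas/rdfh#>
  SELECT  COUNT (*) 
   WHERE  { GRAPH  <http://lod2.eu/data/rdfh-cust/12345>   # hypothetical per-customer graph IRI
              { ?ord  rdfh:o_orderdate  ?date } }
;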

We use two Virtuoso processes. The first contains the TPC-H 100G dataset, the same one discussed in the TPC-H bulk load article. The second process attaches the tables from the first via SQL federation and constructs an RDF translation in its own RDF store. The mapping is made with an RDF view, also known as a Linked Data View. The initial RDF view can be generated from the relational schema, then edited to select the properties. If the mapping involves modeling or unit changes, these are most easily done with SQL views, in which case the RDF mapping is made on top of the views rather than the actual tables. The SQL views reside on the same server as the RDF views, so no write access to the source database is needed.
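
For example, if prices were to be exposed in cents rather than dollars, a SQL view over the attached table would do the conversion, and the RDF view would map the view. A minimal sketch, with hypothetical names:

-- Hypothetical unit-conversion view; the RDF view maps LINEITEM_C
-- instead of the attached LINEITEM, so the source stays untouched.
CREATE VIEW DB.DBA.LINEITEM_C AS
  SELECT l.*, l.L_EXTENDEDPRICE * 100 AS L_EXTENDEDPRICE_CENTS
    FROM DB.DBA.LINEITEM l;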

The server configuration is found in virtuoso.ini. It is sized for 4 disks and 192 GB RAM, so if you try this, make sure you have at least this much hardware, or use a correspondingly scaled-down dataset.
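
The settings that matter most are the buffer counts. An illustrative fragment, assuming the usual rule of thumb of about two-thirds of RAM in 8 KB database buffers (the actual file has the exact values):

[Parameters]
NumberOfBuffers  = 16000000   ; ~128 GB of 8 KB buffers, about 2/3 of 192 GB
MaxDirtyBuffers  = 12000000   ; flush threshold, about 3/4 of the buffers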

The attached tables and the mapping are defined by loading two scripts:

SQL> LOAD att2.sql ;

SQL> LOAD sql_rdf_rdfh11.sql ;

The first attaches the tables from the source server; the second defines the mapping from tables to triples. Both are sketched below. The final script starts the actual ETL.
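
The scripts are not reproduced here, but their shape is simple. att2.sql consists of ATTACH TABLE statements, one per source table; the DSN name and credentials below are illustrative:

-- att2.sql (sketch): make the source tables visible via SQL federation.
ATTACH TABLE DB.DBA.LINEITEM FROM 'tpch100' USER 'dba' PASSWORD 'dba';
ATTACH TABLE DB.DBA.ORDERS   FROM 'tpch100' USER 'dba' PASSWORD 'dba';
-- ... likewise for CUSTOMER, PART, PARTSUPP, SUPPLIER, NATION, REGION.

sql_rdf_rdfh11.sql declares IRI classes and quad maps. A schematic fragment in the spirit of Virtuoso's Linked Data View syntax, with illustrative IRI patterns (the full script covers all tables and the per-customer graphs):

SQL> SPARQL
  PREFIX rdfh: <http://lod2.eu/schemas/rdfh#>
  CREATE IRI CLASS rdfh:part_iri "http://lod2.eu/data/rdfh/part/%d"
      ( IN p_partkey INTEGER NOT NULL ) .
;
SQL> SPARQL
  PREFIX rdfh: <http://lod2.eu/schemas/rdfh#>
  ALTER QUAD STORAGE virtrdf:DefaultQuadStorage
    FROM DB.DBA.PART AS part
  {
    CREATE virtrdf:rdfh_part AS GRAPH IRI ("http://lod2.eu/data/rdfh")
    {
      rdfh:part_iri (part.P_PARTKEY)
        a            rdfh:part ;
        rdfh:p_name  part.P_NAME ;
        rdfh:p_size  part.P_SIZE .
    }
  } .
;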

We set the default vector size to 200,000. Experience shows this is good for this sort of operation, and can save some 20% of the time.

SQL> __dbf_set ('dc_batch_sz', 200000) ;

We run the transformation. The commands are in the rdb2rdf_rdfh11_1.sql script, sketched below. ld_meter_run starts a thread that records the load rate every 30 seconds.
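
The script itself boils down to a materialization of the RDF view into stored triples. A minimal sketch, assuming the view triples are exposed in a separate graph (both graph IRIs are illustrative); log_enable (2) turns off transaction logging and enables row-autocommit, which is what makes this a non-transactional bulk load:

-- rdb2rdf_rdfh11_1.sql (sketch)
log_enable (2);

SPARQL
  INSERT { GRAPH <http://lod2.eu/data/rdfh> { ?s ?p ?o } }
   WHERE { GRAPH <http://lod2.eu/data/rdfh-view> { ?s ?p ?o } };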

SQL> LOAD /mvi/te/suite/tpc-d/rdb2rdf_rdfh11_1.sql &

SQL> ld_meter_run (30) &


Done. -- 40686670 msec.

-- The load is a non-transactional bulk load, so it needs an explicit checkpoint to make the result durable.

SQL> checkpoint ;
Done. -- 1858934 msec.

-- We check the result:

SQL> sparql 
  SELECT  COUNT (*) 
   WHERE  { ?s  ?p  ?o }
;

11869611740

SQL> sparql 
  SELECT  ?p COUNT (*) 
   WHERE  { ?s  ?p  ?o } 
GROUP BY  ?p 
ORDER BY  DESC 2 
   LIMIT  200
;
Predicate URI                                           Number of triples
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 881,038,747
http://lod2.eu/schemas/rdfh#l_has_order 600,037,902
http://lod2.eu/schemas/rdfh#l_has_part 600,037,902
http://lod2.eu/schemas/rdfh#l_number 600,037,902
http://lod2.eu/schemas/rdfh#l_discount 600,037,902
http://lod2.eu/schemas/rdfh#l_linestatus 600,037,902
http://lod2.eu/schemas/rdfh#l_shipdate 600,037,902
http://lod2.eu/schemas/rdfh#l_quantity 600,037,902
http://lod2.eu/schemas/rdfh#l_extendedprice 600,037,902
http://lod2.eu/schemas/rdfh#l_commitdate 600,037,902
http://lod2.eu/schemas/rdfh#l_has_supplier 600,037,902
http://lod2.eu/schemas/rdfh#l_tax 600,037,902
http://lod2.eu/schemas/rdfh#l_returnflag 600,037,902
http://lod2.eu/schemas/rdfh#l_receiptdate 600,037,902
http://lod2.eu/schemas/rdfh#l_shipinstruct 600,037,902
http://lod2.eu/schemas/rdfh#l_shipmode 600,037,902
http://lod2.eu/schemas/rdfh#o_clerk 150,000,000
http://lod2.eu/schemas/rdfh#o_comment 150,000,000
http://lod2.eu/schemas/rdfh#c_customer_of 150,000,000
http://lod2.eu/schemas/rdfh#o_orderstatus 150,000,000
http://lod2.eu/schemas/rdfh#o_totalprice 150,000,000
http://lod2.eu/schemas/rdfh#o_orderpriority 150,000,000
http://lod2.eu/schemas/rdfh#o_orderkey 150,000,000
http://lod2.eu/schemas/rdfh#o_orderdate 150,000,000
http://lod2.eu/schemas/rdfh#o_shippriority 150,000,000
http://lod2.eu/schemas/rdfh#ps_has_supplier 80,000,000
http://lod2.eu/schemas/rdfh#ps_availqty 80,000,000
http://lod2.eu/schemas/rdfh#ps_supplycost 80,000,000
http://lod2.eu/schemas/rdfh#ps_has_part 80,000,000
http://lod2.eu/schemas/rdfh#p_type 20,000,000
http://lod2.eu/schemas/rdfh#p_size 20,000,000
http://lod2.eu/schemas/rdfh#p_container 20,000,000
http://lod2.eu/schemas/rdfh#p_mfgr 20,000,000
http://lod2.eu/schemas/rdfh#p_partkey 20,000,000
http://lod2.eu/schemas/rdfh#p_name 20,000,000
http://lod2.eu/schemas/rdfh#p_brand 20,000,000
http://lod2.eu/schemas/rdfh#p_comment 20,000,000
http://xmlns.com/foaf/0.1/phone 16,000,000
http://lod2.eu/schemas/rdfh#c_comment 15,000,000
http://lod2.eu/schemas/rdfh#c_acctbal 15,000,000
http://lod2.eu/schemas/rdfh#c_mktsegment 15,000,000
http://lod2.eu/schemas/rdfh#c_custkey 15,000,000
http://lod2.eu/schemas/rdfh#c_name 15,000,000
http://lod2.eu/schemas/rdfh#c_has_nation 15,000,000
http://lod2.eu/schemas/rdfh#c_address 15,000,000
http://lod2.eu/schemas/rdfh#c_phone 15,000,000
http://lod2.eu/schemas/rdfh#n_nation_of 15,000,000
http://lod2.eu/schemas/rdfh#s_has_nation 1,000,000
http://lod2.eu/schemas/rdfh#s_acctbal 1,000,000
http://lod2.eu/schemas/rdfh#s_name 1,000,000
http://lod2.eu/schemas/rdfh#s_address 1,000,000
http://lod2.eu/schemas/rdfh#s_comment 1,000,000
http://lod2.eu/schemas/rdfh#s_suppkey 1,000,000
http://lod2.eu/schemas/rdfh#s_phone 1,000,000
...

We calculate the ETL speed:

SQL> SELECT 11869611740 / ( ( 1858934 + 40686670 ) / 1000.0 ) ;

278985.620700084549276

11.8 hours, at just under 280 Kt/s (thousand triples per second), end to end. Worse has been heard of. This is single-server speed on the usual test system: dual Xeon E5-2630, 192 GB RAM. A single server at double the price might get double the throughput. Beyond that, scale-out is clearly the better deal: an elastic cluster gets throughput linear in the number of machines for this type of workload.

This shows that deploying mid-size enterprise data as RDF is a job that goes easily overnight on a commodity box, reading directly from the source system; no file-system-based staging areas are needed.

The dataset is 600M order lines; 150M orders; 15M customers; 20M parts, each with 4 suppliers; 1M total suppliers. You can compare this to what you have in-house to get a rough estimate of what your own data warehouse would come to.
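
As a back-of-envelope aid for such estimates, the total source row count is about 866M (600M lineitem + 150M orders + 80M partsupp + 20M part + 15M customer + 1M supplier), so the translation comes to roughly 13.7 triples per source row:

SQL> SELECT 11869611740 / (( 600037902 + 150000000 + 80000000 + 20000000 + 15000000 + 1000000 ) * 1.0) ;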

Later, we will use this dataset to illustrate how to scope queries to security categories with graph-level security. Of course, this dataset also provides a point of SQL-to-SPARQL comparison for the ongoing TPC-H series. More installments will follow before long.