As before, the result has been measured with the feature/analytics branch of the v7fasttrack open source distribution, and it will soon be available as a preconfigured Amazon EC2 image. The updated benchmarks AMI with this version of the software will be available within the next week and will be announced on this blog.
RDF query optimization is harder than the relational equivalent: first, because there are more joins, hence an NP-complete explosion of the plan search space; and second, because cardinality estimation is harder and usually less reliable. The work on characteristic sets, pioneered by Thomas Neumann in RDF-3X, exploits regularities in structure by treating properties that usually occur in the same subject as columns of a table. The same idea is applied for tuning physical representation in the joint Virtuoso / MonetDB work published at WWW 2015.
The Virtuoso results discussed here, however, are all based on a single RDF quad table with Virtuoso's default index configuration.
Introducing query plan caching raises the Virtuoso score from 80 qps to 144 qps at the 256 Mtriple scale. The SPB queries are not extremely complex; lookups with many more triple patterns exist in actual workloads, e.g., Open PHACTS. In such applications, query optimization indeed dominates execution times. In SPB, data volumes touched by queries grow near linearly with data scale. At the 256 Mtriple scale, nearly half of CPU cycles are spent deciding a query plan. Below are the CPU cycles for execution and compilation per query type, sorted by descending sum of the times, scaled to milliseconds per execution. These are taken from a one minute sample of running at full throughput.
Test system is the same used before in the TPC-H series: dual Xeon E5-2630 Sandy Bridge, 2 x 6 cores x 2 threads, 2.3GHz, 192 GB RAM.
We measure the compile and execute times, with and without using hash join. When considering hash join, the throughput is 80 qps. When not considering hash join, the throughput is 110 qps. With query plan caching, the throughput is 145 qps whether or not hash join is considered. Using hash join is not significant for the workload but considering its use in query optimization leads to significant extra work.
(The per-query-type compilation and execution time tables, with and without hash join, appeared here.)
Considering hash join always slows down compilation, and sometimes improves and sometimes worsens execution. Some improvement in the cost model and plan-space traversal order is possible, but removing compilation altogether via plan caching is better still. The results are as expected, since a lookup workload such as SPB has little use for hash join by nature.
The rationale for considering hash join in the first place is that analytical workloads rely heavily on it. A good TPC-H score is simply unfeasible without it, as previously discussed on this blog. If RDF is to be a serious contender beyond serving lookups, then hash join is indispensable. The decision to use it, however, depends on accurate cardinality estimates on either side of the join.
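As a concrete illustration of the estimation problem, consider a TPC-H-style selective hash join (a sketch, not an SPB query): whether building a hash table on the filtered side pays off depends entirely on how well the optimizer estimates the cardinality of that side.

-- A sketch: the optimizer should build the hash table on the side reduced by
-- the p_name filter; with a poor cardinality estimate it may pick the wrong
-- build side or skip the hash join altogether.
SELECT SUM (l_extendedprice * (1 - l_discount))
  FROM lineitem, part
 WHERE p_partkey = l_partkey
   AND p_name LIKE '%green%';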
Previous work (e.g., papers from FORTH around MonetDB) advocates doing away with a cost model altogether, since one is hard to build and unreliable with RDF anyway. The idea is not without its attraction, but it will lead to missing out on analytics or to relying on query hints for hash join.
The present Virtuoso thinking is that going to rule based optimization is not the preferred solution, but rather using characteristic sets for reducing triples into wider tables, which also cuts down on plan search space and increases reliability of cost estimation.
When looking at execution alone, we see that actual database operations are low in the profile, with memory management taking the top 19%. This is due to CONSTRUCT queries allocating small blocks for returning graphs, which is entirely avoidable.
The new report has usage data through July 31, 2015, and held a few surprises for us. What do you think?
]]>The load setup is the same as ever, with copying from CSV files attached as external tables into Parquet tables. We get lineitem split over 88 Parquet files, which should provide enough parallelism for the platform. The Impala documentation states that there can be up to one thread per file, and here we wish to see maximum parallelism for a single query stream. We use the schema from the Impala github checkout, with string for string and date columns, and decimal for numbers. We suppose the authors know what works best.
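As a rough sketch of this kind of load (illustrative table and path names, not the actual scripts used here), the pattern in Impala SQL is an external text table plus a CREATE TABLE ... AS SELECT into Parquet:

-- Sketch only: a few lineitem columns, CSV attached as an external table
CREATE EXTERNAL TABLE lineitem_csv (
  l_orderkey BIGINT,
  l_partkey BIGINT,
  l_suppkey BIGINT,
  l_linenumber INT,
  l_quantity DECIMAL(15,2),
  l_extendedprice DECIMAL(15,2)
  -- remaining TPC-H columns omitted for brevity
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/tpch/csv/lineitem';

-- Copy into a Parquet table; Impala splits the output over multiple files
CREATE TABLE lineitem STORED AS PARQUET AS SELECT * FROM lineitem_csv;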
The execution behavior is surprising. Sometimes we get full platform utilization, but quite often only 200% CPU per box. The query plan for Q1, for example, specifies 2 cores per box. This makes no sense, as the same plan knows full well the table cardinality. The settings for scanner threads and cores to use (in impala-shell) can be changed, but the behavior does not seem to change.
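For reference, scanner parallelism is controlled per session with a query option in impala-shell; this is a hedged example, and changing it did not alter the behavior we saw:

-- Query option set in impala-shell (hedged example; 0 leaves the choice to Impala)
SET NUM_SCANNER_THREADS=16;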
Following are the run times for one query stream.
Query | Virtuoso | Impala | Notes |
---|---|---|---|
— | 332 s | 841 s | Data Load |
Q1 | 1.098 s | 164.61 s | |
Q2 | 0.187 s | 24.19 s | |
Q3 | 0.761 s | 105.70 s | |
Q4 | 0.205 s | 179.67 s | |
Q5 | 0.808 s | 84.51 s | |
Q6 | 2.403 s | 4.43 s | |
Q7 | 0.59 s | 270.88 s | |
Q8 | 0.775 s | 51.89 s | |
Q9 | 1.836 s | 177.72 s | |
Q10 | 3.165 s | 39.85 s | |
Q11 | 1.37 s | 22.56 s | |
Q12 | 0.356 s | 17.03 s | |
Q13 | 2.233 s | 103.67 s | |
Q14 | 0.488 s | 10.86 s | |
Q15 | 0.72 s | 11.49 s | |
Q16 | 0.814 s | 23.93 s | |
Q17 | 0.681 s | 276.06 s | |
Q18 | 1.324 s | 267.13 s | |
Q19 | 0.417 s | 368.80 s | |
Q20 | 0.792 s | 60.45 s | |
Q21 | 0.720 s | 418.09 s | |
Q22 | 0.155 s | 40.59 s | |
Total | 20 s | 2724 s | |
Because the platform utilization was often low, we made a second experiment, running the same queries in five parallel sessions. We show the average execution time for each query and compare this with the Virtuoso throughput run average times. We permute the single query stream used in the first tests into 5 different orders, as per the TPC-H spec. The results are not entirely comparable, because Virtuoso is doing the refreshes in parallel; according to the Impala documentation, there is no random delete operation, so the refreshes cannot be implemented in Impala.
Just to establish a baseline, we do SELECT COUNT (*) FROM lineitem. This takes 20s when run by itself. When run in five parallel sessions, the fastest terminates in 64s and the slowest in 69s. Looking at top, the platform utilization is indeed about 5x more in CPU%, but the concurrency does not add much to throughput. This is odd, considering that there is no synchronization requirement worth mentioning between the operations.
Following are the average times for each query in the 5 stream experiment.
Query | Virtuoso | Impala | Notes |
---|---|---|---|
Q1 | 1.95 s | 191.81 s | |
Q2 | 0.70 s | 40.40 s | |
Q3 | 2.01 s | 95.67 s | |
Q4 | 0.71 s | 345.11 s | |
Q5 | 2.93 s | 112.29 s | |
Q6 | 4.76 s | 14.41 s | |
Q7 | 2.08 s | 329.25 s | |
Q8 | 3.00 s | 98.91 s | |
Q9 | 5.58 s | 250.88 s | |
Q10 | 8.23 s | 55.23 s | |
Q11 | 4.26 s | 27.84 s | |
Q12 | 1.74 s | 37.66 s | |
Q13 | 6.07 s | 147.69 s | |
Q14 | 1.73 s | 23.91 s | |
Q15 | 2.27 s | 23.79 s | |
Q16 | 2.41 s | 34.76 s | |
Q17 | 3.92 s | 362.43 s | |
Q18 | 3.02 s | 348.08 s | |
Q19 | 2.27 s | 443.94 s | |
Q20 | 3.05 s | 92.50 s | |
Q21 | 2.00 s | 623.69 s | |
Q22 | 0.37 s | 61.36 s | |
Total for Slowest Stream | 67 s | 3740 s | |
Four queries in Impala terminated with an error (memory limit exceeded): two Q21s, one Q19, and one Q4. One stream executed without errors, so this stream is reported as the slowest stream. Q21 will, in the absence of indexed access, build a hash join build side from half of lineitem, which explains running out of memory. Virtuoso does Q21 mostly by index.
Looking at the 5 streams, we see CPU between 1000% and 2000% on either box. This looks about 5x more than the 250% per box that we were seeing with, for instance, Q1. The process sizes for impalad are over 160G, certainly enough to have the working set in memory. iostat also does not show any I/O, so we seem to be running from memory, as intended.
We observe that Impala does not store tables in any specific order. Therefore a merge join of orders and lineitem is not possible. Thus we always get a hash join with a potentially large build side, e.g., half of orders and half of lineitem in Q21, and all of orders in Q9. This explains in part why these take so long. TPC-DS does not pose this particular problem, though, as there are no tables in the DS schema where the primary key of one would be the prefix of that of another.
However, the lineitem/orders join does not explain the scores on Q1, Q20, or Q19. A simple hash join of lineitem and part was about 90s, with a replicated part hash table. In the profile, the hash probe was 74s, which seems excessive. One would have to single-step through the hash probe to find out what actually happens. Maybe there are prohibitive numbers of collisions, which would throw off the results across the board. We would have to ask the Impala community about this.
Anyway, Impala experts out there are invited to set the record straight. We have attached the results and the output of the Impala profile statement for each query. impala_stream0.zip contains the evidence for the single-stream run; impala-stream1-5.zip holds the 5-stream run.
To be more Big Data-like, we should probably run with significantly larger data than memory; for example, 3T in 0.5T RAM. At EC2, we could do this with 2 I3.8 instances (6.4T SSD each). With Virtuoso, we'd be done in 8 hours or so, counting 2x for the I/O and 30x for the greater scale (the 100G experiment goes in 8 minutes or so, all included). With Impala, we could be running for weeks, so at the very least we'd like to do this with an Impala expert, to make sure things are done right and will not have to be retried. Some of the hash joins would have to be done in multiple passes and with partitioning.
In subsequent articles, we will look at other players in this space, and possibly some other benchmarks, like the TPC-DS subset that Actian uses to beat Impala.
]]>All the experiments are against the TPC-H 100G dataset hosted in Virtuoso on the test system used before in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The Virtuoso version corresponds to the feature/analytics branch in the v7fasttrack github project. All run times are from memory, and queries generally run at full platform, 24 concurrent threads.
We note that RDF stores and graph databases usually do not have secondary indices with multiple key parts; they do, however, predominantly use index-based access as opposed to big scans and hash joins. To explore the impact of this, we have decomposed the tables into projections with a single dependent column, which approximates a triple store or a vertically decomposed graph database like Sparksee.
So, in these experiments, we store the relevant data four times over, as follows:
100G TPC-H dataset in the column-wise schema as discussed in the TPC-H series, now complemented with indices on l_partkey and on l_partkey, l_suppkey
The same in row-wise data representation
Column-wise tables with a single dependent column for l_partkey, l_suppkey, l_extendedprice, l_quantity, l_discount, ps_supplycost, s_nationkey, p_name. These all have the original table's primary key, e.g., l_orderkey, l_linenumber for the l_-prefixed tables
The same with row-wise tables
The column-wise structures are in the DB qualifier, and the row-wise are in the R qualifier. There is a summary of space consumption at the end of the article. This is relevant for scalability, since even if row-wise structures can be faster for scattered random access, they will fit less data in RAM, typically 2 to 3x less. Thus, if "faster" rows cause the working set not to fit, "slower" columns will still win.
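As a sketch, the extra secondary indices on lineitem mentioned above would be declared roughly as follows; the l_pksk name matches the space summary later in the article, while the other name is illustrative.

-- Secondary index on l_partkey alone
CREATE INDEX l_pk ON lineitem (l_partkey);

-- Secondary index on l_partkey, l_suppkey; the primary key parts
-- (l_orderkey, l_linenumber) are appended, giving the l_pksk key
-- (l_partkey, l_suppkey, l_orderkey, l_linenumber) listed in the space summary
CREATE INDEX l_pksk ON lineitem (l_partkey, l_suppkey);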
As a starting point, we know that the best Q9 is the one in the Virtuoso TPC-H implementation, described in Part 10 of the TPC-H blog series. This is a scan of lineitem with a selective hash join, followed by ordered index access of orders, then hash joins against the smaller tables. There are special tricks to keep the hash tables small by propagating restrictions from the probe side to the build side.
The query texts are available here, along with the table declarations and scripts for populating the single-column projections. rs.sql makes the tables and indices; rsload.sql copies the data from the TPC-H tables.
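A minimal sketch of one such single-column projection and its population (illustrative names; the actual definitions are in rs.sql and rsload.sql):

-- Projection of lineitem carrying only l_partkey, keyed like the original table
CREATE TABLE lp (
  l_orderkey   BIGINT,
  l_linenumber INT,
  l_partkey    BIGINT,
  PRIMARY KEY (l_orderkey, l_linenumber)
);

-- Secondary index for lookups by l_partkey
CREATE INDEX lp_pk ON lp (l_partkey);

-- Populate from the full-width lineitem table
INSERT INTO lp (l_orderkey, l_linenumber, l_partkey)
  SELECT l_orderkey, l_linenumber, l_partkey
    FROM lineitem;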
The business question is to calculate the profit from the sale of selected parts grouped by year and country of the supplier. This touches most of the tables, aggregates over 1/17 of all sales, and touches at least every page of the tables concerned, if not every row.
SELECT
n_name AS nation,
EXTRACT(year FROM o_orderdate) AS o_year,
SUM (l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) AS sum_profit
FROM lineitem, part, partsupp, orders, supplier, nation
WHERE s_suppkey = l_suppkey
AND ps_suppkey = l_suppkey
AND ps_partkey = l_partkey
AND p_partkey = l_partkey
AND o_orderkey = l_orderkey
AND s_nationkey = n_nationkey
AND p_name LIKE '%green%'
GROUP BY nation, o_year
ORDER BY nation, o_year DESC
The query variants discussed here are:
Hash-based, the best plan -- 9h.sql
Index-based with multicolumn rows, with lineitem index on l_partkey -- 9i.sql, 9ir.sql
Index-based with multicolumn rows, with lineitem index on l_partkey, l_suppkey -- 9ip.sql, 9ipr.sql
Index-based with one table per dependent column, index on l_partkey -- 9p.sql
Index-based with one table per dependent column, with materialized l_partkey, l_suppkey -> l_orderkey, l_linenumber -- 9pp.sql, 9ppr.sql
These are done against row- and column-wise data representations with 3 different vectorization settings. The dynamic vector size starts at 10,000 values in a vector, and adaptively upgrades this to 1,000,000 if it finds that index access is too sparse. Accessing rows close to each other is more efficient than widely scattered rows in vectored index access, so using a larger vector will likely cause a denser, hence more efficient, access pattern.
The 10K vector size corresponds to running with a fixed vector size. The Vector 1 setting sets the vector size to 1, effectively running a tuple at a time, which corresponds to a non-vectorized engine.
We note that lineitem and its single-column projections contain 600M rows. So, a vector of 10K values will hit, on average, every 60,000th row. A vector of 1,000,000 values will thus hit every 600th row. This applies when doing random lookups that are in no specific order, e.g., getting lineitems by a secondary index on l_partkey.
Vector | Dynamic | 10k | 1 |
---|---|---|---|
Column-wise | 4.1 s | 4.1 s | 145 s |
Row-wise | 25.6 s | 25.9 s | 45.4 s |
Dynamic vector size has no effect here, as there is no indexed access that would gain from more locality. The column store is much faster because of less memory access (just scan the l_partkey column, and filter this with a Bloom filter; and then a hash table lookup to pick only items with the desired part). The other columns are accessed only for the matching rows. The hash lookup is vectored, since there are hundreds of compressed l_partkey values available at a time. The row store does the hash lookup row by row, hence losing cache locality and instruction-level parallelism.
Without vectorization, we have a situation where the lineitem scan emits one row at a time. Restarting the scan with the column store takes much longer, since 5 buffers have to be located and pinned instead of one for the row store. The row store is thus slowed down less, but it too suffers almost a factor of 2 from interpretation overhead.
lineitem indexed on l_partkey
Vector | Dynamic | 10k | 1 |
---|---|---|---|
Column-wise | 30.4 s | 62.3 s | 321 s |
Row-wise | 31.8 s | 27.7 s | 122 s |
Here the plan scans part, then partsupp, which shares ordering with part; both are ordered on partkey. Then lineitem is fetched by a secondary index on l_partkey. This produces l_orderkey, l_linenumber, which are used to get the l_suppkey. We then check if the l_suppkey matches the ps_suppkey from partsupp, which drops 3/4 of the rows. The next join is on orders, which shares ordering with lineitem; both are ordered on orderkey.
There is a narrow win for columns with dynamic vector size. When access becomes scattered, rows win by 2.5x, because there is only one page to access instead of 1 + 3 for columns. This is compensated for if the next item is found on the same page, which happens if the access pattern is denser.
lineitem indexed on l_partkey, l_suppkey
Vector | Dynamic | 10k | 1 |
---|---|---|---|
Column-wise | 16.9 s | 47.2 s | 151 s |
Row-wise | 22.4 s | 20.7 s | 89 s |
This is similar to the previous case, except that now only lineitems that match ps_partkey, ps_suppkey are accessed, as the secondary index has two columns. Access is more local. Columns thus win more with dynamic vector size.
One table per dependent column, index on l_partkey
Vector | Dynamic | 10k | 1 |
---|---|---|---|
Column-wise | 35.7 s | 170 s | 601 s |
Row-wise | 44.5 s | 56.2 s | 130 s |
Now, each of l_extendedprice, l_discount, l_quantity, and l_suppkey is a separate index lookup. The times are slightly higher but the dynamic is the same. The non-vectored columns case is hit the hardest.
One table per dependent column, with materialized l_partkey, l_suppkey -> l_orderkey, l_linenumber
Vector | Dynamic | 10k | 1 |
---|---|---|---|
Column-wise | 19.6 s | 111 s | 257 s |
Row-wise | 32.0 s | 37 s | 74.9 s |
Again, we see the same dynamic as with a multicolumn table. Columns win slightly more at long vector sizes because of overall better index performance in the presence of locality.
The following tables list the space consumption in megabytes of allocated pages. Unallocated space in database files is not counted.
The row-wise table also contains entries for column-wise structures (DB.*), since these have a row-wise sparse index. The size of this is, however, negligible: under 1% of the column-wise structures.
(Space consumption tables for the row-wise and column-wise structures appeared here.)
In both cases, the large tables are on top, but the column-wise case takes only half the space due to compression.
We note that the single-column projections are smaller column-wise. The l_extendedprice column is not very compressible, hence column-wise it takes much more space than l_quantity; the row-wise difference is smaller. Since the leading key parts l_orderkey, l_linenumber are ordered and very compressible, the column-wise structures are in all cases noticeably more compact.
The same applies to the multipart index l_pksk and r_l_pksk (l_partkey, l_suppkey, l_orderkey, l_linenumber) in column-wise and row-wise representations, respectively.
Note that STRING columns (e.g., l_comment) are not compressed. If they were, the overall space ratio would be even more to the advantage of the column store.
Column stores and vectorization inextricably belong together. Column-wise compression yields great gains also for indices, since sorted data is easy to compress. For non-sorted data too, adaptive use of dictionaries, run lengths, etc., produces great space savings. Columns also win with indexed access if there is locality.
Row stores have less dependence on locality, but they too win by a factor of 3 from dropping interpretation overhead and exploiting join locality.
For point lookups, columns lose by 2+x but considering their better space efficiency, they will still win if space savings prevent going to secondary storage. For bulk random access, like in graph analytics, columns will win because of being able to operate on a large vector of keys to fetch.
For many workloads, from TPC-H to the LDBC social network, multi-part keys are a necessary component of physical design for performance if indexed access predominates. Triple stores and most graph databases do not have such keys and are therefore at a disadvantage. Self-joining, as in RDF or other vertically decomposed structures, can cost up to a factor of 10-20 over a column-wise multicolumn table. This depends, however, on the density of access.
For analytical workloads, where the dominant join pattern is the scan with selective hash join, column stores are unbeatable, as per common wisdom. There are good physical reasons for this and the row store even with well implemented vectorization loses by a factor of 5.
For decomposed structures, like RDF quads or single column projections of tables, column stores are relatively more advantageous because the key columns are extensively repeated, and these compress better with columns than with rows. In all the RDF workloads we have tried, columns never lose, but there is often a draw between rows and columns for lookup workloads. The longer the query, the more columns win.
]]>Orri Erling (OpenLink Software); Alex Averbuch (Neo Technology); Josep Larriba-Pey (Sparsity Technologies); Hassan Chafi (Oracle Labs); Andrey Gubichev (TU Munich); Arnau Prat-Pérez (Universitat Politècnica de Catalunya); Minh-Duc Pham (VU University Amsterdam); Peter Boncz (CWI): The LDBC Social Network Benchmark: Interactive Workload. Proceedings of SIGMOD 2015, Melbourne.
This paper is an overview of the challenges posed in the LDBC social network benchmark, from data generation to the interactive workload.
Mihai Capotă (Delft University of Technology), Tim Hegeman (Delft University of Technology), Alexandru Iosup (Delft University of Technology), Arnau Prat-Pérez (Universitat Politècnica de Catalunya), Orri Erling (OpenLink Software), Peter Boncz (CWI): Graphalytics: A Big Data Benchmark for Graph-Processing Platforms. Sigmod GRADES 2015.
This paper discusses the future evolution of the LDBC Social Network Benchmark and gives a preview of Virtuoso graph traversal performance.
We begin at the beginning, with Hive, the grand-daddy of SQL on Hadoop.
The test platform is two Amazon R3.8 AMI instances. We compared Hive with the Virtuoso 100G TPC-H experiment on the same platform, published earlier on this blog. The runs follow a bulk load in both cases, with all data served from memory. The platform has 2x244GB RAM with only 40GB or so of working set.
The Virtuoso version and settings are as in the Virtuoso Cluster test AMI.
The Hive version is 0.14 from the Hortonworks HDP 2.2 distribution. The Hive schema and query formulations are the ones from hive-testbench on GitHub. The Hive configuration parameters are as set by Ambari 2.0.1. These are different from the ones in hive-testbench, but the Ambari choices offer higher performance on the platform. We did run statistics with Hive and did not specify any settings not in hive-testbench. Thus we suppose the query plans were as good as Hive will make them. Platform utilization was even across both machines, and varied between 30% and 100% of the 2 x 32 hardware threads.
Load time with Hive was 742 seconds against 232 seconds with Virtuoso. In both cases, this was a copy from 32 CSV files into native database format; for Hive, this is ORC (Optimized Row Columnar). In Virtuoso, there is one index (o_custkey); in Hive, there are no indices.
Query | Virtuoso | Hive | Notes |
---|---|---|---|
— | 332 s | 742 s | Data Load |
Q1 | 1.098 s | 296.636 s | |
Q2 | 0.187 s | >3600 s | Hive Timeout |
Q3 | 0.761 s | 98.652 s | |
Q4 | 0.205 s | 147.867 s | |
Q5 | 0.808 s | 114.782 s | |
Q6 | 2.403 s | 71.789 s | |
Q7 | 0.59 s | 394.201 s | |
Q8 | 0.775 s | >3600 s | Hive Timeout |
Q9 | 1.836 s | >3600 s | Hive Timeout |
Q10 | 3.165 s | 179.646 s | |
Q11 | 1.37 s | 43.094 s | |
Q12 | 0.356 s | 101.193 s | |
Q13 | 2.233 s | 208.476 s | |
Q14 | 0.488 s | 89.047 s | |
Q15 | 0.72 s | 136.431 s | |
Q16 | 0.814 s | 105.652 s | |
Q17 | 0.681 s | 255.848 s | |
Q18 | 1.324 s | 337.921 s | |
Q19 | 0.417 s | >3600 s | Hive Timeout |
Q20 | 0.792 s | 193.965 s | |
Q21 | 0.720 s | 670.718 s | |
Q22 | 0.155 s | 68.462 s | |
Hive does relatively best on bulk load. This is understandable since this is a sequential read of many files in parallel with just compression to do.
Hive's query times are obviously affected by not having a persistent memory image of the data, as this is always streamed from the storage files into other files as MapReduce intermediate results. This seems to be an operator-at-a-time business as opposed to Virtuoso's vectorized streaming.
The queries that would do partitioned hash joins (e.g., Q9) did not finish under an hour in Hive, so we do not have a good metric of a cross-partition hash join.
One could argue that one should benchmark Hive only in disk-bound circumstances. We may yet get to this.
Our next stop will probably be Impala, which ought to do much better than Hive, as it does not have the MapReduce overheads.
If you are a Hive expert and believe that Hive should have done much better, please let us know how to improve the Hive scores, and we will retry.
]]>I had lunch with Stefan Manegold of CWI last week, where we talked about where European research should go. Stefan is involved in RETHINK big, a European research project for compiling policy advice regarding big data for EC funding agencies. As part of this, he is interviewing various stakeholders such as end user organizations and developers of technology.
RETHINK big wants to come up with a research agenda primarily for hardware, anything from faster networks to greener data centers. CWI represents software expertise in the consortium.
So, we went through a regular questionnaire about how we see the landscape. I will summarize this below, as this is anyway informative.
My own core competence is in core database functionality, specifically in high performance query processing, scale-out, and managing schema-less data. Most of the Virtuoso installed base is in the RDF space, but most potential applications are in fact outside of this niche.
The life sciences vertical is the one in which I have the most application insight, from going to Open PHACTS meetings and holding extensive conversations with domain specialists. We have users in many other verticals, from manufacturing to financial services, but there I do not have as much exposure to the actual applications.
Having said this, the challenges throughout tend to be in diversity of data. Every researcher has their MySQL database or spreadsheet, and there may not even be a top level catalogue of everything. Data formats are diverse. Some people use linked data (most commonly RDF) as a top level metadata format. The application data, such as gene sequences or microarray assays, reside in their native file formats and there is little point in RDF-izing these.
There are also public data resources that are published in RDF serializations as vendor-neutral, self-describing format. Having everything as triples, without a priori schema, makes things easier to integrate and in some cases easier to describe and query.
So, the challenge is in the labor intensive nature of data integration. Data comes with different levels of quantity and quality, from hand-curated to NLP extractions. Querying in the single- or double-digit terabyte range with RDF is quite possible, as we have shown many times on this blog, but most use cases do not even go that far. Anyway, what we see on the field is primarily a data diversity game. The scenario is data integration; the technology we provide is database. The data transformation proper, data cleansing, units of measure, entity de-duplication, and such core data-integration functions are performed using diverse, user-specific means.
Jerven Bolleman of the Swiss Institute of Bioinformatics is a user of ours with whom we have long standing discussions on the virtues of federated data and querying. I advised Stefan to go talk to him; he has fresh views about the volume challenges with unexpected usage patterns. Designing for performance is tough if the usage pattern is out of the blue, like correlating air humidity on the day of measurement with the presence of some genomic patterns. Building a warehouse just for that might not be the preferred choice, so the problem field is not exhausted. Generally, I’d go for warehousing though.
OK. Even a fast network is a network. A set of processes on a single shared-memory box is also a kind of network. InfiniBand is maybe half the throughput and 3x the latency of single threaded interprocess communication within one box. The operative word is latency. Making large systems always involves a network or something very much like one in large scale-up scenarios.
On the software side, next to nobody understands latency and contention; yet these are the core factors in any pursuit of scalability. Because of this situation, paradigms like MapReduce and bulk synchronous parallel (BSP) processing have become popular, because they take the communication out of the program flow, so the programmer cannot muck this up, as otherwise would happen with the inevitability of destiny. Of course, our beloved SQL, or declarative query in general, does give scalability in many tasks without programmer participation. Datalog has also been used as a means of shipping computation around, as in the work of Hellerstein.
There are no easy solutions. We have built scale-out conscious, vectorized extensions to SQL procedures where one can express complex parallel, distributed flows, but people do not use or understand these. These are very useful, even indispensable, but only on the inside, not as a programmer-facing construct. MapReduce and BSP are the limit of what a development culture will absorb. MapReduce and BSP do not hide the fact of distributed processing. What about things that do? Parallel, partitioned extensions to Fortran arrays? Functional languages? I think that all the obvious aids to parallel/distributed programming have been conceived of. No silver bullet; just hard work. And above all the discernment of what paradigm fits what problem. Since these are always changing, there is no finite set of rules, and no substitute for understanding and insight, and the latter are vanishingly scarce. "Paradigmatism," i.e., the belief that one particular programming model is a panacea outside of its original niche, is a common source of complexity and inefficiency. This is a common form of enthusiastic naïveté.
If you look at power efficiency, the clusters that are the easiest to program consist of relatively few high power machines and a fast network. A typical node size is 16+ cores and 256G or more RAM. Amazon has these in entirely workable configurations, as documented earlier on this blog. The leading edge in power efficiency is in larger number of smaller units, which makes life again harder. This exacerbates latency and forces one to partition the data more often, whereas one can play with replication of key parts of data more freely if the node size is larger.
One very specific item where research might help without having to rebuild the hardware stack would be better, lower-latency exposure of networks to software. Lightweight threads and user-space access, bypassing slow protocol stacks, etc. MPI has some of this, but maybe more could be done.
So, I will take a cluster of such 16-core, 256GB machines on a faster network, over a cluster of 1024 x 4G mobile phones connected via USB. Very selfish and unecological, but one has to stay alive and life is tough enough as is.
The transition from capex to opex may be approaching maturity, as there have been workable cloud configurations for the past couple of years. The EC2 from way back, with at best a 4 core 16G VM and a horrible network for $2/hr, is long gone. It remains the case that 4 months of 24x7 rent in the cloud equals the purchase price of physical hardware. So, for this to be economical long-term at scale, the average utilization should be about 10% of the peak, and peaks should not be on for more than 10% of the time.
So, database software should be rented by the hour. A 100-150% markup for the $2.80 a large EC2 instance costs would be reasonable. Consider that 70% of the cost in TPC benchmarks is database software.
There will be different pricing models combining different up-front and per-usage costs, just as there are for clouds now. If the platform business goes that way and the market accepts this, then systems software will follow. Price/performance quotes should probably be expressed as speed/price/hour instead of speed/price.
The above is rather uncontroversial but there is no harm restating these facts. Reinforce often.
This is a harder question. There is some European business in wide area and mobile infrastructures. Competing against Huawei will keep them busy. Intel and Mellanox will continue making faster networks regardless of European policies. Intel will continue building denser compute nodes, e.g., integrated Knight’s Corner with dual IB network and 16G fast RAM on chip. Clouds will continue making these available on demand once the technology is in mass production.
What’s the next big innovation? Neuromorphic computing? Quantum computing? Maybe. For now, I’d just do more engineering along the core competence discussed above, with emphasis on good marketing and scalable execution. By this I mean trained people who know something about deployment. There is a huge training gap. In the would-be "Age of Data," knowledge of how things actually work and scale is near-absent. I have offered to do some courses on this to partners and public alike, but I need somebody to drive this show; I have other things to do.
I have been to many, many project review meetings, mostly as a project partner but also as reviewer. For the past year, the EC has used an innovation questionnaire at the end of the meetings. It is quite vague, and I don’t think it delivers much actionable intelligence.
What would deliver this would be a venture capital type activity, with well-developed networks and active participation in developing a business. The EC is not now set up to perform this role, though. But the EC is a fairly large and wealthy entity, so it could invest some money via this type of channel. Also there should be higher individual incentives and rewards for speed and excellence. Getting the next Horizon 2020 research grant may be good, but better exists. The grants are competitive enough and the calls are not bad; they follow the times.
In the projects I have seen, productization does get some attention, e.g., the LOD2 stack, but it is not something that is really ongoing or with dedicated commercial backing. It may also be that there is no market to justify such dedicated backing. Much of the RDF work has been "me, too" — let’s do what the real database and data integration people do, but let’s just do this with triples. Innovation? Well, I took the best of the real DB world and adapted this to RDF, which did produce a competent piece of work with broad applicability, extending outside RDF. Is there better than this? Well, some of the data integration work (e.g., LIMES) is not bad, and it might be picked up by some of the players that do this sort of thing in the broader world, e.g., Informatica, the DI suites of big DB vendors, Tamr, etc. I would not know if this in fact adds value to the non-RDF equivalents; I do not know the field well enough, but there could be a possibility.
The recent emphasis for benchmarking, spearheaded by Stefano Bertolo is good, as exemplified by the LDBC FP7. There should probably be one or two projects of this sort going at all times. These make challenges known and are an effective means of guiding research, with a large multiplier: Once a benchmark gets adopted, infinitely more work goes into solving the problem than in stating it in the first place.
The aims and calls are good. The execution by projects is variable. For 1% of excellence, there apparently must be 99% of so-and-so, but this is just a fact of life and not specific to this context. The projects are rather diffuse. There is not a single outcome that gets all the effort. In this, the level of engagement of participants is less and focus is much more scattered than in startups. A really hungry, go-getter mood is mostly absent. I am a believer in core competence. Well, most people will agree that core competence is nice. But the projects I have seen do not drive for it hard enough.
It is hard to say exactly what kinds of incentives could be offered to encourage truly exceptional work. The American startup scene does offer high rewards and something of this could be transplanted into the EC project world. I would not know exactly what form this could take, though.
]]>xsd:boolean and TIMEZONE-less DATETIME & xsd:dateTime; and significantly improved compatibility with the Jena and Sesame Frameworks.
New product features as of June 24, 2015, v7.2.1, include:
Virtuoso Engine
TIMEZONE-less xsd:dateTime and DATETIME
xsd:boolean
TOP/SKIP
SPARQL
GROUPING SETS
EBV (Efficient Boolean Value)
define input:with-fallback-graph_uri
define input:target-fallback-graph-uri
Jena & Sesame Compatibility
rdf_insert_triple_c() to insert BNode data
xsd:boolean as true/false rather than 1/0
maxQueryTimeout in Sesame2 provider
JDBC Driver
setLogFileName and getLogFileName
"logFileName" to VirtuosoDataSources for logging support
Faceted Browser
Conductor and DAV
.TTL redirection
.TTL files
Virtuoso Commercial Edition
Virtuoso Open Source Edition
Note: This AMI is running a pre-release build of Virtuoso 7.5, Commercial Edition. Features are subject to change, and this build is not licensed for any use other than the AMI-based benchmarking described herein.
There are two preconfigured cluster setups; one is for two (2) machines/instances and one is for four (4). Generation and loading of TPC-H data, as well as the benchmark run itself, is preconfigured, so you can do it by entering just a few commands. The whole sequence of doing a terabyte (1000G) scale TPC-H takes under two hours, with 30 minutes to generate the data, 35 minutes to load, and 35 minutes to do three benchmark runs. The 100G scale is several times faster still.
To experiment with this AMI, you will need a set of license files, one per machine/instance, which our Sales Team can provide.
Detailed instructions are on the AMI, in /home/ec2-user/cluster_instructions.txt, but the basic steps to get up and running are as follows:
Instantiate machine image ami-811becea (the AMI ID is subject to change; you should be able to find the latest by searching for "OpenLink Virtuoso Benchmarks" in "Community AMIs"; this one is short-named virtuoso-bench-cl) with two or four (2 or 4) R3.8xlarge instances within one virtual private cluster and placement group. Make sure the VPC security is set to allow all connections.
Log in to the first, and fill in the configuration file with the internal IP addresses of all machines instantiated in step 1.
Distribute the license files to the instances, and start the OpenLink License Manager on each machine.
Run 3 shell commands to set up the file systems and the Virtuoso configuration files.
If you do not plan to run one of these benchmarks, you can simply start and work with the Virtuoso cluster now. It is ready for use with an empty database.
Before running one of these benchmarks, generate the appropriate dataset with the dbgen.sh command.
Bulk load the data with load.sh.
Run the benchmark with run.sh.
Right now the cluster benchmarks are limited to TPC-H but cluster versions of the LDBC Social Network and Semantic Publishing benchmarks will follow soon.
]]>In the following, the Amazon instance type is R3.8xlarge, each with dual Xeon E5-2670 v2, 244G RAM, and 2 x 300G SSD. The image is made from the Amazon Linux with built-in network optimization. We first tried a RedHat image without network optimization and had considerable trouble with the interconnect. Using network-optimized Amazon Linux images inside a virtual private cloud has resolved all these problems.
The network optimized 10GE interconnect at Amazon offers throughput close to the QDR InfiniBand running TCP-IP; thus the Amazon platform is suitable for running cluster databases. The execution that we have seen is not seriously network bound.
Run | Power | Throughput | Composite |
---|---|---|---|
1 | 523,554.3 | 590,692.6 | 556,111.2 |
2 | 565,353.3 | 642,503.0 | 602,694.9 |
Run | Power | Throughput | Composite |
---|---|---|---|
1 | 592,013.9 | 754,107.6 | 668,163.3 |
2 | 896,564.1 | 828,265.4 | 861,738.4 |
3 | 883,736.9 | 829,609.0 | 856,245.3 |
For the larger scale we did 3 sets of power + throughput tests to measure consistency of performance. By the TPC-H rules, the worst (first) score should be reported. Even after bulk load, this is markedly less than the next power score due to working set effects. This is seen to a lesser degree with the first throughput score also.
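For reference, the Composite figure is the geometric mean of the Power and Throughput scores, i.e., Composite = √(Power × Throughput); for instance, √(523,554.3 × 590,692.6) ≈ 556,111, matching the first run in the first table above.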
The numerical quantities summaries are available in a report.zip file, or individually --
Subsequent posts will explain how to deploy Virtuoso Elastic Clusters on AWS.
]]>TPC-H, the classic of SQL data warehousing
LDBC SNB, the new Social Network Benchmark from the Linked Data Benchmark Council
LDBC SPB, the RDF/SPARQL Semantic Publishing Benchmark from LDBC
This package is ideal for technology evaluators and developers interested in getting the most performance out of Virtuoso. This is also an all-in-one solution to any questions about reproducing claimed benchmark results. All necessary tools for building and running are included; thus any developer can use this model installation as a starting point. The benchmark drivers are preconfigured with appropriate settings, and benchmark qualification tests can be run with a single command.
The Benchmarks AMI includes a precompiled, preconfigured checkout of the v7fasttrack github repository, checkouts of the github repositories of the benchmarks, and a number of running directories with all configuration files preset and optimized. The image is intended to be instantiated on a R3.8xlarge Amazon instance with 244G RAM, dual Xeon E5-2670 v2, and 600G SSD.
Benchmark datasets and preloaded database files can be downloaded from S3 when large, and generated as needed on the instance when small. As an alternative, the instance is also set up to do all phases of data generation and database bulk load.
The following benchmark setups are included:
The AMI will be expanded as new benchmarks are introduced, for example, the LDBC Social Network Business Intelligence or Graph Analytics.
To get started:
Instantiate machine image ami-eb789280 (the AMI ID is subject to change; you should be able to find the latest by searching for "OpenLink Virtuoso Benchmarks" in "Community AMIs"; this one is short-named virtuoso-bench-6) with an R3.8xlarge instance.
Connect via ssh.
See the README (also found in the ec2-user's home directory) for full instructions on getting up and running.
First, let's recap what the benchmark is about:
The updates exist so as to invalidate strategies that rely too heavily on precomputation. The short lookups exist for the sake of realism; after all, an online social application does lookups for the most part. The medium complex queries are to challenge the DBMS.
The DBMS challenges have to do firstly with query optimization, and secondly with execution with a lot of non-local random access patterns. Query optimization is not a requirement, per se, since imperative implementations are allowed, but we will see that these are no more free of the laws of nature than the declarative ones.
The workload is arbitrarily parallel, so intra-query parallelization is not particularly useful, though not harmful either. There are latency constraints on operations which strongly encourage implementations to stay within a predictable time envelope regardless of specific query parameters. The parameters are a combination of person and date range, and sometimes tags or countries. The hardest queries have the potential to access all content created by people within 2 steps of a central person, so possibly thousands of people, times 2000 posts per person, times up to 4 tags per post. We are talking in the millions of key lookups, aiming for sub-second single-threaded execution.
The test system is the same as used in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The software is the feature/analytics branch of v7fasttrack, available from www.github.com.
The dataset is the SNB 300G set, with:
1,136,127 persons
125,249,604 knows edges
847,886,644 posts, including replies
1,145,893,841 tags of posts or replies
1,140,226,235 likes of posts or replies
As an initial step, we run the benchmark as fast as it will go. We use 32 threads on the driver side for 24 hardware threads.
Below are the numerical quantities for a 400K operation run after 150K operations worth of warmup.
Duration: 10:41.251
Throughput: 623.71 (op/s)
The statistics that matter are detailed below, with operations ranked in order of descending client-side wait-time. All times are in milliseconds.
% of total | total_wait | name | count | mean | min | max |
---|---|---|---|---|---|---|
20 % | 4,231,130 | LdbcQuery5 | 656 | 6,449.89 | 245 | 10,311 |
11 % | 2,272,954 | LdbcQuery8 | 18,354 | 123.84 | 14 | 2,240 |
10 % | 2,200,718 | LdbcQuery3 | 388 | 5,671.95 | 468 | 17,368 |
7.3 % | 1,561,382 | LdbcQuery14 | 1,124 | 1,389.13 | 4 | 5,724 |
6.7 % | 1,441,575 | LdbcQuery12 | 1,252 | 1,151.42 | 15 | 3,273 |
6.5 % | 1,396,932 | LdbcQuery10 | 1,252 | 1,115.76 | 13 | 4,743 |
5 % | 1,064,457 | LdbcShortQuery3PersonFriends | 46,285 | 22.9979 | 0 | 2,287 |
4.9 % | 1,047,536 | LdbcShortQuery2PersonPosts | 46,285 | 22.6323 | 0 | 2,156 |
4.1 % | 885,102 | LdbcQuery6 | 1,721 | 514.295 | 8 | 5,227 |
3.3 % | 707,901 | LdbcQuery1 | 2,117 | 334.389 | 28 | 3,467 |
2.4 % | 521,738 | LdbcQuery4 | 1,530 | 341.005 | 49 | 2,774 |
2.1 % | 440,197 | LdbcShortQuery4MessageContent | 46,302 | 9.50708 | 0 | 2,015 |
1.9 % | 407,450 | LdbcUpdate5AddForumMembership | 14,338 | 28.4175 | 0 | 2,008 |
1.9 % | 405,243 | LdbcShortQuery7MessageReplies | 46,302 | 8.75217 | 0 | 2,112 |
1.9 % | 404,002 | LdbcShortQuery6MessageForum | 46,302 | 8.72537 | 0 | 1,968 |
1.8 % | 387,044 | LdbcUpdate3AddCommentLike | 12,659 | 30.5746 | 0 | 2,060 |
1.7 % | 361,290 | LdbcShortQuery1PersonProfile | 46,285 | 7.80577 | 0 | 2,015 |
1.6 % | 334,409 | LdbcShortQuery5MessageCreator | 46,302 | 7.22234 | 0 | 2,055 |
1 % | 220,740 | LdbcQuery2 | 1,488 | 148.347 | 2 | 2,504 |
0.96 % | 205,910 | LdbcQuery7 | 1,721 | 119.646 | 11 | 2,295 |
0.93 % | 198,971 | LdbcUpdate2AddPostLike | 5,974 | 33.3062 | 0 | 1,987 |
0.88 % | 189,871 | LdbcQuery11 | 2,294 | 82.7685 | 4 | 2,219 |
0.85 % | 182,964 | LdbcQuery13 | 2,898 | 63.1346 | 1 | 2,201 |
0.74 % | 158,188 | LdbcQuery9 | 78 | 2,028.05 | 1,108 | 4,183 |
0.67 % | 143,457 | LdbcUpdate7AddComment | 3,986 | 35.9902 | 1 | 1,912 |
0.26 % | 54,947 | LdbcUpdate8AddFriendship | 571 | 96.2294 | 1 | 988 |
0.2 % | 43,451 | LdbcUpdate6AddPost | 1,386 | 31.3499 | 1 | 2,060 |
0.0086% | 1,848 | LdbcUpdate4AddForum | 103 | 17.9417 | 1 | 65 |
0.0002% | 44 | LdbcUpdate1AddPerson | 2 | 22 | 10 | 34 |
At this point we have in-depth knowledge of the choke points the benchmark stresses, and we can give a first assessment of whether the design meets its objectives for setting an agenda for the coming years of graph database development.
The implementation is well optimized in general but still has maybe 30% room for improvement. We note that this is based on a compressed column store. One could think that alternative data representations, like in-memory graphs of structs and pointers between them, are better for the task. This is not necessarily so; at the least, a compressed column store is much more space efficient. Space efficiency is the root of cost efficiency, since as soon as the working set is not in memory, a random access workload is badly hit.
The set of choke points (technical challenges) actually revealed by the benchmark is so far as follows:
Cardinality estimation under heavy data skew — Many queries take a tag or a country as a parameter. The cardinalities associated with tags vary from 29M posts for the most common to 1 for the least common. Q6 has a common tag (in the top few hundred) half the time, and a random, most often very infrequent, one the rest of the time. A declarative implementation must recognize the cardinality implications from the literal and plan accordingly. An imperative one would have to count. Missing this makes Q6 take about 40% of the time instead of 4.1% when adapting.
Covering indices — Being able to make multi-column indices that duplicate some columns from the table often saves an entire table lookup. For example, an index on post by author can also contain the post's creation date.
Multi-hop graph traversal — Most queries access a two-hop environment starting at a person. Two queries look for shortest paths of unbounded length. For the two-hop case, it makes almost no difference whether this is done as a union or a special graph traversal operator. For shortest paths, this simply must be built into the engine; doing this client-side incurs prohibitive overheads. A bidirectional shortest path operation is a requirement for the benchmark.
Top K — Most queries returning posts order results by descending date. Once there are at least k results, anything older than the kth can be dropped, adding a date selection as early as possible in the query. This interacts with vectored execution, so that starting with a short vector size more rapidly produces an initial top k. (A sketch of this pattern, together with late projection, follows this list.)
Late projection — Many queries access several columns and touch millions of rows but only return a few. The columns that are not used in sorting or selection can be retrieved only for the rows that are actually returned. This is especially useful with a column store, as this removes many large columns (e.g., the text of a post) from the working set.
Materialization — Q14 accesses an expensive-to-compute edge weight, the number of post-reply pairs between two people. Keeping this precomputed drops Q14 from the top place. Other materialization would be possible, for example Q2 (top 20 posts by friends), but since Q2 is just 1% of the load, there is no need. One could of course argue that this should be 20x more frequent, in which case there could be a point to this.
Concurrency control — Read-write contention is rare, as updates are randomly spread over the database. However, some pages get read very frequently, e.g., some middle-level index pages in the post table. Keeping a count of reading threads requires a mutex, and there is significant contention on this. Since the hot set can be one page, adding more mutexes does not always help. However, hash partitioning the index into many independent trees (as in the case of a cluster) helps with this. There is also contention on a mutex for assigning threads to client requests, as there are large numbers of short operations.
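To make the Top K and late projection points concrete, here is a hedged SQL sketch (column names in the ps_ style used in the schema discussion later on, and illustrative): the ordering and the top-20 cutoff run over a narrow index, and the wide content column is fetched only for the rows that survive.

-- Narrow top-k first: only ps_creatorid, ps_creationdate, ps_postid are touched
SELECT p.ps_postid, p.ps_creationdate, p.ps_content
  FROM (SELECT TOP 20 ps_postid, ps_creationdate
          FROM post
         WHERE ps_creatorid = 1234567
         ORDER BY ps_creationdate DESC
       ) newest,
       post p
 WHERE p.ps_postid = newest.ps_postid
 ORDER BY newest.ps_creationdate DESC;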
In subsequent posts, we will look at specific queries, what they in fact do, and what their theoretical performance limits would be. In this way we will have a precise understanding of which way SNB can steer the graph DB community.
For the future, an updated version of this list may be found on the main Virtuoso site.
GeoKnow D 2.6.1: Graph Analytics in the DBMS (2015-01-05)
This introduces the idea of unbundling basic cluster DBMS functionality like cross partition joins and partitioned group by to form a graph processing framework collocated with the data.
GeoKnow D2.4.1: Geospatial Clustering and Characteristic Sets (2015-01-06)
This presents experimental results of structure-aware RDF applied to geospatial data. The regularly structured part of the data goes in tables; the rest is triples/quads. Furthermore, for the first time in the RDF space, physical storage location is correlated to properties of entities, in this case geo location, so that geospatially adjacent items are also likely adjacent in the physical data representation.
LOD2 D2.1.5: 500 billion triple BSBM (2014-08-18)
This presents experimental results on lookup and BI workloads on Virtuoso cluster with 12 nodes, for a total of 3T RAM and 192 cores. This also discusses bulk load, at up to 6M triples/s and specifics of query optimization in scale-out settings.
LOD2 D2.6: Parallel Programming in SQL (2012-08-12)
This discusses ways of making SQL procedures partitioning-aware, so that one can, map-reduce style, send parallel chunks of computation to each partition of the data.
Pham, M.-D., Passing, L., Erling, O., and Boncz, P.A. "Deriving an Emergent Relational Schema from RDF Data," WWW, 2015.
This paper shows how RDF is in fact structured and how this structure can be reconstructed. This reconstruction then serves to create a physical schema, reintroducing all the benefits of physical design to the schema-last world. Experiments with Virtuoso show marked gains in query speed and data compactness.
Peter A. Boncz, Orri Erling, Minh-Duc Pham: Experiences with Virtuoso Cluster RDF Column Store. Linked Data Management 2014: 239-259
This book chapter gives an in-depth look at the performance dynamics of Virtuoso scale out.
P. A. Boncz, T. Neumann, and O. Erling. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. Proceedings of the TPC Technology Conference on Performance Evaluation & Benchmarking TPCTC, 2013.
This is a summary of all factors that make up analytics performance by those who know. The Virtuoso TPC-H blog series is a further development and commentary on these same truths.
Orri Erling: Virtuoso, a Hybrid RDBMS/Graph Column Store. IEEE Data Eng. Bull. (DEBU) 35(1):3-8 (2012)
This paper introduces the Virtuoso column store architecture and design choices. One design is made to serve both random updates and lookups as well as the big scans where column stores traditionally excel. Examples are given from both TPC-H and the schema-less RDF world.
Minh-Duc Pham, Peter A. Boncz, Orri Erling: S3G2: A Scalable Structure-Correlated Social Graph Generator. TPCTC 2012:156-172
This paper presents the basis of the social network benchmarking technology later used in the LDBC benchmarks.
Christian Bizer, Peter A. Boncz, Michael L. Brodie, Orri Erling: "The Meaningful Use of Big Data: Four Perspectives – Four Challenges." SIGMOD Record (SIGMOD) 40(4):56-60 (2011)
This is an anthology of views by industry thought leaders on what semantics could or ought to contribute to the practice of data management.
Orri Erling, Ivan Mikhailov: Faceted Views over Large-Scale Linked Data. LDOW 2009
This paper introduces anytime query answering as an enabling technology for open-ended querying of large data on public service end points. While not every query can be run to completion, partial results can most often be returned within a constrained time window.
Orri Erling, Ivan Mikhailov: Virtuoso: RDF Support in a Native RDBMS. Semantic Web Information Management 2009:501-519
This is a general presentation of how a SQL engine needs to be adapted to serve a run-time typed and schema-less workload.
Orri Erling, Ivan Mikhailov: Integrating Open Sources and Relational Data with SPARQL. ESWC 2008:838-842
This paper introduces the still challenging RDF-H benchmark, an RDF translation of the classic TPC-H. Running this over SPARQL to SQL mapping is considered.
Orri Erling, Ivan Mikhailov: RDF Support in the Virtuoso DBMS. CSSW 2007:59-68
This is an initial discussion of RDF support in Virtuoso. Most specifics are by now different but this can give a historical perspective.
In the case of Virtuoso, we have played with SQL and SPARQL implementations. For a fixed schema and well known workload, SQL will always win. The reason is that SQL allows materialization of multi-part indices and data orderings that make sense for the application. In other words, there is transparency into physical design. An RDF/SPARQL-based application may also have physical design by means of structure-aware storage, but this is more complex and here we are just concerned with speed and having things work precisely as we intend.
SNB has a regular schema described by a UML diagram. This has a number of relationships, of which some have attributes. There are no heterogeneous sets, i.e., no need for run-time typed attributes or graph edges with the same label but heterogeneous end-points. Translation into SQL or SPARQL is straightforward. Edges with attributes (e.g., the foaf:knows relation between people) would end up represented as a subject with the end points and the effective date as properties. The relational implementation has a two-part primary key and the effective date as a dependent column. A native property graph database would use an edge with an extra property for this, as such are typically supported.
The only table-level choice has to do with whether posts and comments are kept in the same or different data structures. The Virtuoso schema uses a single table for both, with nullable columns for the properties that occur only in one. This makes the queries more concise. There are cases where only non-reply posts of a given author are accessed. This is supported by having two author foreign key columns, each with its own index. There is a single nullable foreign key from the reply to the post/comment being replied to.
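As a rough illustration of this layout, a minimal sketch of the combined table might look as follows. The names beyond ps_postid, ps_creatorid, and ps_creationdate are hypothetical, chosen for the example rather than taken verbatim from the Virtuoso SNB schema:

```sql
-- Sketch, not the actual SNB DDL: one table holds both posts and comments,
-- with nullable columns for attributes that occur in only one of the two.
CREATE TABLE post (
  ps_postid        BIGINT PRIMARY KEY,
  ps_creatorid     BIGINT,      -- author, set for original (non-reply) posts
  ps_c_creatorid   BIGINT,      -- author, set for comments (replies); hypothetical name
  ps_reply_of      BIGINT,      -- post/comment being replied to; NULL for original posts
  ps_creationdate  BIGINT,      -- milliseconds since epoch, see below
  ps_imagefile     VARCHAR,     -- only original posts carry an image
  ps_content       VARCHAR
);

-- Two author foreign-key columns, each with its own index, so that
-- "original posts by author" and "comments by author" are both direct index lookups.
CREATE INDEX post_creator   ON post (ps_creatorid);
CREATE INDEX post_c_creator ON post (ps_c_creatorid);
```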
The workload has some frequent access paths that need to be supported by an index. Some queries reward placing extra columns in indices. For example, a common pattern is accessing the most recent posts of an author or a group of authors. There, having a composite key of ps_creatorid, ps_creationdate, ps_postid pays off, since the top-k on creationdate can be pushed down into the index without needing a reference to the table.
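A sketch of this pattern under the same assumed schema; whether the dialect spells it TOP or LIMIT is a detail:

```sql
-- Composite index covering author, date, and id: the top-k by date can be
-- produced from the index alone, with no lookup of the base table row.
CREATE INDEX post_creator_date ON post (ps_creatorid, ps_creationdate, ps_postid);

-- Example: 20 most recent posts by a given author; with the index above this
-- is a descending index scan that stops after 20 rows.
SELECT TOP 20 ps_postid, ps_creationdate
  FROM post
 WHERE ps_creatorid = ?
 ORDER BY ps_creationdate DESC;
```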
The implementation is free to choose data types for attributes, particularly datetimes. The Virtuoso implementation adopts the practice of the Sparksee and Neo4j implementations and represents these as a count of milliseconds since the epoch. This is less confusing, faster to compare, and more compact than a native datetime datatype that may or may not have timezones, etc. Using a built-in datetime seems to be nearly always a bad idea. A dimension table or a plain number for a time dimension avoids the ambiguities of a calendar, or at least makes them explicit.
The benchmark allows procedurally maintained materializations of intermediate results for use by queries, as long as these are maintained transaction-by-transaction. For example, each person could have the 20 newest posts by their immediate contacts precomputed. This would reduce Q2 ("top of the wall") to a single lookup. This does not, however, appear to be worthwhile. The Virtuoso implementation does do one such materialization, for Q14: A connection weight is calculated for every pair of persons that know each other. This is related to the count of replies by either to content generated by the other. If there is not a single reply in either direction, the weight is taken to be 0. This weight is precomputed after bulk load and subsequently maintained each time a reply is added. The table for this is the only row-wise structure in the schema and represents a half-matrix of connected people, i.e., person1, person2 -> weight. Person1 is by convention the one with the smaller p_personid. Note that comparing IDs in this way is useful but not normally supported by SPARQL/RDF systems. SPARQL would end up comparing strings of URIs, with disastrous performance implications, unless an implementation-specific trick were used.
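A rough sketch of this materialization; the table and column names, and the use of a simple reply count as the weight, are assumptions made for illustration rather than the exact Q14 definition:

```sql
-- Half-matrix of connection weights: one row per connected pair of persons,
-- stored with the smaller person id first (cw_person1 < cw_person2 by convention).
CREATE TABLE conn_weight (
  cw_person1  BIGINT,
  cw_person2  BIGINT,
  cw_weight   REAL,
  PRIMARY KEY (cw_person1, cw_person2)
);

-- Maintenance on insert of a reply: the application computes
--   p1 = min(replier, author), p2 = max(replier, author)
-- and bumps the pair's weight (inserting the row first if it does not yet exist).
UPDATE conn_weight
   SET cw_weight = cw_weight + 1
 WHERE cw_person1 = ? AND cw_person2 = ?;
```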
In the next installment, we will analyze an actual run.
With two implementations of SNB Interactive at four different scales, we can take a first look at what the benchmark is really about. The hallmark of a benchmark implementation is that its performance characteristics are understood; even if these do not represent the maximum attainable, there are no glaring mistakes; and the implementation represents a reasonable best effort by those who ought to know, namely the system vendors.
The essence of a benchmark is a set of trick questions or "choke points," as LDBC calls them. A number of these were planned from the start. It is then the role of experience to tell whether addressing these is really the key to winning the race. Unforeseen ones will also surface.
So far, we see that SNB confronts the implementor with choices in the following areas:
Data model — Tabular relational (commonly known as SQL), graph relational (including RDF), property graph, etc.
Physical storage model — Row-wise vs. column-wise, for instance.
Ordering of materialized data — Sorted projections, composite keys, replicating columns in auxiliary data structures, etc.
Persistence of intermediate results — Materialized views, triggers, precomputed temporary tables, etc.
Query optimization — join order/type, interesting physical data orderings, late projection, top k, etc.
Parameters vs. literals — Sometimes different parameter values result in different optimal query plans.
Predictable, uniform latency — Measurement rules stipulate that the SUT (system under test) must not fall behind the simulated workload.
Durability — How to make data durable while maintaining steady throughput, e.g., logging, checkpointing, etc.
In the process of making a benchmark implementation, one naturally encounters questions about the validity, reasonability, and rationale of the benchmark definition itself. Additionally, even though the benchmark might not directly measure certain aspects of a system, making an implementation will take a system past its usual envelope and highlight some operational aspects.
Data generation — Generating a mid-size dataset takes time, e.g., 8 hours for 300G. In a cloud situation, keeping the dataset in S3 or similar is necessary; re-generating every time is not an option.
Query mix — Are the relative frequencies of the operations reasonable? What bias does this introduce?
Uniformity of parameters — Due to non-uniform data distributions in the dataset, there is easily a 100x difference between "fast" and "slow" cases of a single query template. How long does one need to run to balance these fluctuations?
Working set — Experience shows that there is a large difference between almost-warm and steady-state of working set. This can be a factor of 1.5 in throughput.
Reasonability of latency constraints — In the present case, a qualifying run must have no more than 5% of all query executions starting over 1 second late. Each execution is scheduled beforehand and done at the intended time. If the SUT does not keep up, it will have all available threads busy and must finish some work before accepting new work, so some queries will start late. Is this a good criterion for measuring consistency of response time? There are some obvious possibilities for abuse. (A sketch of how such a check might look follows this list.)
Ease of benchmark implementation/execution — Perfection is open-ended and optimization possibilities infinite, albeit with diminishing returns. Still, getting started should not be too hard. Since systems will be highly diverse, testing that these in fact do the same thing is important. The SNB validation suite is good for this and, given publicly available reference implementations, the effort of getting started is not unreasonable.
Ease of adjustment — Since a qualifying run must meet latency constraints while going as fast as possible, setting the performance target involves trial and error. Does the tooling make this easy?
Reasonability of durability rule — Right now, one is not required to do checkpoints but must report the time to roll forward from the last checkpoint or initial state. Inspiring vendors to build faster recovery is certainly good, but we are not through with all the implications. What about redundant clusters?
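On the latency criterion above: given a driver-side log of scheduled versus actual start times (the exec_log table here is hypothetical, not part of the SNB tooling), the check could be as simple as the sketch below, with times stored as milliseconds since epoch:

```sql
-- Percentage of query executions that started more than 1 second late.
SELECT 100.0 * SUM(CASE WHEN actual_start_ms > scheduled_start_ms + 1000 THEN 1 ELSE 0 END)
             / COUNT(*) AS pct_late
  FROM exec_log;
-- A qualifying run requires pct_late <= 5.
```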
The following posts will look at the above in light of actual experience.
We think you'll find some interesting details in the statistics. There are also some important notes about Virtuoso configuration options and other sneaky technical issues that can surprise you (as they did us!) when exposing an ad-hoc query server to the world.
Peter's description of his domain was roughly as follows, summarized from memory:
The new chair is for data analysis and engines for this purpose. The data analysis engine includes the analytical DBMS but is a broader category. For example, the diverse parts of the big data chain (including preprocessing, noise elimination, feature extraction, natural language extraction, graph analytics, and so forth) fall under this category, and most of these things are usually not done in a DBMS. For anything that is big, the main challenge remains one of performance and time to solution. These things are being done, and will increasingly be done, on a platform with heterogeneous features, e.g., CPU/GPU clusters, possibly custom hardware like FPGAs, etc. This is driven by factors of cost and energy efficiency. Different processing stages will sometimes be distributed over a wide area, as for example in instrument networks and any network infrastructure, which is wide area by definition.
The design space of databases and everything around them is huge, and any exhaustive exploration is impossible. Development times are long, and a platform might take ten years to mature. This sits poorly with academic funding cycles. However, we should not leave all the research in this to industry, as industry maximizes profit, not innovation or absolute performance. Architecting data systems has aspects of an art. Consider the parallel with the architecture of buildings: There are considerations of function, compatibility with environment, cost, restrictions arising from the materials at hand, and so forth. How a specific design will work cannot be known without experiment. The experiments themselves must be designed to make sense. This is not an exact science with clear-cut procedures and exact metrics of success.
This is the gist of Peter's description of our art. Peter's successes, best exemplified by MonetDB and Vectorwise, arise from focus on a special problem area and from developing and systematically applying specific insights to a specific problem. This process led to the emergence of the column store, which is now a mainstream thing. The DBMS that does not do columns is by now behind the times.
Needless to say, I am a great believer in core competence. Not every core competence is exactly the same. But a core competence needs to be broad enough that its integral mastery and consistent application can produce a unit of value that is valuable in itself. What and how broad this is varies a great deal. Typically such a unit of value is something that sits behind a "natural interface." This defies exhaustive definition, but the examples below may give a hint. Looking at value chains and the diverse things in them that carry a price tag may be another guideline.
There is a sort of Hegelian dialectic to technology trends: At the start, it was generally believed that a DBMS would be universal like the operating system itself, with a few products of very similar functionality covering the whole field. The antithesis came with Michael Stonebraker declaring that one size no longer fits all. Since then, the transactional (OLTP) and analytical (OLAP) sides are clearly divided. The eventual synthesis may be in the air, with pioneering work like HyPer led by Thomas Neumann of TU München. Peter, following his Humboldt prize, has spent a couple of days a week in Thomas's group, and I have joined him there a few times. The key to eventually bridging the gap would be compilation and adaptivity. If the workload is compiled on demand, then the right data structures could always be at hand.
This might be the start of a shift similar to the column store turning the DBMS on its side, so to say.
In the mainstream of software engineering, objects, abstractions and interfaces are held to be a value almost in and of themselves. Our science, that of performance, stands in apparent opposition to at least any naive application of the paradigm of objects and interfaces. Interfaces have a cost, and boxes limit transparency into performance. So inlining and merging distinct (in principle) processing phases is necessary for performance. Vectoring is one take on this: An interface that is crossed just a few times is much less harmful than one crossed a billion times. Using compilation, or at least type-and-data-structure-specific variants of operators and switching their application based on run-time observed behaviors, is another aspect of this.
Information systems thus take on more attributes of nature, i.e., more interconnectedness and adaptive behaviors.
Something quite universal might emerge from the highly problem-specific technology of the column store. The big scan, selective hash join plus aggregation, has been explored in slightly different ways by all of HyPer, Vectorwise, and Virtuoso.
Interfaces are not good or bad, in and of themselves. Well-intentioned naïveté in their use is bad. As in nature, there are natural borders in the "technosphere"; declarative query languages, processor instruction sets, and network protocols are good examples. Behind a relatively narrow interface lies a world of complexity of which the unsuspecting have no idea. In biology, the cell membrane might be an analogy, but this is in all likelihood more permeable and diverse in function than the techno examples mentioned.
With the experience of Vectorwise and later Virtuoso, it turns out that vectorization without compilation is good enough for TPC-H. Indeed, I see a few percent of gain at best from further breaking of interfaces and "biology-style" merging of operators and adding inter-stage communication and self-balancing. But TPC-H is not the end of all things, even though it is a sort of rite of passage: Jazz players will do their take on Green Dolphin Street and Summertime.
Science is drawn towards a grand unification of all which is. Nature, on the other hand, discloses more and more diversity and special cases, the closer one looks. This may be true of physical things, but also of abstractions such as software systems or mathematics.
So, let us look at the generalized DBMS, or the data analysis engine, as Peter put it. The use of DBMS technology is hampered by its interface, i.e., the declarative query language. The well-known counter-reactions to this are the NoSQL, MapReduce, and graph DB memes, which expose lower-level interfaces. But then the interface gets put in entirely the wrong place, denying most of the things that make the analytics DBMS extremely good at what it does.
We need better and smarter building blocks and interfaces at zero cost. We continue to need blocks of some sort, since algorithms would stop being understandable without any data/procedural abstraction. At run time, the blocks must overlap and interpenetrate: Scan plus hash plus reduction in one loop, for example. Inter-thread, inter-process status sharing for things like top k for faster convergence, for another. Vectorized execution of the same algorithm on many data for things like graph traversals. There are very good single blocks, like GPU graph algorithms, but interface and composability are ever the problem.
So, we must unravel the package that encapsulates the wonders of the analytical DBMS. These consist of scan, hash/index lookup, partitioning, aggregation, expression evaluation, scheduling, message passing and related flow control for scale-out systems, just to mention a few. The complete list would be under 30 items long, with blocks parameterized by data payload and specific computation.
By putting these together in a few new ways, we will cover much more of the big data pipeline. Just-in-time compilation may well be the way to deliver these components in an application/environment tailored composition. Yes, keep talking about block diagrams, but never once believe that this represents how things work or ought to work. The algorithms are expressed as distinct things, but at the level of the physical manifestation, things are parallel and interleaved.
The core skill for architecting the future of data analytics is correct discernment of abstraction and interface. What is generic enough to be broadly applicable yet concise enough to be usable? When should the computation move, and when should the data move? What are easy ways of talking about data location? How can the application developer be protected from various inevitable stupidities?
No mistake about it, there are at present very few people with the background for formulating the blueprint for the generalized data pipeline. These will mostly be drawn from architects of DBMSs. The prospective user is any present-day user of an analytics DBMS, Hadoop, or the like. By and large, SQL has worked well within its area of applicability. If there had never been an anti-SQL rebel faction, SQL would not have been successful. Now that a broader workload definition calls for a redefinition of interfaces, so as to use the best where it fits, there is a need to re-evaluate the imperative vs. declarative question.
T. S. Eliot once wrote that humankind cannot bear very much reality. It seems that, in reality, we can deconstruct the DBMS and redeploy the state of the art to serve novel purposes across a broader set of problems. This is a cross-over that slightly readjusts the mental frame of the DBMS expert but leaves the core precepts intact. In other words, this is a straightforward extension of core competence with no slide into the dilettantism of doing a little bit of everything.
People like MapReduce and stand-alone graph programming frameworks, because these do one specific thing and are readily understood. By and large, these are orders of magnitude simpler than the DBMS. Even when the DBMS provides in-process Java or CLR, these are rarely used. The single-purpose framework is a much narrower core competence, and thus less exclusive, than the high art of the DBMS, plus it has a faster platform development cycle.
In the short term, we will look at opening the SQL internal toolbox for graph analytics applications. I was discussing this idea with Thomas Neumann at Peter Boncz's party. He asked who would be the user. I answered that doing good parallel algorithms, even with powerful shorthands, was an expert task; so the people doing new types of analytics would be mostly on the system vendor side. However, modifying such for input selection and statistics gathering would be no harder than doing the same with ready-made SQL reports.
There is significant possibility for generalization of the leading edge of database technology. How will this fare against single-model frameworks? We hope to shed some light on this in the final phase of LDBC and beyond.
The lecture touched on the fact of the data economy and the possibilities of E-science. Peter proceeded to address issues of ethics of cyberspace and the fact of legal and regulatory practice trailing far behind the factual dynamics of cyberspace. In conclusion, Peter gave some pointers to his research agenda; for example, use of just-in-time compilation for fusing problem-specific logic with infrastructure software like databases for both performance and architecture adaptivity.
There was later a party in Amsterdam with many of the local database people as well as some from further away, e.g., Thomas Neumann of Munich, and Marcin Zukowski, Vectorwise founder and initial CEO.
I should have had the presence of mind to prepare a speech for Peter. Stefan Manegold of CWI did give a short address at the party, while presenting the gifts from Peter's CWI colleagues. To this I will add my belated part here, as follows:
If I were to describe Prof. Boncz, our friend, co-worker, and mentor, in one word, it would be "man of knowledge." If physicists define energy as that which can do work, then knowledge would be that which can do meaningful work. A schematic in itself does nothing; knowledge is needed to bring it to life. Yet this is more than an outstanding specialist skill, as it implies discerning the right means in the right context and includes the will and ability to go through with it. As Peter now takes on the mantle of professor, the best students will, I am sure, not fail to recognize excellence and be accordingly inspired to strive for the sort of industry-changing accomplishments we have come to associate with Peter's career so far. This is what our world needs. A big cheer for Prof. Boncz!
I did talk to many at the party, especially Pham Minh Duc, who is doing schema-aware RDF in MonetDB, and many others among the excellent team at CWI. Stefan Manegold told me about Rethink Big, an FP7 project for big data policy recommendations. I was meant to be an advisor and still hope to go to one of their meetings for some networking about policy. On the other hand, the EU agenda and priorities, as discussed with, for example, Stefano Bertolo, are, as far as I am concerned, on the right track: The science of performance must meet with real, or at least realistic, data. Peter did not fail to mention this same truth in his lecture: Spinoffs play a key part in research, and exposure to the world out there gives research both focus and credibility. As René Char put it in his poem L'Allumette (The Matchstick), "La tête seule a pouvoir de prendre feu au contact d'une réalité dure." ("The head alone has power to catch fire at the touch of hard reality.") Great deeds need great challenges, and there is nothing like reality to exceed man's imagination.
For my part, I was advertising the imminent advances in the Virtuoso RDF and graph functionality. Now that the SQL part, which is anyway the necessary foundation for all this, is really very competent, it is time to deploy these same things in slightly new ways. This will produce graph analytics and structure-aware RDF to match relational performance while keeping schema-last-ness. Anyway, the claim has been made; we will see how it is delivered during the final phase of LDBC and Geoknow.
As an initial take on the issue, we run 100 GB and 1000 GB on the test system. 100 GB is trivially in memory; 1000 GB is not, as total memory is 384 GB, of which 360 GB may be used by the processes.
We run 2 workloads on the 100 GB database, having pre-loaded the data in memory:
run | power | throughput | composite |
---|---|---|---|
1 | 349,027.7 | 420,503.1 | 383,102.1 |
2 | 387,890.3 | 433,066.6 | 409,856.5 |
This is directly comparable to the 100 GB single-server results. Comparing the second runs, we see a 1.53x gain in power and a 1.8x gain in throughput from 2x the platform. This is fully on the level for a workload that is not trivially parallel, as we have seen in the previous articles. The difference between the first and second runs at 100 GB comes, for both single-server and cluster, from the latency of allocating transient query memory. For an official run, where the weakest link is the first power test, this would simply have to be pre-allocated.
We run 2 workloads on the 1000 GB database, starting from cold.
The result is:
run | power | throughput | composite |
---|---|---|---|
1 | 136,744.5 | 147,374.6 | 141,960.1 |
2 | 199,652.0 | 125,161.1 | 158,078.0 |
The 1000 GB result is not for competition with this platform; more memory would be needed. For actual applications, the numbers are still in the usable range, though.
The 1000 GB setup uses 4 SSDs for storage, one per server process. The server processes are each bound to their own physical CPU.
We look at the meters: 32M pages (8M per process) are in memory at any time. Over the 2 benchmark executions there are a total of 494M disk reads. The total CPU time is 165,674 seconds, of which about 10% is system time, over 10,063 seconds of real time. Cumulative disk-read wait time is 130,177 s. This gives an average disk read throughput of 384 MB/s.
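Assuming the 8 KB page size of the Virtuoso column store (and reading KB and MB as binary units), the stated throughput follows directly from the read count and the elapsed time:

$$ \frac{494 \times 10^{6}\ \text{pages} \times 8\ \text{KB}}{10{,}063\ \text{s}} \approx 384\ \text{MB/s} $$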
This is easily sustained by 4 SSDs; in practice, the maximum throughput we see for reading is 1 GB/s (256 MB/s per SSD). Newer SSDs would do maybe twice that. Using rotating media would not be an option.
Without the drop in CPU caused by waiting for SSD, we would have numbers very close to the 100 GB numbers.
The interconnect traffic for the two runs was 1,077 GB with no message compression. The write block time was 448 seconds of thread-time. So we see that blocking on write hurts platform utilization when running under optimal conditions, but compared to going to secondary storage, it is not a large factor.
The 1000 GB scale has a transient peak memory consumption of 42 GB. This consists of hash-join build sides and GROUP BYs. The greatest memory consumers are Q9 with 9 GB, Q13 with 11 GB, and Q16 with 7 GB. Having many of these at a time drives up the transient peak. The peak gets higher as the scale grows, also because a larger scale requires more concurrent query streams. At the 384 GB for 1000 GB ratio, we do not yet get into memory saving plans like hash joins in many passes or index use instead of hash. When the data size grows, replicated hash build sides will become less convenient, and communication will increase. Q9 and Q13 can be done by index with almost no transient memory, but these plans are easily 3x less efficient for CPU. These will probably help at 3000 GB and be necessary at least part of the time at 10,000 GB.
The I/O volume in MB per index over the 2 executions is:
index | MB |
---|---|
LINEITEM | 1,987,483 |
ORDERS | 1,440,526 |
PARTSUPP | 199,335 |
PART | 161,717 |
CUSTOMER | 43,276 |
O_CK | 19,085 |
SUPPLIER | 13,393 |
Of this, maybe 600 GB could be saved by stream-compressing o_comment. Otherwise this cannot be helped without adding memory. The lineitem reads are mostly for l_extendedprice, which is not compressible. If compressing o_comment made l_extendedprice always fit in memory, then there would be a radical drop in I/O. Also, as a matter of fact, the least-recently-used buffer management policy works the very worst for big scans, specifically those of l_extendedprice: If the head is replaced when reading the tail, and the next read starts from the head, then the whole table/column is read all over again. Caching policies that specially recognize scans of this sort could further reduce I/O. Clustering lineitem/orders on date, as Actian Vector TPC-H implementations do, also starts yielding a greater gain when not running from memory: One column (e.g., l_shipdate) may be scanned for the whole table but, if the matches are bunched together, then most of l_extendedprice will not be read at all. Still, if going for top ranks in the races, all will be from memory, or at least there will be SSDs with read throughput around 150 MB/s per core, so these tricks become relatively less important.
In the 100 GB numerical quantities summaries, we see much the same picture as in the single-server. Queries get faster, but their relative times are not radically different. The throughput test (many queries at a time) times are more or less multiples of the power (single user) times. This picture breaks at 1000 GB where I/O first drops the performance to under half and introduces huge variation in execution times within a single query. The time entirely depends on which queries are running along with or right before the execution and on whether these have the same or different working sets. All the streams have the same queries with different parameters, but the query order in each stream is different.
The numerical quantities follow for all the runs. Note that the first 1000 GB run is cold. A competition grade 1000 GB result can be made with double the memory, and the more CPU the better. We will try one at Amazon in a bit.
***
The conclusion is that scale-out pays from the get-go. At present prices, a system with twice the power of a single node of the test system is cost-effective. Scales of up to 500 GB fit a single commodity server, under $10K. Rather than going from a mid-to-large dual-socket box to a quad-socket box, one is likely to be better off having two cheaper dual-socket boxes. These are also readily available on clouds, whereas scale-up configurations are not. From 1 TB onwards, a cluster is expected to clearly win. At 3 TB, a commodity cluster will clearly be the better deal for both price and absolute performance.
Report Date | October 3, 2014 |
---|---|
Database Scale Factor | 100 |
Total Data Storage/Database Size | 0M |
Query Streams for Throughput Test | 5 |
Virt-H Power | 349,027.7 |
Virt-H Throughput | 420,503.1 |
Virt-H Composite Query-per-Hour Metric (Qph@100GB) | 383,102.1 |
Measurement Interval in Throughput Test (Ts) | 94.273000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 10/03/2014 15:05:07 | 10/03/2014 15:05:40 | 0:00:33 |
Stream 1 | 10/03/2014 15:05:42 | 10/03/2014 15:07:15 | 0:01:33 |
Stream 2 | 10/03/2014 15:05:42 | 10/03/2014 15:07:15 | 0:01:33 |
Stream 3 | 10/03/2014 15:05:42 | 10/03/2014 15:07:16 | 0:01:34 |
Stream 4 | 10/03/2014 15:05:42 | 10/03/2014 15:07:14 | 0:01:32 |
Stream 5 | 10/03/2014 15:05:42 | 10/03/2014 15:07:15 | 0:01:33 |
Refresh 0 | 10/03/2014 15:05:07 | 10/03/2014 15:05:13 | 0:00:06 |
10/03/2014 15:05:41 | 10/03/2014 15:05:42 | 0:00:01 | |
Refresh 1 | 10/03/2014 15:06:48 | 10/03/2014 15:07:03 | 0:00:15 |
Refresh 2 | 10/03/2014 15:05:42 | 10/03/2014 15:06:06 | 0:00:24 |
Refresh 3 | 10/03/2014 15:06:06 | 10/03/2014 15:06:20 | 0:00:14 |
Refresh 4 | 10/03/2014 15:06:20 | 10/03/2014 15:06:35 | 0:00:15 |
Refresh 5 | 10/03/2014 15:06:35 | 10/03/2014 15:06:48 | 0:00:13 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 2.045198 | 0.337315 | 1.129548 | 0.327029 | 1.230955 | 0.473090 | 0.979096 | 0.852639 |
Stream 1 | 4.521951 | 0.596538 | 3.464342 | 1.167101 | 3.944699 | 1.744325 | 5.442328 | 4.706185 |
Stream 2 | 4.678728 | 0.837205 | 3.594060 | 1.911751 | 3.942459 | 0.947788 | 3.821267 | 4.686319 |
Stream 3 | 5.126384 | 0.932394 | 0.961762 | 1.043759 | 5.359990 | 1.035597 | 3.056079 | 5.803445 |
Stream 4 | 4.497118 | 0.381036 | 4.665412 | 1.224975 | 5.316591 | 1.666253 | 2.297872 | 6.425171 |
Stream 5 | 4.080968 | 0.493741 | 4.416305 | 0.879202 | 5.705877 | 1.615987 | 3.846881 | 3.346686 |
Min Qi | 4.080968 | 0.381036 | 0.961762 | 0.879202 | 3.942459 | 0.947788 | 2.297872 | 3.346686 |
Max Qi | 5.126384 | 0.932394 | 4.665412 | 1.911751 | 5.705877 | 1.744325 | 5.442328 | 6.425171 |
Avg Qi | 4.581030 | 0.648183 | 3.420376 | 1.245358 | 4.853923 | 1.401990 | 3.692885 | 4.993561 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 3.575916 | 2.786656 | 1.579488 | 0.611454 | 3.132460 | 0.685095 | 0.955559 | 1.060110 |
Stream 1 | 9.551437 | 7.187181 | 5.816455 | 2.004946 | 9.461347 | 5.624020 | 5.517677 | 2.924265 |
Stream 2 | 9.637427 | 6.641804 | 6.359532 | 2.412576 | 8.819754 | 3.335494 | 4.549792 | 3.163920 |
Stream 3 | 11.041451 | 6.464479 | 6.982671 | 3.272975 | 8.342983 | 3.448635 | 4.405911 | 2.886393 |
Stream 4 | 8.860228 | 6.754529 | 7.065501 | 3.225236 | 8.789565 | 3.419165 | 4.240718 | 2.399092 |
Stream 5 | 7.339672 | 8.121027 | 6.261988 | 2.711946 | 8.764934 | 3.106366 | 6.544712 | 3.472092 |
Min Qi | 7.339672 | 6.464479 | 5.816455 | 2.004946 | 8.342983 | 3.106366 | 4.240718 | 2.399092 |
Max Qi | 11.041451 | 8.121027 | 7.065501 | 3.272975 | 9.461347 | 5.624020 | 6.544712 | 3.472092 |
Avg Qi | 9.286043 | 7.033804 | 6.497229 | 2.725536 | 8.835717 | 3.786736 | 5.051762 | 2.969152 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 1.433789 | 0.972152 | 0.780247 | 1.287222 | 1.360084 | 0.254051 | 6.201742 | 1.219707 |
Stream 1 | 3.398354 | 2.591249 | 3.021207 | 4.663204 | 4.775704 | 1.116547 | 8.770115 | 5.643550 |
Stream 2 | 6.811520 | 3.411846 | 2.634076 | 4.296810 | 4.669635 | 2.282003 | 18.039617 | 6.060465 |
Stream 3 | 4.947110 | 2.479268 | 2.952951 | 6.431644 | 5.469152 | 1.816467 | 8.271266 | 5.498956 |
Stream 4 | 5.240237 | 2.062261 | 2.734378 | 6.055141 | 2.997684 | 2.519301 | 7.889700 | 6.944722 |
Stream 5 | 4.839670 | 3.379315 | 3.231582 | 6.255944 | 3.759509 | 1.347830 | 8.707303 | 4.376033 |
Min Qi | 3.398354 | 2.062261 | 2.634076 | 4.296810 | 2.997684 | 1.116547 | 7.889700 | 4.376033 |
Max Qi | 6.811520 | 3.411846 | 3.231582 | 6.431644 | 5.469152 | 2.519301 | 18.039617 | 6.944722 |
Avg Qi | 5.047378 | 2.784788 | 2.914839 | 5.540549 | 4.334337 | 1.816430 | 10.335600 | 5.704745 |
Report Date | October 3, 2014 |
---|---|
Database Scale Factor | 100 |
Total Data Storage/Database Size | 0M |
Query Streams for Throughput Test | 5 |
Virt-H Power | 387,890.3 |
Virt-H Throughput | 433,066.6 |
Virt-H Composite Query-per-Hour Metric (Qph@100GB) | 409,856.5 |
Measurement Interval in Throughput Test (Ts) | 91.541000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 10/03/2014 15:07:19 | 10/03/2014 15:07:47 | 0:00:28 |
Stream 1 | 10/03/2014 15:07:48 | 10/03/2014 15:09:19 | 0:01:31 |
Stream 2 | 10/03/2014 15:07:48 | 10/03/2014 15:09:16 | 0:01:28 |
Stream 3 | 10/03/2014 15:07:48 | 10/03/2014 15:09:17 | 0:01:29 |
Stream 4 | 10/03/2014 15:07:48 | 10/03/2014 15:09:16 | 0:01:28 |
Stream 5 | 10/03/2014 15:07:48 | 10/03/2014 15:09:20 | 0:01:32 |
Refresh 0 | 10/03/2014 15:07:19 | 10/03/2014 15:07:22 | 0:00:03 |
10/03/2014 15:07:47 | 10/03/2014 15:07:48 | 0:00:01 | |
Refresh 1 | 10/03/2014 15:08:45 | 10/03/2014 15:08:59 | 0:00:14 |
Refresh 2 | 10/03/2014 15:07:49 | 10/03/2014 15:08:02 | 0:00:13 |
Refresh 3 | 10/03/2014 15:08:02 | 10/03/2014 15:08:17 | 0:00:15 |
Refresh 4 | 10/03/2014 15:08:17 | 10/03/2014 15:08:29 | 0:00:12 |
Refresh 5 | 10/03/2014 15:08:29 | 10/03/2014 15:08:45 | 0:00:16 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 2.081986 | 0.208487 | 0.902462 | 0.313160 | 1.312273 | 0.493157 | 0.926629 | 0.786345 |
Stream 1 | 2.755427 | 0.911578 | 3.618085 | 0.664407 | 3.740112 | 2.118189 | 4.738754 | 6.551446 |
Stream 2 | 4.189612 | 0.957921 | 5.267355 | 2.152479 | 6.068005 | 1.263380 | 4.251842 | 3.620160 |
Stream 3 | 4.708834 | 0.981651 | 2.411839 | 0.790955 | 4.384516 | 1.322670 | 2.641571 | 4.771831 |
Stream 4 | 3.739567 | 1.185884 | 2.863871 | 1.517891 | 5.946967 | 1.179960 | 3.840560 | 4.926325 |
Stream 5 | 5.258746 | 0.705228 | 3.460904 | 0.951328 | 4.530620 | 1.104500 | 3.226494 | 4.041142 |
Min Qi | 2.755427 | 0.705228 | 2.411839 | 0.664407 | 3.740112 | 1.104500 | 2.641571 | 3.620160 |
Max Qi | 5.258746 | 1.185884 | 5.267355 | 2.152479 | 6.068005 | 2.118189 | 4.738754 | 6.551446 |
Avg Qi | 4.130437 | 0.948452 | 3.524411 | 1.215412 | 4.934044 | 1.397740 | 3.739844 | 4.782181 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 3.226685 | 1.878227 | 1.802562 | 0.676499 | 3.145884 | 0.653129 | 0.963449 | 0.990524 |
Stream 1 | 8.842030 | 5.630466 | 5.728147 | 2.643227 | 9.615551 | 3.197855 | 4.676538 | 4.285251 |
Stream 2 | 9.508612 | 5.288044 | 4.319998 | 1.492915 | 9.431995 | 3.206360 | 3.859749 | 3.201996 |
Stream 3 | 10.480224 | 5.880274 | 4.517320 | 2.509405 | 6.913159 | 2.892479 | 6.408602 | 2.938061 |
Stream 4 | 8.824111 | 5.752413 | 5.997959 | 2.581237 | 8.954756 | 3.351951 | 2.420598 | 4.148455 |
Stream 5 | 4.905553 | 7.099111 | 5.121041 | 2.516020 | 9.354924 | 3.955638 | 4.389209 | 3.818902 |
Min Qi | 4.905553 | 5.288044 | 4.319998 | 1.492915 | 6.913159 | 2.892479 | 2.420598 | 2.938061 |
Max Qi | 10.480224 | 7.099111 | 5.997959 | 2.643227 | 9.615551 | 3.955638 | 6.408602 | 4.285251 |
Avg Qi | 8.512106 | 5.930062 | 5.136893 | 2.348561 | 8.854077 | 3.320857 | 4.350939 | 3.678533 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 1.405338 | 0.868313 | 0.806277 | 1.123366 | 1.314028 | 0.233214 | 2.590459 | 1.230242 |
Stream 1 | 5.191045 | 3.171244 | 3.403836 | 4.604523 | 3.721133 | 0.892096 | 7.136841 | 6.500452 |
Stream 2 | 6.282687 | 2.845465 | 3.024786 | 4.086546 | 3.530743 | 0.619683 | 9.263671 | 4.826173 |
Stream 3 | 6.040787 | 2.659766 | 2.787273 | 6.210077 | 3.902190 | 2.175417 | 7.974860 | 6.689780 |
Stream 4 | 4.978721 | 2.542674 | 3.518783 | 4.385571 | 3.906211 | 0.918752 | 6.303352 | 5.139326 |
Stream 5 | 5.208600 | 3.761975 | 3.682886 | 7.874493 | 5.017600 | 2.087150 | 7.999074 | 7.978154 |
Min Qi | 4.978721 | 2.542674 | 2.787273 | 4.086546 | 3.530743 | 0.619683 | 6.303352 | 4.826173 |
Max Qi | 6.282687 | 3.761975 | 3.682886 | 7.874493 | 5.017600 | 2.175417 | 9.263671 | 7.978154 |
Avg Qi | 5.540368 | 2.996225 | 3.283513 | 5.432242 | 4.015575 | 1.338620 | 7.735560 | 6.226777 |
Report Date | October 3, 2014 |
---|---|
Database Scale Factor | 1000 |
Total Data Storage/Database Size | 26M |
Query Streams for Throughput Test | 7 |
Virt-H Power | 136,744.5 |
Virt-H Throughput | 147,374.6 |
Virt-H Composite Query-per-Hour Metric (Qph@1000GB) | 141,960.1 |
Measurement Interval in Throughput Test (Ts) | 3,761.953000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 10/03/2014 09:18:42 | 10/03/2014 09:34:12 | 0:15:30 |
Stream 1 | 10/03/2014 09:34:43 | 10/03/2014 10:35:42 | 1:00:59 |
Stream 2 | 10/03/2014 09:34:43 | 10/03/2014 10:37:14 | 1:02:31 |
Stream 3 | 10/03/2014 09:34:43 | 10/03/2014 10:37:25 | 1:02:42 |
Stream 4 | 10/03/2014 09:34:43 | 10/03/2014 10:33:31 | 0:58:48 |
Stream 5 | 10/03/2014 09:34:43 | 10/03/2014 10:35:26 | 1:00:43 |
Stream 6 | 10/03/2014 09:34:43 | 10/03/2014 10:28:00 | 0:53:17 |
Stream 7 | 10/03/2014 09:34:43 | 10/03/2014 10:35:42 | 1:00:59 |
Refresh 0 | 10/03/2014 09:18:42 | 10/03/2014 09:19:27 | 0:00:45 |
10/03/2014 09:34:12 | 10/03/2014 09:34:42 | 0:00:30 | |
Refresh 1 | 10/03/2014 09:43:03 | 10/03/2014 09:43:38 | 0:00:35 |
Refresh 2 | 10/03/2014 09:34:43 | 10/03/2014 09:36:54 | 0:02:11 |
Refresh 3 | 10/03/2014 09:36:53 | 10/03/2014 09:38:39 | 0:01:46 |
Refresh 4 | 10/03/2014 09:38:39 | 10/03/2014 09:39:22 | 0:00:43 |
Refresh 5 | 10/03/2014 09:39:23 | 10/03/2014 09:41:09 | 0:01:46 |
Refresh 6 | 10/03/2014 09:41:09 | 10/03/2014 09:42:15 | 0:01:06 |
Refresh 7 | 10/03/2014 09:42:15 | 10/03/2014 09:43:02 | 0:00:47 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 104.488583 | 18.351559 | 24.631282 | 36.195531 | 36.319915 | 3.807790 | 22.750889 | 31.190630 |
Stream 1 | 209.323441 | 26.205435 | 59.637373 | 245.808484 | 60.699333 | 22.369379 | 289.435780 | 335.733425 |
Stream 2 | 109.134446 | 64.185831 | 96.131735 | 108.459418 | 310.273986 | 53.595127 | 152.242755 | 104.350098 |
Stream 3 | 73.321611 | 215.535408 | 69.543101 | 12.423757 | 64.445611 | 38.254747 | 122.952872 | 98.713213 |
Stream 4 | 110.875875 | 4.272757 | 78.697314 | 16.316807 | 59.746855 | 23.447211 | 353.190412 | 342.549908 |
Stream 5 | 41.972337 | 5.978707 | 60.784575 | 34.219229 | 42.372449 | 344.590640 | 146.186614 | 274.972270 |
Stream 6 | 115.760155 | 18.692078 | 58.493147 | 9.193234 | 49.831932 | 19.081395 | 60.603109 | 128.095501 |
Stream 7 | 58.601744 | 118.126585 | 297.327543 | 298.578268 | 714.284222 | 108.475250 | 91.868151 | 55.881029 |
Min Qi | 41.972337 | 4.272757 | 58.493147 | 9.193234 | 42.372449 | 19.081395 | 60.603109 | 55.881029 |
Max Qi | 209.323441 | 215.535408 | 297.327543 | 298.578268 | 714.284222 | 344.590640 | 353.190412 | 342.549908 |
Avg Qi | 102.712801 | 64.713829 | 102.944970 | 103.571314 | 185.950627 | 87.116250 | 173.782813 | 191.470778 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 41.777880 | 10.035063 | 16.125611 | 9.245638 | 209.443782 | 111.271310 | 37.821595 | 9.483838 |
Stream 1 | 244.243830 | 63.473338 | 207.741931 | 33.696956 | 561.057408 | 141.026049 | 126.818051 | 54.774792 |
Stream 2 | 189.297446 | 144.853756 | 56.292537 | 184.781273 | 501.330052 | 49.965102 | 107.736393 | 85.691079 |
Stream 3 | 231.060699 | 355.394713 | 43.483645 | 11.806590 | 555.445111 | 36.722686 | 251.241817 | 9.057850 |
Stream 4 | 227.371508 | 32.207115 | 108.880658 | 139.922550 | 532.697956 | 57.106583 | 159.198489 | 153.088913 |
Stream 5 | 416.113856 | 108.689389 | 62.847727 | 702.712683 | 622.906487 | 58.198961 | 89.707091 | 85.614769 |
Stream 6 | 228.019243 | 62.474213 | 88.227994 | 282.932978 | 432.387869 | 238.544027 | 61.486269 | 56.950548 |
Stream 7 | 230.564416 | 69.197517 | 130.708759 | 120.531103 | 551.112816 | 57.438478 | 82.256530 | 63.796403 |
Min Qi | 189.297446 | 32.207115 | 43.483645 | 11.806590 | 432.387869 | 36.722686 | 61.486269 | 9.057850 |
Max Qi | 416.113856 | 355.394713 | 207.741931 | 702.712683 | 622.906487 | 238.544027 | 251.241817 | 153.088913 |
Avg Qi | 252.381571 | 119.470006 | 99.740464 | 210.912019 | 536.705386 | 91.285984 | 125.492091 | 72.710622 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 22.897349 | 47.870269 | 12.735580 | 25.982194 | 46.091766 | 6.623306 | 45.120559 | 30.016788 |
Stream 1 | 123.444839 | 22.212194 | 647.523826 | 97.431531 | 81.592165 | 4.573040 | 21.068225 | 14.486185 |
Stream 2 | 80.853865 | 622.651044 | 288.656211 | 336.409076 | 70.925079 | 33.578052 | 82.910543 | 48.001583 |
Stream 3 | 392.340812 | 84.967695 | 57.181935 | 473.720060 | 497.262620 | 66.966740 | 54.778284 | 50.940094 |
Stream 4 | 97.069440 | 301.705125 | 338.035788 | 258.992426 | 103.699408 | 28.750257 | 23.858757 | 13.626079 |
Stream 5 | 69.882110 | 34.277914 | 146.031938 | 179.656129 | 104.788154 | 10.836148 | 54.319823 | 52.077352 |
Stream 6 | 141.310431 | 247.242904 | 94.392791 | 702.775460 | 80.142930 | 19.969889 | 46.027410 | 19.136271 |
Stream 7 | 89.018281 | 51.105998 | 281.234432 | 79.046122 | 84.341517 | 26.221892 | 33.169666 | 13.309634 |
Min Qi | 69.882110 | 22.212194 | 57.181935 | 79.046122 | 70.925079 | 4.573040 | 21.068225 | 13.309634 |
Max Qi | 392.340812 | 622.651044 | 647.523826 | 702.775460 | 497.262620 | 66.966740 | 82.910543 | 52.077352 |
Avg Qi | 141.988540 | 194.880411 | 264.722417 | 304.004401 | 146.107410 | 27.270860 | 45.161815 | 30.225314 |
Report Date | October 3, 2014 |
---|---|
Database Scale Factor | 1000 |
Total Data Storage/Database Size | 26M |
Query Streams for Throughput Test | 7 |
Virt-H Power | 199,652.0 |
Virt-H Throughput | 125,161.1 |
Virt-H Composite Query-per-Hour Metric (Qph@1000GB) | 158,078.0 |
Measurement Interval in Throughput Test (Ts) | 4,429.608000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 10/03/2014 10:37:29 | 10/03/2014 10:52:26 | 0:14:57 |
Stream 1 | 10/03/2014 10:52:35 | 10/03/2014 12:05:19 | 1:12:44 |
Stream 2 | 10/03/2014 10:52:35 | 10/03/2014 12:06:25 | 1:13:50 |
Stream 3 | 10/03/2014 10:52:35 | 10/03/2014 12:03:08 | 1:10:33 |
Stream 4 | 10/03/2014 10:52:35 | 10/03/2014 12:05:20 | 1:12:45 |
Stream 5 | 10/03/2014 10:52:35 | 10/03/2014 11:57:40 | 1:05:05 |
Stream 6 | 10/03/2014 10:52:35 | 10/03/2014 12:05:28 | 1:12:53 |
Stream 7 | 10/03/2014 10:52:35 | 10/03/2014 12:05:25 | 1:12:50 |
Refresh 0 | 10/03/2014 10:37:29 | 10/03/2014 10:37:52 | 0:00:23 |
10/03/2014 10:52:25 | 10/03/2014 10:52:34 | 0:00:09 | |
Refresh 1 | 10/03/2014 11:01:44 | 10/03/2014 11:02:29 | 0:00:45 |
Refresh 2 | 10/03/2014 10:52:35 | 10/03/2014 10:54:50 | 0:02:15 |
Refresh 3 | 10/03/2014 10:54:50 | 10/03/2014 10:57:02 | 0:02:12 |
Refresh 4 | 10/03/2014 10:57:05 | 10/03/2014 10:58:47 | 0:01:42 |
Refresh 5 | 10/03/2014 10:58:47 | 10/03/2014 10:59:46 | 0:00:59 |
Refresh 6 | 10/03/2014 10:59:45 | 10/03/2014 11:00:38 | 0:00:53 |
Refresh 7 | 10/03/2014 11:00:39 | 10/03/2014 11:01:44 | 0:01:05 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 34.105419 | 1.439089 | 9.802183 | 2.033956 | 10.525742 | 3.356152 | 23.953729 | 36.199533 |
Stream 1 | 26.598252 | 150.572833 | 41.930330 | 86.870320 | 50.604856 | 201.001372 | 61.638366 | 244.013359 |
Stream 2 | 50.129895 | 102.219282 | 12.380935 | 102.319615 | 62.577229 | 43.454392 | 891.076608 | 407.640626 |
Stream 3 | 269.947278 | 53.172724 | 54.649973 | 11.460062 | 66.695722 | 17.336698 | 63.371232 | 91.158050 |
Stream 4 | 41.149221 | 22.520836 | 28.707973 | 509.984321 | 68.916549 | 17.525025 | 702.191490 | 666.450230 |
Stream 5 | 59.179045 | 30.734442 | 99.504351 | 11.145990 | 101.334340 | 21.660836 | 74.625589 | 535.160207 |
Stream 6 | 225.105215 | 55.567328 | 46.749707 | 554.474507 | 215.657091 | 54.362551 | 72.960653 | 442.194302 |
Stream 7 | 220.993226 | 28.528230 | 47.543365 | 336.191006 | 308.931194 | 9.767397 | 850.258452 | 66.121298 |
Min Qi | 26.598252 | 22.520836 | 12.380935 | 11.145990 | 50.604856 | 9.767397 | 61.638366 | 66.121298 |
Max Qi | 269.947278 | 150.572833 | 99.504351 | 554.474507 | 308.931194 | 201.001372 | 891.076608 | 666.450230 |
Avg Qi | 127.586019 | 63.330811 | 47.352376 | 230.349403 | 124.959569 | 52.158324 | 388.017484 | 350.391153 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 50.439615 | 9.287196 | 15.892947 | 7.112715 | 250.527755 | 131.478131 | 54.458992 | 10.525842 |
Stream 1 | 420.919329 | 317.402771 | 101.818338 | 403.213385 | 724.539887 | 160.669174 | 65.374584 | 28.563034 |
Stream 2 | 464.378760 | 210.938167 | 23.395678 | 545.086468 | 736.005716 | 54.680686 | 398.880053 | 34.018918 |
Stream 3 | 350.083270 | 321.781561 | 48.652019 | 435.954962 | 378.872739 | 100.588804 | 289.350342 | 190.140640 |
Stream 4 | 306.265994 | 249.621982 | 79.280220 | 221.255121 | 348.932746 | 49.555802 | 100.062439 | 61.368814 |
Stream 5 | 511.923087 | 133.018420 | 134.199065 | 9.655693 | 662.658830 | 104.380635 | 82.847242 | 59.952271 |
Stream 6 | 578.362701 | 61.221715 | 145.613349 | 47.957006 | 621.993889 | 256.150595 | 77.124777 | 91.163005 |
Stream 7 | 418.450091 | 391.818564 | 29.360218 | 17.236628 | 761.850888 | 31.952329 | 50.393082 | 27.530882 |
Min Qi | 306.265994 | 61.221715 | 23.395678 | 9.655693 | 348.932746 | 31.952329 | 50.393082 | 27.530882 |
Max Qi | 578.362701 | 391.818564 | 145.613349 | 545.086468 | 761.850888 | 256.150595 | 398.880053 | 190.140640 |
Avg Qi | 435.769033 | 240.829026 | 80.331270 | 240.051323 | 604.979242 | 108.282575 | 152.004646 | 70.391081 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 22.444111 | 37.978532 | 13.347320 | 26.553364 | 115.511143 | 7.670304 | 22.771613 | 8.761026 |
Stream 1 | 329.153807 | 19.198590 | 258.455295 | 556.256015 | 99.647793 | 14.878746 | 32.803289 | 8.771923 |
Stream 2 | 76.940373 | 74.916489 | 75.246897 | 16.035355 | 14.403643 | 32.348500 | 91.981362 | 41.426540 |
Stream 3 | 88.918404 | 238.858707 | 221.257060 | 688.441713 | 247.669761 | 5.345632 | 70.780594 | 49.352955 |
Stream 4 | 497.105081 | 167.874781 | 67.668514 | 76.820831 | 78.585717 | 3.655421 | 73.165786 | 29.401670 |
Stream 5 | 309.991618 | 123.023557 | 380.801141 | 347.055909 | 93.478502 | 18.351491 | 33.338814 | 12.557542 |
Stream 6 | 57.200926 | 154.489850 | 386.007137 | 103.558355 | 32.676369 | 92.863316 | 35.576966 | 14.061801 |
Stream 7 | 160.332088 | 46.934177 | 340.957970 | 84.479720 | 78.985110 | 60.568796 | 44.362737 | 8.831746 |
Min Qi | 57.200926 | 19.198590 | 67.668514 | 16.035355 | 14.403643 | 3.655421 | 32.803289 | 8.771923 |
Max Qi | 497.105081 | 238.858707 | 386.007137 | 688.441713 | 247.669761 | 92.863316 | 91.981362 | 49.352955 |
Avg Qi | 217.091757 | 117.899450 | 247.199145 | 267.521128 | 92.206699 | 32.573129 | 54.572793 | 23.486311 |
To be continued...
There are parts of TPC-H which have an embarrassingly parallel nature, like Q1 and Q7. There are parts that are almost as easy, like Q14, Q17, Q19, and Q21, where there is a big scan and a selective hash join with a hash table small enough to replicate everywhere. The scan scales linearly; building the hash does not, since it is done at single-server speed (once in each process). Some queries like Q9 and Q13 end up doing a big cross-partition join which runs into communication overheads.
This is our first look at how performance behaves with bigger data and a larger platform. The results shown here are interesting but are not final. I bet I can do better; by how much is what we'll find out soon enough.
We will here compare a 1000G setup on my desktop and a 3000G setup on CWI's Scilens cluster. The former is 2 boxes of dual Xeon E5-2630; the latter is 8 boxes of dual Xeon E5-2650v2. Everything runs from memory, and both have QDR InfiniBand interconnect. Counting cores and clock, the CWI cluster is 6x larger.
As a rough approximation, for the worst queries, 6x the gear runs 3x the data in the same amount of real time. The 1000G setup has near full platform utilization and the 3000G setup has about half platform utilization. In both cases, running two instances of the same query at the same time takes twice as long.
We use Q9 for this study. The plan makes a hash table of part with 1/14 of all parts, replicating it to all processes. Then there is a hash table of partsupp with a key of ps_partkey, ps_suppkey and a dependent of ps_supplycost. This is much larger than the part hash table and is therefore partitioned on ps_partkey. The build is for 1/14th of partsupp. Then there is a scan of lineitem filtered by the part hash table; then a cross-partition join to the partsupp hash table; then a cross-partition join to orders, this time by index; then a hash join on a replicated hash table of supplier; then nation; then aggregation. The aggregation is done in each slice; then the slices are added up at the end.
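For reference, the shape being planned here is essentially the standard TPC-H Q9 text, reproduced approximately below; the color in the p_name predicate is a substitution parameter and determines the fraction of part selected:

```sql
-- Approximately the standard TPC-H Q9 ("product type profit measure") query.
SELECT nation, o_year, SUM(amount) AS sum_profit
FROM (
    SELECT n_name AS nation,
           EXTRACT(YEAR FROM o_orderdate) AS o_year,
           l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
      FROM part, supplier, lineitem, partsupp, orders, nation
     WHERE s_suppkey   = l_suppkey
       AND ps_suppkey  = l_suppkey
       AND ps_partkey  = l_partkey
       AND p_partkey   = l_partkey
       AND o_orderkey  = l_orderkey
       AND s_nationkey = n_nationkey
       AND p_name LIKE '%green%'   -- the color is a substitution parameter
) profit
GROUP BY nation, o_year
ORDER BY nation, o_year DESC;
```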
The plan could be made better by one fewer partition crossing. Now there is a crossing from l_orderkey to l_partkey and back to o_orderkey. This would not be so if the cost model knew that the partsupp always hits. The cost model thinks it hits 1/14 of the time, because it does not know that the selection on the build is exactly the same as on the probe.
For the present purposes, the extra crossing just serves to make the matter of interest more visible.
So, for the 1000G setup, we have 43.6 seconds (s) and
Cluster 4 nodes, 44 s. 459 m/s 119788 KB/s 3120% cpu 0% read 19% clw threads 1r 0w 0i buffers 17622126 68 d 0 w 0 pfs
For the 3000G setup, we have 49.9 s and
Cluster 16 nodes, 50 s. 49389 m/s 1801815 KB/s 7283% cpu 0% read 18% clw threads 1r 0w 0i buffers 135122893 15895255 d 0 w 17 pfs
The platform utilization on the small system is better, at 31/48 (running/total threads); the large one has 73/256.
The large case is clearly network bound. If this were for CPU only, it should be done in half the time it takes the small system to do 1000G.
We confirm this by looking at write wait: 3940 seconds of thread time blocked on write over 50s of real time. The figures on the small one are 3.9s of thread time blocked for 39s of real time. The data transfer on the large one is 93 GB.
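Dividing blocked thread-time by real time gives the average number of threads stuck in a write at any moment, which makes the contrast plain (thread counts as reported above):

$$ \text{large: } \frac{3940\ \text{s}}{50\ \text{s}} \approx 79\ \text{of}\ 256\ \text{threads}, \qquad \text{small: } \frac{3.9\ \text{s}}{39\ \text{s}} \approx 0.1\ \text{of}\ 48\ \text{threads} $$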
How to block less? One idea would be to write less. So we try compression; there is a Google snappy-based message compression option in Virtuoso.
We now get 39.6 s and
Cluster 16 nodes, 40 s. 65161 m/s 1239922 KB/s 10201% cpu 0% read 21% clw threads 1r 0w 0i buffers 52828440 172 d 0 w 0 pfs
The write block time is 397 s of thread time over 39 s of real time, 10x better. The data transfer is 50.9 GB after compression. Snappy is somewhat effective for compression and very fast; in CPU profile, it is under 3% of Q9 on the small system. Gains on the small system are less, though, since blocking is not a big issue to start with.
This is still not full platform utilization. But if the data transfer is further cut in half by a better plan, the situation will be quite good. Now we have 102/256 threads running, meaning that there could be another 40-50% of throughput to be had. The last 128 threads are second threads of a core, so count for roughly 30% of a real core.
The main cluster-specific operation is a send from one to many. This is now done by formulating the message to each recipient in a chain of string buffers; then, after all the messages are prepared, they are optionally compressed and sent to their recipients. This is needlessly simple: Compression could proceed whenever there is a would-block situation on writing. Once all the compression is done, a blocked write should switch to another recipient, and only after all recipients are in a would-block state should the thread call select() on all the descriptors and block on them collectively. There is a piece of code to this effect, but it is not now being used. It has been seen to add no value in small cases, but could be useful here.
The IB fabric has been seen to do 1.8 GB/s bidirectionally on multiple independent point-to-point TCP links. This is about half the nominal 4 GB/s (40 Gbit/s with 8b/10b encoding). So the aggregate throughputs that we see here are nowhere near the nominal spec of the network. Lower-level interfaces and the occasional busy wait on the reading end could be tried to some advantage. We have not tried 10GbE either; but if that works at nominal speed, then 10GbE should also be good enough. We will try this at Amazon in due time.
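The nominal figure follows from the QDR signaling rate and the encoding overhead:

$$ 40\ \text{Gbit/s} \times \tfrac{8}{10} = 32\ \text{Gbit/s} = 4\ \text{GB/s} $$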
In the meantime, there is a 3000G test made at the CWI cluster without message compression. The score is about 4x that of the single server at 300G using the same hardware. The run is with approximately half platform utilization. There are three runs of power plus throughput, the first run being cold.
Run | Power | Throughput | Composite |
---|---|---|---|
Run 1 | 305,881.5 | 1,072,411.9 | 572,739.8 |
Run 2 | 1,292,085.1 | 1,179,391.6 | 1,234,453.1 |
Run 3 | 1,178,534.1 | 1,092,936.2 | 1,134,928.4 |
The numerical quantities summaries follow. One problem of the run is a high peak of query memory consumption leading to slowdown. Some parts should probably be done in multiple passes to keep the peak lower and not run into swapping. The details will have to be sorted out. This is a demonstration of capability; the perfected accomplishment is to follow.
Report Date | September 29, 2014 |
---|---|
Database Scale Factor | 3000 |
Query Streams for Throughput Test | 8 |
Virt-H Power | 305,881.5 |
Virt-H Throughput | 1,072,411.9 |
Virt-H Composite Query-per-Hour Metric (Qph@3000GB) | 572,739.8 |
Measurement Interval in Throughput Test (Ts) | 1,772.554000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 09/29/2014 12:54:52 | 09/29/2014 13:31:17 | 0:36:25 |
Stream 1 | 09/29/2014 13:31:24 | 09/29/2014 13:59:24 | 0:28:00 |
Stream 2 | 09/29/2014 13:31:24 | 09/29/2014 13:58:59 | 0:27:35 |
Stream 3 | 09/29/2014 13:31:24 | 09/29/2014 13:58:29 | 0:27:05 |
Stream 4 | 09/29/2014 13:31:24 | 09/29/2014 13:58:52 | 0:27:28 |
Stream 5 | 09/29/2014 13:31:24 | 09/29/2014 14:00:06 | 0:28:42 |
Stream 6 | 09/29/2014 13:31:24 | 09/29/2014 13:58:18 | 0:26:54 |
Stream 7 | 09/29/2014 13:31:24 | 09/29/2014 13:59:25 | 0:28:01 |
Stream 8 | 09/29/2014 13:31:24 | 09/29/2014 13:58:50 | 0:27:26 |
Refresh 0 | 09/29/2014 12:54:52 | 09/29/2014 12:56:59 | 0:02:07 |
09/29/2014 13:31:17 | 09/29/2014 13:31:23 | 0:00:06 | |
Refresh 1 | 09/29/2014 14:00:38 | 09/29/2014 14:01:11 | 0:00:33 |
Refresh 2 | 09/29/2014 13:31:25 | 09/29/2014 13:36:57 | 0:05:32 |
Refresh 3 | 09/29/2014 13:36:56 | 09/29/2014 13:47:02 | 0:10:06 |
Refresh 4 | 09/29/2014 13:47:03 | 09/29/2014 13:51:40 | 0:04:37 |
Refresh 5 | 09/29/2014 13:51:42 | 09/29/2014 13:56:40 | 0:04:58 |
Refresh 6 | 09/29/2014 13:56:40 | 09/29/2014 13:59:25 | 0:02:45 |
Refresh 7 | 09/29/2014 13:59:25 | 09/29/2014 14:00:10 | 0:00:45 |
Refresh 8 | 09/29/2014 14:00:11 | 09/29/2014 14:00:37 | 0:00:26 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 601.576975 | 90.803782 | 108.725110 | 177.112667 | 171.995572 | 2.098138 | 15.768311 | 152.511444 |
Stream 1 | 13.310341 | 32.722946 | 125.551415 | 1.912836 | 46.041675 | 13.294214 | 85.345068 | 165.424288 |
Stream 2 | 19.425885 | 9.248670 | 150.855556 | 7.085737 | 88.445566 | 10.490432 | 49.318554 | 322.500839 |
Stream 3 | 30.534391 | 14.273478 | 100.987791 | 59.341763 | 46.442443 | 9.613795 | 64.186196 | 146.324186 |
Stream 4 | 28.211213 | 37.134522 | 64.189335 | 10.931513 | 100.610673 | 9.929866 | 112.270530 | 108.489951 |
Stream 5 | 29.226411 | 18.132589 | 95.245160 | 63.100068 | 115.663908 | 6.151231 | 46.251309 | 127.742471 |
Stream 6 | 30.750930 | 20.888658 | 108.894177 | 55.168565 | 82.016828 | 69.451493 | 65.161517 | 103.697733 |
Stream 7 | 13.462570 | 18.033847 | 32.065492 | 78.910373 | 202.998301 | 10.688279 | 47.167022 | 139.601948 |
Stream 8 | 24.354314 | 16.711503 | 112.008551 | 8.307098 | 126.849630 | 7.127605 | 51.083118 | 98.648077 |
Min Qi | 13.310341 | 9.248670 | 32.065492 | 1.912836 | 46.041675 | 6.151231 | 46.251309 | 98.648077 |
Max Qi | 30.750930 | 37.134522 | 150.855556 | 78.910373 | 202.998301 | 69.451493 | 112.270530 | 322.500839 |
Avg Qi | 23.659507 | 20.893277 | 98.724685 | 35.594744 | 101.133628 | 17.093364 | 65.097914 | 151.553687 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 92.991259 | 5.175922 | 42.238393 | 29.239879 | 367.805534 | 3.604910 | 15.557396 | 11.650267 |
Stream 1 | 149.502128 | 30.197806 | 50.786184 | 217.190836 | 283.545905 | 11.653171 | 73.321150 | 116.860455 |
Stream 2 | 245.783668 | 22.278841 | 50.578731 | 36.301810 | 181.405269 | 32.236754 | 57.631764 | 61.540533 |
Stream 3 | 377.782738 | 24.129319 | 84.097657 | 10.959661 | 171.698669 | 8.973519 | 54.532180 | 45.527142 |
Stream 4 | 341.148908 | 74.358770 | 85.782399 | 43.116347 | 151.146233 | 22.870727 | 74.439693 | 51.871535 |
Stream 5 | 72.259919 | 11.424035 | 79.310504 | 9.833135 | 562.871920 | 14.961209 | 127.861874 | 55.377721 |
Stream 6 | 373.301225 | 41.379753 | 81.983260 | 9.373200 | 95.039317 | 19.071346 | 76.159452 | 48.324504 |
Stream 7 | 449.871952 | 16.099152 | 48.047940 | 8.559784 | 211.094730 | 10.569071 | 26.710228 | 72.571454 |
Stream 8 | 395.771006 | 33.537585 | 54.850876 | 141.526389 | 153.763316 | 12.997092 | 127.961975 | 57.100346 |
Min Qi | 72.259919 | 11.424035 | 48.047940 | 8.559784 | 95.039317 | 8.973519 | 26.710228 | 45.527142 |
Max Qi | 449.871952 | 74.358770 | 85.782399 | 217.190836 | 562.871920 | 32.236754 | 127.961975 | 116.860455 |
Avg Qi | 300.677693 | 31.675658 | 66.929694 | 59.607645 | 226.320670 | 16.666611 | 77.327289 | 63.646711 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 12.230334 | 70.991261 | 33.092797 | 17.517230 | 15.798438 | 19.743562 | 127.494687 | 5.893471 |
Stream 1 | 27.550293 | 14.970857 | 16.442806 | 111.138612 | 68.214095 | 7.884782 | 27.109441 | 6.087067 |
Stream 2 | 43.277918 | 12.748690 | 22.681844 | 92.835566 | 84.416610 | 14.661934 | 151.094498 | 153.285076 |
Stream 3 | 129.696125 | 13.435663 | 14.674499 | 129.179966 | 39.176513 | 6.286296 | 181.596838 | 416.052710 |
Stream 4 | 110.348816 | 7.080225 | 21.051910 | 85.758973 | 65.130356 | 7.292999 | 123.386514 | 151.000786 |
Stream 5 | 43.365006 | 9.847612 | 32.881770 | 94.752284 | 67.788314 | 9.035439 | 72.539334 | 223.967821 |
Stream 6 | 34.534280 | 36.347298 | 27.849276 | 122.736244 | 51.447492 | 25.051058 | 80.452175 | 84.519426 |
Stream 7 | 48.021860 | 30.594474 | 22.522426 | 99.245893 | 73.076698 | 7.260729 | 38.585852 | 5.697277 |
Stream 8 | 29.484201 | 12.368769 | 40.344043 | 84.137820 | 30.813313 | 4.856991 | 22.196547 | 4.600057 |
Min Qi | 27.550293 | 7.080225 | 14.674499 | 84.137820 | 30.813313 | 4.856991 | 22.196547 | 4.600057 |
Max Qi | 129.696125 | 36.347298 | 40.344043 | 129.179966 | 84.416610 | 25.051058 | 181.596838 | 416.052710 |
Avg Qi | 58.284812 | 17.174198 | 24.806072 | 102.473170 | 60.007924 | 10.291279 | 87.120150 | 130.651277 |
Report Date | September 29, 2014 |
---|---|
Database Scale Factor | 3000 |
Query Streams for Throughput Test | 8 |
Virt-H Power | 1292085.1 |
Virt-H Throughput | 1179391.6 |
Virt-H Composite Query-per-Hour Metric (Qph@3000GB) | 1234453.1 |
Measurement Interval in Throughput Test (Ts) | 1611.779000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 09/29/2014 14:01:15 | 09/29/2014 14:06:48 | 0:05:33 |
Stream 1 | 09/29/2014 14:06:53 | 09/29/2014 14:30:22 | 0:23:29 |
Stream 2 | 09/29/2014 14:06:53 | 09/29/2014 14:32:30 | 0:25:37 |
Stream 3 | 09/29/2014 14:06:53 | 09/29/2014 14:31:23 | 0:24:30 |
Stream 4 | 09/29/2014 14:06:53 | 09/29/2014 14:31:34 | 0:24:41 |
Stream 5 | 09/29/2014 14:06:53 | 09/29/2014 14:32:53 | 0:26:00 |
Stream 6 | 09/29/2014 14:06:53 | 09/29/2014 14:29:51 | 0:22:58 |
Stream 7 | 09/29/2014 14:06:53 | 09/29/2014 14:31:34 | 0:24:41 |
Stream 8 | 09/29/2014 14:06:53 | 09/29/2014 14:30:35 | 0:23:42 |
Refresh 0 | 09/29/2014 14:01:15 | 09/29/2014 14:01:35 | 0:00:20 |
09/29/2014 14:06:49 | 09/29/2014 14:06:53 | 0:00:04 | |
Refresh 1 | 09/29/2014 14:33:16 | 09/29/2014 14:33:45 | 0:00:29 |
Refresh 2 | 09/29/2014 14:06:55 | 09/29/2014 14:12:28 | 0:05:33 |
Refresh 3 | 09/29/2014 14:12:29 | 09/29/2014 14:21:55 | 0:09:26 |
Refresh 4 | 09/29/2014 14:21:55 | 09/29/2014 14:27:40 | 0:05:45 |
Refresh 5 | 09/29/2014 14:27:43 | 09/29/2014 14:31:14 | 0:03:31 |
Refresh 6 | 09/29/2014 14:31:14 | 09/29/2014 14:31:51 | 0:00:37 |
Refresh 7 | 09/29/2014 14:31:51 | 09/29/2014 14:32:52 | 0:01:01 |
Refresh 8 | 09/29/2014 14:32:52 | 09/29/2014 14:33:16 | 0:00:24 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 9.451169 | 3.644118 | 18.419151 | 1.404395 | 15.740525 | 2.085038 | 15.171847 | 25.400834 |
Stream 1 | 19.558041 | 6.607300 | 85.774410 | 4.503525 | 81.448472 | 11.976129 | 92.140470 | 145.743853 |
Stream 2 | 31.042019 | 7.877299 | 71.958033 | 8.862111 | 142.452144 | 18.489193 | 81.003310 | 85.856529 |
Stream 3 | 38.833612 | 12.440326 | 86.063103 | 7.165120 | 84.707025 | 16.931531 | 100.442710 | 122.411252 |
Stream 4 | 15.751913 | 33.026762 | 50.457193 | 7.064220 | 114.130257 | 5.992556 | 66.035959 | 84.596973 |
Stream 5 | 18.462884 | 28.047942 | 110.690543 | 16.566547 | 104.403789 | 5.303453 | 72.552640 | 402.383383 |
Stream 6 | 17.858339 | 33.988800 | 110.431091 | 7.238431 | 72.229953 | 16.850955 | 68.231546 | 180.601000 |
Stream 7 | 23.055572 | 17.044813 | 96.105520 | 8.941132 | 171.130879 | 8.423100 | 70.634541 | 147.261648 |
Stream 8 | 19.840798 | 13.860740 | 74.961175 | 16.171566 | 56.165875 | 5.904921 | 47.646217 | 125.991819 |
Min Qi | 15.751913 | 6.607300 | 50.457193 | 4.503525 | 56.165875 | 5.303453 | 47.646217 | 84.596973 |
Max Qi | 38.833612 | 33.988800 | 110.690543 | 16.566547 | 171.130879 | 18.489193 | 100.442710 | 402.383383 |
Avg Qi | 23.050397 | 19.111748 | 85.805134 | 9.564082 | 103.333549 | 11.233980 | 74.835924 | 161.855807 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 54.766945 | 5.551163 | 29.216632 | 3.035008 | 52.816902 | 3.346243 | 15.767022 | 10.066112 |
Stream 1 | 130.666380 | 9.658277 | 49.332720 | 103.036705 | 194.520370 | 12.166344 | 65.144599 | 97.158571 |
Stream 2 | 254.754936 | 22.605298 | 38.102466 | 21.121168 | 300.467330 | 12.262318 | 108.203491 | 50.696657 |
Stream 3 | 283.761567 | 19.327164 | 73.414574 | 7.431651 | 183.121904 | 12.573854 | 73.814766 | 46.802493 |
Stream 4 | 290.341947 | 57.452026 | 58.354221 | 13.066162 | 189.263163 | 18.998781 | 121.269774 | 54.831406 |
Stream 5 | 81.787025 | 8.410538 | 79.822552 | 16.005077 | 190.730342 | 21.697136 | 100.456487 | 46.744884 |
Stream 6 | 202.558515 | 39.360009 | 74.519981 | 15.960756 | 137.321631 | 26.583824 | 57.537668 | 60.758997 |
Stream 7 | 226.790801 | 44.175536 | 73.992368 | 7.561897 | 182.853851 | 17.597471 | 31.128055 | 44.389893 |
Stream 8 | 275.423934 | 21.980040 | 60.538239 | 39.736622 | 173.574795 | 58.786316 | 95.124912 | 25.564108 |
Min Qi | 81.787025 | 8.410538 | 38.102466 | 7.431651 | 137.321631 | 12.166344 | 31.128055 | 25.564108 |
Max Qi | 290.341947 | 57.452026 | 79.822552 | 103.036705 | 300.467330 | 58.786316 | 121.269774 | 97.158571 |
Avg Qi | 218.260638 | 27.871111 | 63.509640 | 27.990005 | 193.981673 | 22.583255 | 81.584969 | 53.368376 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 13.620157 | 2.288504 | 4.166807 | 16.468447 | 9.991810 | 1.101775 | 20.152227 | 4.294680 |
Stream 1 | 44.026143 | 31.720525 | 25.684461 | 134.254716 | 30.797008 | 9.568594 | 24.328205 | 4.319533 |
Stream 2 | 40.283148 | 9.970277 | 29.731019 | 133.083785 | 29.322194 | 8.859556 | 73.251098 | 249.850045 |
Stream 3 | 44.288244 | 18.914661 | 38.162762 | 144.458624 | 22.556235 | 6.184842 | 117.267234 | 445.700238 |
Stream 4 | 67.147744 | 6.649451 | 27.876825 | 59.226248 | 69.373248 | 44.478703 | 61.381724 | 282.608075 |
Stream 5 | 36.403227 | 12.226129 | 21.997683 | 95.912670 | 44.219799 | 21.117974 | 106.473817 | 97.896971 |
Stream 6 | 42.114038 | 30.805969 | 25.929027 | 51.658733 | 26.475662 | 34.816500 | 31.309953 | 5.608395 |
Stream 7 | 48.601889 | 18.708127 | 18.893532 | 132.558026 | 50.476383 | 12.309402 | 22.661371 | 37.610815 |
Stream 8 | 34.413417 | 34.709883 | 37.058335 | 121.710608 | 44.676485 | 9.449332 | 19.311945 | 4.420232 |
Min Qi | 34.413417 | 6.649451 | 18.893532 | 51.658733 | 22.556235 | 6.184842 | 19.311945 | 4.319533 |
Max Qi | 67.147744 | 34.709883 | 38.162762 | 144.458624 | 69.373248 | 44.478703 | 117.267234 | 445.700238 |
Avg Qi | 44.659731 | 20.463128 | 28.166705 | 109.107926 | 39.737127 | 18.348113 | 56.998168 | 141.001788 |
Report Date | September 29, 2014 |
---|---|
Database Scale Factor | 3000 |
Query Streams for Throughput Test | 8 |
Virt-H Power | 1178534.1 |
Virt-H Throughput | 1092936.2 |
Virt-H Composite Query-per-Hour Metric (Qph@3000GB) | 1134928.4 |
Measurement Interval in Throughput Test (Ts) | 1739.269000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 09/29/2014 14:33:48 | 09/29/2014 14:40:59 | 0:07:11 |
Stream 1 | 09/29/2014 14:41:04 | 09/29/2014 15:10:02 | 0:28:58 |
Stream 2 | 09/29/2014 14:41:04 | 09/29/2014 15:09:07 | 0:28:03 |
Stream 3 | 09/29/2014 14:41:04 | 09/29/2014 15:09:17 | 0:28:13 |
Stream 4 | 09/29/2014 14:41:04 | 09/29/2014 15:09:55 | 0:28:51 |
Stream 5 | 09/29/2014 14:41:04 | 09/29/2014 15:09:39 | 0:28:35 |
Stream 6 | 09/29/2014 14:41:04 | 09/29/2014 15:09:46 | 0:28:42 |
Stream 7 | 09/29/2014 14:41:04 | 09/29/2014 15:09:58 | 0:28:54 |
Stream 8 | 09/29/2014 14:41:04 | 09/29/2014 15:08:58 | 0:27:54 |
Refresh 0 | 09/29/2014 14:33:48 | 09/29/2014 14:34:07 | 0:00:19 |
09/29/2014 14:40:59 | 09/29/2014 14:41:04 | 0:00:05 | |
Refresh 1 | 09/29/2014 15:06:57 | 09/29/2014 15:09:49 | 0:02:52 |
Refresh 2 | 09/29/2014 14:41:05 | 09/29/2014 14:47:39 | 0:06:34 |
Refresh 3 | 09/29/2014 14:47:40 | 09/29/2014 14:56:46 | 0:09:06 |
Refresh 4 | 09/29/2014 14:56:49 | 09/29/2014 15:03:19 | 0:06:30 |
Refresh 5 | 09/29/2014 15:03:24 | 09/29/2014 15:06:45 | 0:03:21 |
Refresh 6 | 09/29/2014 15:06:46 | 09/29/2014 15:06:49 | 0:00:03 |
Refresh 7 | 09/29/2014 15:06:50 | 09/29/2014 15:06:53 | 0:00:03 |
Refresh 8 | 09/29/2014 15:06:53 | 09/29/2014 15:10:04 | 0:03:11 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 9.393632 | 5.001910 | 17.053567 | 1.427500 | 17.813839 | 2.230451 | 13.884490 | 25.610995 |
Stream 1 | 12.971454 | 9.383520 | 94.257760 | 1.603106 | 127.940946 | 20.791892 | 78.869819 | 138.521273 |
Stream 2 | 21.428177 | 31.431513 | 96.366083 | 5.611843 | 58.394596 | 11.279502 | 47.114473 | 407.135077 |
Stream 3 | 23.377920 | 37.474814 | 83.640621 | 9.152178 | 71.186158 | 11.001543 | 46.763758 | 110.015662 |
Stream 4 | 49.580860 | 31.979940 | 87.662950 | 8.983661 | 68.052295 | 14.367631 | 59.266063 | 301.788652 |
Stream 5 | 13.483836 | 20.203772 | 391.980128 | 12.505446 | 77.966993 | 10.487869 | 52.989448 | 226.837637 |
Stream 6 | 38.104903 | 21.271630 | 84.689348 | 8.626460 | 86.620802 | 11.981171 | 69.182098 | 111.810485 |
Stream 7 | 20.243617 | 12.298692 | 99.547203 | 6.020951 | 151.584400 | 17.528287 | 62.037348 | 101.023802 |
Stream 8 | 22.808294 | 17.583072 | 59.180595 | 5.618565 | 123.108771 | 11.477376 | 42.485363 | 92.035709 |
Min Qi | 12.971454 | 9.383520 | 59.180595 | 1.603106 | 58.394596 | 10.487869 | 42.485363 | 92.035709 |
Max Qi | 49.580860 | 37.474814 | 391.980128 | 12.505446 | 151.584400 | 20.791892 | 78.869819 | 407.135077 |
Avg Qi | 25.249883 | 22.703369 | 124.665586 | 7.265276 | 95.606870 | 13.614409 | 57.338546 | 186.146037 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 146.487681 | 6.798942 | 29.834475 | 3.177879 | 55.067866 | 4.503738 | 17.215591 | 9.333281 |
Stream 1 | 177.581204 | 44.178095 | 69.746005 | 12.306166 | 215.602727 | 30.443709 | 64.276384 | 45.266949 |
Stream 2 | 211.311651 | 27.403143 | 61.412478 | 12.173058 | 216.879170 | 18.272234 | 96.753886 | 35.587072 |
Stream 3 | 482.581456 | 68.663026 | 60.354163 | 13.408513 | 187.921639 | 17.469237 | 62.337222 | 31.706120 |
Stream 4 | 178.297373 | 23.711312 | 67.129677 | 15.216904 | 328.149575 | 20.258853 | 78.891201 | 84.852368 |
Stream 5 | 209.496498 | 28.346366 | 55.584081 | 9.644075 | 131.622351 | 24.171156 | 80.046801 | 43.625932 |
Stream 6 | 521.691639 | 24.126176 | 72.964805 | 15.311409 | 146.152570 | 34.748843 | 71.957130 | 58.470644 |
Stream 7 | 580.320149 | 17.054563 | 56.172396 | 7.530832 | 200.100326 | 12.444021 | 25.910599 | 75.653693 |
Stream 8 | 472.231674 | 15.064398 | 89.875570 | 42.394675 | 166.589234 | 12.831209 | 81.697881 | 73.821769 |
Min Qi | 177.581204 | 15.064398 | 55.584081 | 7.530832 | 131.622351 | 12.444021 | 25.910599 | 31.706120 |
Max Qi | 580.320149 | 68.663026 | 89.875570 | 42.394675 | 328.149575 | 34.748843 | 96.753886 | 84.852368 |
Avg Qi | 354.188955 | 31.068385 | 66.654897 | 15.998204 | 199.127199 | 21.329908 | 70.233888 | 56.123068 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 12.252670 | 2.593733 | 4.115862 | 16.895672 | 10.183350 | 1.240096 | 18.679685 | 4.876067 |
Stream 1 | 356.740980 | 21.197870 | 30.422216 | 81.779038 | 65.468650 | 3.947503 | 63.933750 | 107.563796 |
Stream 2 | 54.087768 | 10.152604 | 34.940701 | 113.510640 | 70.908809 | 12.316233 | 109.091578 | 283.076004 |
Stream 3 | 52.807104 | 18.525982 | 13.740089 | 212.364908 | 16.413964 | 17.998809 | 58.653503 | 483.718271 |
Stream 4 | 42.389062 | 36.157809 | 28.909260 | 86.427025 | 21.605419 | 7.608729 | 54.910853 | 331.074114 |
Stream 5 | 48.214794 | 15.778893 | 20.681799 | 130.560005 | 43.846752 | 33.905533 | 54.536966 | 139.563667 |
Stream 6 | 84.061840 | 26.224851 | 16.546432 | 117.265210 | 34.766856 | 39.037423 | 0.710642 | 1.645351 |
Stream 7 | 63.034890 | 15.966686 | 31.666488 | 112.689765 | 28.661943 | 12.828171 | 1.274731 | 1.780452 |
Stream 8 | 43.879104 | 8.596666 | 32.585746 | 177.928730 | 26.763334 | 6.112333 | 1.187693 | 0.533668 |
Min Qi | 42.389062 | 8.596666 | 13.740089 | 81.779038 | 16.413964 | 3.947503 | 0.710642 | 0.533668 |
Max Qi | 356.740980 | 36.157809 | 34.940701 | 212.364908 | 70.908809 | 39.037423 | 109.091578 | 483.718271 |
Avg Qi | 93.151943 | 19.075170 | 26.186591 | 129.065665 | 38.554466 | 16.719342 | 43.037465 | 168.619415 |
To be continued...
We take the prototypical cross-partition join in Q13: Make a hash table of all customers, partitioned by c_custkey. This is independently done with full parallelism in each partition. Scan the orders, get the customer (in a different partition), and flag the customers that had at least one order. Then, to get the customers with no orders, return the customers that were not flagged in the previous pass.
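For reference, the query in question is approximately the standard TPC-H Q13 text (shown here with the usual 'special' / 'requests' parameter substitution); the left outer join from customer to orders is what makes the customers-with-no-orders pass necessary:
SELECT c_count, COUNT(*) AS custdist
  FROM ( SELECT c_custkey, COUNT(o_orderkey) AS c_count
           FROM customer LEFT OUTER JOIN orders
             ON c_custkey = o_custkey
            AND o_comment NOT LIKE '%special%requests%'
          GROUP BY c_custkey
       ) c_orders
 GROUP BY c_count
 ORDER BY custdist DESC, c_count DESC;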
The single-server times in Part 12 were 7.8 and 6.0 seconds with a single user; we consider the better of the two. The difference is due to allocating memory on the first go; on the second go, the memory is already in reserve.
With default settings, we get 4595 ms (milliseconds), with per-node resource utilization at:
Cluster 4 nodes, 4 s. 112405 m/s 742602 KB/s 2749% cpu 0% read 4% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 27867 m/s 185654 KB/s 733% cpu 0% read 4% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 28149 m/s 185372 KB/s 672% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 28220 m/s 185621 KB/s 675% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 28150 m/s 185837 KB/s 667% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs
The top line is the summary; the lines below are per-process. m/s is messages per second; KB/s is interconnect traffic per second; clw % is idle time spent waiting for a reply from another process. The cluster is set up with 4 processes across 2 machines, each with 2 NUMA nodes. Each process has affinity to its NUMA node, so it uses local memory only. The time is reasonable in light of the overall CPU of about 2700%. The maximum would be 4800% with all threads of all cores busy all the time.
The catch here is that we do not have a steady half-platform utilization all the time, but full platform peaks followed by synchronization barriers with very low utilization. So, we set the batch size differently:
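-- set cl_dfg_batch_bytes to 50 MB (from the 10 MB default) on every cluster process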
cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 50000000)');
This means that we set, on each process, cl_dfg_batch_bytes to 50M from a default of 10M. The effect is that each scan of orders (one thread per slice, 48 slices total) will produce 50MB worth of o_custkeys to be sent to the other partition for getting the customer. After each 50M, the thread stops and will produce the next batch when all are done and a global continue message is sent by the coordinator.
The time is now 3173 ms with:
Cluster 4 nodes, 3 s. 158220 m/s 1054944 KB/s 3676% cpu 0% read 1% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 39594 m/s 263962 KB/s 947% cpu 0% read 1% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 39531 m/s 263476 KB/s 894% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 39523 m/s 263684 KB/s 933% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 39535 m/s 263586 KB/s 900% cpu 0% read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs
The platform utilization is better as we see. The throughput is nearly double that of the single-server, which is pretty good for a communication-heavy query.
This was done with a vector size of 10K. In other words, each partition gets 10K o_custkeys and splits these 48 ways to go to every recipient: 1/4 are in the same process, 1/4 in a different process on the same machine, and 2/4 on a different machine. The recipient gets messages with an average of 208 o_custkey values, puts them back together in batches of 10K, and passes these to the hash join with customer.
We try different vector sizes, such as 100K:
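-- set the vector (batch) size to 100K values on every cluster process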
cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');
There are two metrics of interest here: the write block time and the scheduling overhead. The write block time, in microseconds, accumulates whenever a thread must wait before it can write to a connection. The scheduling overhead is the cumulative clocks spent by threads waiting for the critical section that deals with dispatching messages to consumer threads. Long messages cause blocking; short messages cause frequent scheduling decisions.
SELECT cl_sys_stat ('local_cll_clk', clr=>1), cl_sys_stat ('write_block_usec', clr=>1) ;
cl_sys_stat gets the counters from all processes and returns the sum. clr=>1 means that the counter is cleared after being read.
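Putting these pieces together, one measurement step looks roughly like the following sketch (the Q13 text itself is elided; only calls already shown above are used):
cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');  -- vector size under test, set on every process
SELECT cl_sys_stat ('local_cll_clk', clr=>1), cl_sys_stat ('write_block_usec', clr=>1);  -- read and clear the counters before the run
-- ... run Q13 here ...
SELECT cl_sys_stat ('local_cll_clk', clr=>1), cl_sys_stat ('write_block_usec', clr=>1);  -- mtx and wblock accumulated during the run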
We do Q13 with vector sizes of 10K, 100K, and 1000K.
Vector size | Time (msec) | Mutex wait clocks (mtx) | Write block µs (wblock) |
---|---|---|---|
10K | 3297 | 10,829,910,329 | 0 |
100K | 3150 | 1,663,238,367 | 59,132 |
1000K | 3876 | 414,631,129 | 4,578,003 |
So, 100K seems to strike the best balance between scheduling and blocking on write.
The times are measured after several samples with each setting. The times stabilize after a few runs, as the appropriate size memory blocks are in reserve. Calling mmap to allocate these on the first run with each size has a very high penalty, e.g., 60s for the first run with 1M vector size. We note that blocking on write is really bad even though 1/3 of the time there is no network and 2/3 of the time there is a fast network (QDR IB) with no other load. Further, the affinities are set so that the thread responsible for incoming messages is always on core. Result variability on consecutive runs is under 5%, which is similar to single-server behavior.
It would seem that a mutex, as bad as it is, is still better than a distributed cause for going off core (blocking on write). The latency for continuing a thread thus blocked is of course higher than the latency for continuing one that is waiting for a mutex.
We note that a cluster with more machines can use a larger vector size, because a vector spreads out to more recipients. The key seems to be to set the message size so that blocking on write is not common. This is a possible adaptive execution feature. We have seen no particular benefit from SDP (Sockets Direct Protocol) and its zero copy; this is a TCP replacement that comes with the InfiniBand drivers.
We will next look at replication/partitioning tradeoffs for hash joins. Then we can look at full runs.
To be continued...
The platform is one node of the CWI cluster which was also used for the 500Gt RDF experiments reported on this blog. The specification is dual Xeon E5 2650v2 (8 core, 16 thread, 2.6 GHz) with 256 GB RAM. The disk setup is a RAID-0 of three 2 TB rotating disks.
For the 100G scale, we go from 240 to 395 (composite Qph, in thousands), which is about 1.64x. The new platform has 16 cores versus 12 and a clock of 2.6 GHz as opposed to 2.3 GHz, which makes a multiplier of about 1.5 (16/12 × 2.6/2.3). The rest of the acceleration is probably attributable to the faster memory clock. In any case, the point that a larger platform gives more speed is made.
The top-level scores per run are as follows; the full numerical summaries are appended. First, at the 100G scale:
Run | Power | Throughput | Composite |
---|---|---|---|
Run 1 | 391,000.1 | 401,029.4 | 395,983.0 |
Run 2 | 388,746.2 | 404,189.3 | 396,392.6 |
At the 300G scale:
Run | Power | Throughput | Composite |
---|---|---|---|
Run 1 | 61,988.7 | 384,883.7 | 154,461.6 |
Run 2 | 423,431.8 | 387,248.6 | 404,936.3 |
Run 3 | 417,672.0 | 389,719.5 | 403,453.7 |
The interested reader may reproduce the results using the feature/analytics branch of the v7fasttrack git repository on GitHub, as described in Part 13.
For the 300G runs, we note a much longer load time; see below, as this is seriously IO bound.
The first power test at 300G is a non-starter, even though it comes right after the bulk load. The data is not in the working set, and getting it from disk is simply an automatic disqualification, unless perhaps one had 300 separate disks. This happens in TPC benchmarks, but not very often in the field. In that first run, the first queries take the longest; by the time the throughput test starts, the working set is there. By an artifact of the metric (use of the geometric mean for the power test), long queries are penalized less there than in the throughput run.
So, we run 3 executions instead of the prescribed 2, to have 2 executions from warm state.
To do 300G well in 256 GB of RAM, one needs either to use several SSDs, or to increase compression and keep all in memory, so no secondary storage at all. In order to keep all in memory, one could have stream-compression on string columns. Stream-compressing strings (e.g., o_comment, l_comment) does not pay if one is already in memory, but if stream-compressing strings eliminates going to secondary storage, then the win is sure.
As before, all caveats apply; the results are unaudited and for information only. Therefore we do not use the official metric name.
Report Date | September 15, 2014 |
---|---|
Database Scale Factor | 100 |
Start of Database Load | 09/15/2014 07:04:08 |
End of Database Load | 09/15/2014 07:15:58 |
Database Load Time | 0:11:50 |
Query Streams for Throughput Test | 5 |
Virt-H Power | 391,000.1 |
Virt-H Throughput | 401,029.4 |
Virt-H Composite Query-per-Hour Metric (Qph@100GB) | 395,983.0 |
Measurement Interval in Throughput Test (Ts) | 98.846000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 09/15/2014 13:13:01 | 09/15/2014 13:13:28 | 0:00:27 |
Stream 1 | 09/15/2014 13:13:29 | 09/15/2014 13:15:06 | 0:01:37 |
Stream 2 | 09/15/2014 13:13:29 | 09/15/2014 13:15:07 | 0:01:38 |
Stream 3 | 09/15/2014 13:13:29 | 09/15/2014 13:15:07 | 0:01:38 |
Stream 4 | 09/15/2014 13:13:29 | 09/15/2014 13:15:04 | 0:01:35 |
Stream 5 | 09/15/2014 13:13:29 | 09/15/2014 13:15:08 | 0:01:39 |
Refresh 0 | 09/15/2014 13:13:01 | 09/15/2014 13:13:03 | 0:00:02 |
09/15/2014 13:13:28 | 09/15/2014 13:13:29 | 0:00:01 | |
Refresh 1 | 09/15/2014 13:14:10 | 09/15/2014 13:14:16 | 0:00:06 |
Refresh 2 | 09/15/2014 13:13:29 | 09/15/2014 13:13:42 | 0:00:13 |
Refresh 3 | 09/15/2014 13:13:42 | 09/15/2014 13:13:53 | 0:00:11 |
Refresh 4 | 09/15/2014 13:13:53 | 09/15/2014 13:14:02 | 0:00:09 |
Refresh 5 | 09/15/2014 13:14:02 | 09/15/2014 13:14:10 | 0:00:08 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 1.442477 | 0.304513 | 0.720263 | 0.351285 | 0.979414 | 0.479455 | 0.865992 | 0.875236 |
Stream 1 | 3.938133 | 0.920533 | 3.738724 | 2.769707 | 3.209728 | 1.339146 | 2.759384 | 3.626868 |
Stream 2 | 4.104738 | 0.952245 | 4.719658 | 0.865586 | 2.139267 | 0.850909 | 2.044402 | 2.600373 |
Stream 3 | 3.692119 | 1.024876 | 3.430172 | 1.579846 | 4.097845 | 1.859468 | 2.312921 | 6.238070 |
Stream 4 | 5.419537 | 0.531571 | 2.116176 | 1.256836 | 4.787617 | 2.117995 | 3.517466 | 3.982180 |
Stream 5 | 5.167029 | 0.746720 | 3.157557 | 1.255182 | 3.004802 | 2.131963 | 3.648316 | 2.835751 |
Min Qi | 3.692119 | 0.531571 | 2.116176 | 0.865586 | 2.139267 | 0.850909 | 2.044402 | 2.600373 |
Max Qi | 5.419537 | 1.024876 | 4.719658 | 2.769707 | 4.787617 | 2.131963 | 3.648316 | 6.238070 |
Avg Qi | 4.464311 | 0.835189 | 3.432457 | 1.545431 | 3.447852 | 1.659896 | 2.856498 | 3.856648 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 2.606044 | 1.117063 | 1.847930 | 0.618534 | 4.327600 | 1.110908 | 0.995289 | 0.975910 |
Stream 1 | 7.463593 | 4.686463 | 4.549733 | 4.168129 | 15.759178 | 5.247666 | 4.495030 | 4.075198 |
Stream 2 | 9.398552 | 5.170904 | 3.934405 | 1.880683 | 19.968787 | 3.767992 | 6.965337 | 3.849845 |
Stream 3 | 7.581069 | 4.109905 | 4.301159 | 2.123634 | 17.683200 | 5.383603 | 4.376887 | 2.854777 |
Stream 4 | 9.927887 | 6.913209 | 3.351489 | 2.802724 | 16.985827 | 3.925148 | 4.691474 | 4.080586 |
Stream 5 | 7.035080 | 3.921425 | 6.844778 | 2.899238 | 14.839509 | 4.986742 | 6.629664 | 4.089547 |
Min Qi | 7.035080 | 3.921425 | 3.351489 | 1.880683 | 14.839509 | 3.767992 | 4.376887 | 2.854777 |
Max Qi | 9.927887 | 6.913209 | 6.844778 | 4.168129 | 19.968787 | 5.383603 | 6.965337 | 4.089547 |
Avg Qi | 8.281236 | 4.960381 | 4.596313 | 2.774882 | 17.047300 | 4.662230 | 5.431678 | 3.789991 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 1.215956 | 0.745257 | 0.699801 | 1.281834 | 1.291110 | 0.518425 | 1.827192 | 1.014431 |
Stream 1 | 5.779854 | 2.383264 | 2.396793 | 6.130511 | 5.002700 | 1.968425 | 4.172437 | 2.427047 |
Stream 2 | 7.828176 | 1.833416 | 3.175649 | 4.785709 | 5.385834 | 1.403290 | 6.383005 | 6.366525 |
Stream 3 | 5.880139 | 1.797383 | 3.258024 | 5.601364 | 6.373216 | 1.977848 | 5.235542 | 6.385010 |
Stream 4 | 3.989621 | 1.252891 | 2.478303 | 4.678629 | 3.212176 | 2.740586 | 5.037995 | 3.911379 |
Stream 5 | 5.030440 | 2.010988 | 4.188428 | 6.221990 | 5.418788 | 2.187718 | 3.589915 | 3.517380 |
Min Qi | 3.989621 | 1.252891 | 2.396793 | 4.678629 | 3.212176 | 1.403290 | 3.589915 | 2.427047 |
Max Qi | 7.828176 | 2.383264 | 4.188428 | 6.221990 | 6.373216 | 2.740586 | 6.383005 | 6.385010 |
Avg Qi | 5.701646 | 1.855588 | 3.099439 | 5.483641 | 5.078543 | 2.055573 | 4.883779 | 4.521468 |
Report Date | September 15, 2014 |
---|---|
Database Scale Factor | 100 |
Total Data Storage/Database Size | 87,312M |
Start of Database Load | 09/15/2014 07:04:08 |
End of Database Load | 09/15/2014 07:15:58 |
Database Load Time | 0:11:50 |
Query Streams for Throughput Test | 5 |
Virt-H Power | 388,746.2 |
Virt-H Throughput | 404,189.3 |
Virt-H Composite Query-per-Hour Metric (Qph@100GB) | 396,392.6 |
Measurement Interval in Throughput Test (Ts) | 98.074000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 09/15/2014 13:15:11 | 09/15/2014 13:15:38 | 0:00:27 |
Stream 1 | 09/15/2014 13:15:39 | 09/15/2014 13:17:13 | 0:01:34 |
Stream 2 | 09/15/2014 13:15:39 | 09/15/2014 13:17:16 | 0:01:37 |
Stream 3 | 09/15/2014 13:15:39 | 09/15/2014 13:17:15 | 0:01:36 |
Stream 4 | 09/15/2014 13:15:39 | 09/15/2014 13:17:17 | 0:01:38 |
Stream 5 | 09/15/2014 13:15:39 | 09/15/2014 13:17:15 | 0:01:36 |
Refresh 0 | 09/15/2014 13:15:11 | 09/15/2014 13:15:12 | 0:00:01 |
09/15/2014 13:15:38 | 09/15/2014 13:15:39 | 0:00:01 | |
Refresh 1 | 09/15/2014 13:16:13 | 09/15/2014 13:16:20 | 0:00:07 |
Refresh 2 | 09/15/2014 13:15:39 | 09/15/2014 13:15:47 | 0:00:08 |
Refresh 3 | 09/15/2014 13:15:47 | 09/15/2014 13:15:56 | 0:00:09 |
Refresh 4 | 09/15/2014 13:15:56 | 09/15/2014 13:16:03 | 0:00:07 |
Refresh 5 | 09/15/2014 13:16:03 | 09/15/2014 13:16:12 | 0:00:09 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 1.467681 | 0.277665 | 0.766102 | 0.365185 | 0.941206 | 0.549381 | 0.938998 | 0.803514 |
Stream 1 | 3.883169 | 1.488521 | 3.366920 | 1.627478 | 3.632321 | 2.065565 | 2.911138 | 2.444544 |
Stream 2 | 3.294589 | 1.138066 | 3.260775 | 1.899615 | 5.367725 | 1.820374 | 3.655119 | 2.186642 |
Stream 3 | 3.797641 | 0.995877 | 3.239690 | 2.483035 | 2.737690 | 1.505998 | 4.058083 | 4.268644 |
Stream 4 | 4.099187 | 0.402685 | 4.704959 | 1.469825 | 5.367910 | 2.783018 | 2.706164 | 2.551061 |
Stream 5 | 3.651273 | 1.598314 | 2.051899 | 1.283754 | 4.711897 | 1.519763 | 2.851300 | 2.484093 |
Min Qi | 3.294589 | 0.402685 | 2.051899 | 1.283754 | 2.737690 | 1.505998 | 2.706164 | 2.186642 |
Max Qi | 4.099187 | 1.598314 | 4.704959 | 2.483035 | 5.367910 | 2.783018 | 4.058083 | 4.268644 |
Avg Qi | 3.745172 | 1.124693 | 3.324849 | 1.752741 | 4.363509 | 1.938944 | 3.236361 | 2.786997 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 2.734812 | 1.115539 | 1.679910 | 0.633239 | 4.391739 | 1.130082 | 1.137284 | 0.919646 |
Stream 1 | 9.271071 | 5.664855 | 3.377869 | 2.148228 | 16.046021 | 2.935643 | 4.897009 | 2.891040 |
Stream 2 | 10.272523 | 4.578427 | 4.086788 | 2.312762 | 16.295728 | 2.714776 | 6.393897 | 2.414951 |
Stream 3 | 7.095213 | 4.544636 | 4.073433 | 2.710320 | 18.789088 | 3.903873 | 5.471600 | 2.994184 |
Stream 4 | 7.567924 | 3.691088 | 3.951049 | 2.207944 | 18.189014 | 4.985841 | 6.568935 | 3.965322 |
Stream 5 | 8.173577 | 4.959777 | 4.736593 | 3.507469 | 17.106990 | 5.405699 | 7.357104 | 3.125788 |
Min Qi | 7.095213 | 3.691088 | 3.377869 | 2.148228 | 16.046021 | 2.714776 | 4.897009 | 2.414951 |
Max Qi | 10.272523 | 5.664855 | 4.736593 | 3.507469 | 18.789088 | 5.405699 | 7.357104 | 3.965322 |
Avg Qi | 8.476062 | 4.687757 | 4.045146 | 2.577345 | 17.285368 | 3.989166 | 6.137709 | 3.078257 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 1.206347 | 0.792013 | 0.699476 | 1.349182 | 1.505387 | 0.543947 | 1.549135 | 0.824344 |
Stream 1 | 5.135036 | 1.873195 | 4.978155 | 5.988226 | 4.705365 | 1.211049 | 4.175947 | 3.579242 |
Stream 2 | 7.656125 | 2.229819 | 2.805272 | 6.629781 | 4.138014 | 1.423334 | 5.165700 | 3.197300 |
Stream 3 | 6.385983 | 2.086301 | 3.450305 | 3.292353 | 5.503905 | 2.302992 | 4.860041 | 3.865383 |
Stream 4 | 6.514967 | 2.876895 | 3.481100 | 1.629007 | 5.715903 | 2.121692 | 3.681208 | 3.347289 |
Stream 5 | 4.100205 | 2.400816 | 2.142291 | 4.710677 | 5.765320 | 1.616445 | 6.095817 | 3.007436 |
Min Qi | 4.100205 | 1.873195 | 2.142291 | 1.629007 | 4.138014 | 1.211049 | 3.681208 | 3.007436 |
Max Qi | 7.656125 | 2.876895 | 4.978155 | 6.629781 | 5.765320 | 2.302992 | 6.095817 | 3.865383 |
Avg Qi | 5.958463 | 2.293405 | 3.371425 | 4.450009 | 5.165701 | 1.735102 | 4.795743 | 3.399330 |
Report Date | September 25, 2014 |
---|---|
Database Scale Factor | 300 |
Start of Database Load | 09/25/2014 16:38:20 |
End of Database Load | 09/25/2014 18:32:06 |
Database Load Time | 1:53:46 |
Query Streams for Throughput Test | 6 |
Virt-H Power | 61,988.7 |
Virt-H Throughput | 384,883.7 |
Virt-H Composite Query-per-Hour Metric (Qph@300GB) | 154,461.6 |
Measurement Interval in Throughput Test (Ts) | 370.498000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 09/25/2014 19:00:29 | 09/25/2014 19:22:25 | 0:21:56 |
Stream 1 | 09/25/2014 19:22:27 | 09/25/2014 19:28:23 | 0:05:56 |
Stream 2 | 09/25/2014 19:22:27 | 09/25/2014 19:28:23 | 0:05:56 |
Stream 3 | 09/25/2014 19:22:27 | 09/25/2014 19:28:26 | 0:05:59 |
Stream 4 | 09/25/2014 19:22:27 | 09/25/2014 19:28:13 | 0:05:46 |
Stream 5 | 09/25/2014 19:22:27 | 09/25/2014 19:28:38 | 0:06:11 |
Stream 6 | 09/25/2014 19:22:27 | 09/25/2014 19:28:38 | 0:06:11 |
Refresh 0 | 09/25/2014 19:00:29 | 09/25/2014 19:03:56 | 0:03:27 |
09/25/2014 19:22:25 | 09/25/2014 19:22:27 | 0:00:02 | |
Refresh 1 | 09/25/2014 19:25:22 | 09/25/2014 19:25:58 | 0:00:36 |
Refresh 2 | 09/25/2014 19:22:27 | 09/25/2014 19:23:11 | 0:00:44 |
Refresh 3 | 09/25/2014 19:23:10 | 09/25/2014 19:23:40 | 0:00:30 |
Refresh 4 | 09/25/2014 19:23:40 | 09/25/2014 19:24:21 | 0:00:41 |
Refresh 5 | 09/25/2014 19:24:21 | 09/25/2014 19:24:58 | 0:00:37 |
Refresh 6 | 09/25/2014 19:24:59 | 09/25/2014 19:25:22 | 0:00:23 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 183.735463 | 95.826361 | 79.826802 | 87.603164 | 47.099641 | 1.301704 | 2.606488 | 52.667426 |
Stream 1 | 9.400003 | 1.983777 | 15.839250 | 3.001843 | 15.593335 | 6.067716 | 8.870516 | 11.679706 |
Stream 2 | 12.634711 | 3.472203 | 13.683075 | 8.057952 | 16.500741 | 5.403771 | 11.181661 | 12.393932 |
Stream 3 | 10.807287 | 3.793587 | 15.844244 | 3.214977 | 15.960600 | 7.099744 | 10.424530 | 21.001623 |
Stream 4 | 11.900829 | 3.741707 | 14.219904 | 5.616907 | 16.487144 | 14.229782 | 11.100193 | 8.769539 |
Stream 5 | 13.933423 | 2.916529 | 19.453452 | 5.258843 | 16.706269 | 7.948711 | 8.982104 | 17.566729 |
Stream 6 | 17.084445 | 0.738683 | 11.503079 | 8.324812 | 23.483917 | 20.101834 | 9.207737 | 10.311292 |
Min Qi | 9.400003 | 0.738683 | 11.503079 | 3.001843 | 15.593335 | 5.403771 | 8.870516 | 8.769539 |
Max Qi | 17.084445 | 3.793587 | 19.453452 | 8.324812 | 23.483917 | 20.101834 | 11.181661 | 21.001623 |
Avg Qi | 12.626783 | 2.774414 | 15.090501 | 5.579222 | 17.455334 | 10.141926 | 9.961123 | 13.620470 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 41.997798 | 2.727870 | 21.651730 | 25.704209 | 293.103984 | 3.171437 | 2.886688 | 5.298823 |
Stream 1 | 29.662265 | 22.788618 | 12.979253 | 7.121358 | 62.774323 | 22.132581 | 22.616793 | 21.625334 |
Stream 2 | 28.041750 | 22.481172 | 19.262140 | 5.790272 | 58.105179 | 16.809177 | 32.813330 | 12.692499 |
Stream 3 | 32.534297 | 15.460256 | 12.038047 | 7.012926 | 59.413740 | 18.540284 | 25.968635 | 16.716208 |
Stream 4 | 28.759993 | 15.123651 | 21.734471 | 6.920480 | 63.119744 | 12.848884 | 21.372432 | 11.662102 |
Stream 5 | 18.315308 | 21.781800 | 26.141212 | 8.230858 | 60.985590 | 22.369824 | 27.098660 | 25.283066 |
Stream 6 | 31.455961 | 27.078707 | 12.954580 | 11.081669 | 72.483462 | 12.376376 | 22.129120 | 11.439147 |
Min Qi | 18.315308 | 15.123651 | 12.038047 | 5.790272 | 58.105179 | 12.376376 | 21.372432 | 11.439147 |
Max Qi | 32.534297 | 27.078707 | 26.141212 | 11.081669 | 72.483462 | 22.369824 | 32.813330 | 25.283066 |
Avg Qi | 28.128262 | 20.785701 | 17.518284 | 7.692927 | 62.813673 | 17.512854 | 25.333162 | 16.569726 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 7.793403 | 81.545934 | 41.648484 | 4.638731 | 25.003179 | 0.536267 | 206.980380 | 2.501589 |
Stream 1 | 27.058060 | 3.894254 | 8.664394 | 25.315007 | 11.921265 | 3.561859 | 22.936601 | 13.235777 |
Stream 2 | 25.718500 | 6.140657 | 8.856586 | 14.761290 | 11.870351 | 7.728217 | 13.882613 | 29.328859 |
Stream 3 | 15.896774 | 8.631035 | 15.742406 | 20.621604 | 13.370582 | 5.536313 | 14.677463 | 14.772753 |
Stream 4 | 22.458327 | 5.319241 | 11.973431 | 22.344017 | 11.534642 | 2.402683 | 24.214115 | 16.236299 |
Stream 5 | 13.407745 | 5.413278 | 8.800650 | 18.055743 | 17.528827 | 4.173171 | 15.927165 | 21.636801 |
Stream 6 | 8.069721 | 5.531066 | 13.233927 | 21.321389 | 7.622026 | 12.064182 | 11.457848 | 12.342336 |
Min Qi | 8.069721 | 3.894254 | 8.664394 | 14.761290 | 7.622026 | 2.402683 | 11.457848 | 12.342336 |
Max Qi | 27.058060 | 8.631035 | 15.742406 | 25.315007 | 17.528827 | 12.064182 | 24.214115 | 29.328859 |
Avg Qi | 18.768188 | 5.821588 | 11.211899 | 20.403175 | 12.307949 | 5.911071 | 17.182634 | 17.925471 |
Report Date | September 25, 2014 |
---|---|
Database Scale Factor | 300 |
Start of Database Load | 09/25/2014 16:38:20 |
End of Database Load | 09/25/2014 18:32:06 |
Database Load Time | 1:53:46 |
Query Streams for Throughput Test | 6 |
Virt-H Power | 423,431.8 |
Virt-H Throughput | 387,248.6 |
Virt-H Composite Query-per-Hour Metric (Qph@300GB) | 404,936.3 |
Measurement Interval in Throughput Test (Ts) | 368.236000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 09/25/2014 19:28:42 | 09/25/2014 19:29:58 | 0:01:16 |
Stream 1 | 09/25/2014 19:30:00 | 09/25/2014 19:36:04 | 0:06:04 |
Stream 2 | 09/25/2014 19:30:00 | 09/25/2014 19:36:00 | 0:06:00 |
Stream 3 | 09/25/2014 19:30:00 | 09/25/2014 19:36:06 | 0:06:06 |
Stream 4 | 09/25/2014 19:30:00 | 09/25/2014 19:36:07 | 0:06:07 |
Stream 5 | 09/25/2014 19:30:00 | 09/25/2014 19:35:53 | 0:05:53 |
Stream 6 | 09/25/2014 19:30:00 | 09/25/2014 19:36:08 | 0:06:08 |
Refresh 0 | 09/25/2014 19:28:41 | 09/25/2014 19:28:46 | 0:00:05 |
09/25/2014 19:29:58 | 09/25/2014 19:30:00 | 0:00:02 | |
Refresh 1 | 09/25/2014 19:32:23 | 09/25/2014 19:32:55 | 0:00:32 |
Refresh 2 | 09/25/2014 19:30:00 | 09/25/2014 19:30:31 | 0:00:31 |
Refresh 3 | 09/25/2014 19:30:31 | 09/25/2014 19:31:00 | 0:00:29 |
Refresh 4 | 09/25/2014 19:31:01 | 09/25/2014 19:31:23 | 0:00:22 |
Refresh 5 | 09/25/2014 19:31:23 | 09/25/2014 19:31:54 | 0:00:31 |
Refresh 6 | 09/25/2014 19:31:55 | 09/25/2014 19:32:23 | 0:00:28 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 4.197427 | 1.011516 | 2.535959 | 0.858781 | 2.857279 | 1.293530 | 2.682266 | 2.260502 |
Stream 1 | 15.467757 | 3.517499 | 13.820864 | 4.157259 | 13.141556 | 10.902710 | 16.899687 | 8.986535 |
Stream 2 | 15.639991 | 6.026485 | 13.521624 | 3.918031 | 17.336458 | 1.975310 | 9.718194 | 15.165247 |
Stream 3 | 14.891929 | 4.481383 | 15.322621 | 5.272911 | 15.266543 | 6.771253 | 13.430646 | 20.171084 |
Stream 4 | 14.560526 | 2.464157 | 11.567112 | 5.526629 | 20.531540 | 5.225971 | 16.288606 | 17.209475 |
Stream 5 | 10.390577 | 3.549165 | 9.598328 | 8.783847 | 17.351211 | 6.308214 | 12.606512 | 13.035716 |
Stream 6 | 16.275922 | 4.086475 | 14.109963 | 4.385887 | 10.174709 | 6.703266 | 8.936217 | 16.798526 |
Min Qi | 10.390577 | 2.464157 | 9.598328 | 3.918031 | 10.174709 | 1.975310 | 8.936217 | 8.986535 |
Max Qi | 16.275922 | 6.026485 | 15.322621 | 8.783847 | 20.531540 | 10.902710 | 16.899687 | 20.171084 |
Avg Qi | 14.537784 | 4.020861 | 12.990085 | 5.340761 | 15.633670 | 6.314454 | 12.979977 | 15.227764 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 8.300092 | 2.598145 | 5.168418 | 1.619399 | 11.958836 | 3.191672 | 3.097822 | 2.497410 |
Stream 1 | 26.412829 | 17.354745 | 12.942454 | 8.169447 | 58.600101 | 15.227942 | 32.985324 | 13.914978 |
Stream 2 | 34.523245 | 17.635531 | 15.193748 | 8.435375 | 62.442800 | 16.276300 | 26.533303 | 12.414575 |
Stream 3 | 25.334301 | 18.595422 | 11.663933 | 10.029387 | 63.664992 | 20.378320 | 24.760768 | 15.710589 |
Stream 4 | 36.971957 | 15.645673 | 14.672851 | 13.196301 | 58.214728 | 17.375053 | 26.581101 | 11.624989 |
Stream 5 | 30.891797 | 12.993365 | 14.089049 | 10.515091 | 65.232712 | 20.807026 | 26.920526 | 11.362095 |
Stream 6 | 38.143281 | 21.106772 | 15.152299 | 18.845766 | 66.240343 | 12.295624 | 22.510610 | 18.081103 |
Min Qi | 25.334301 | 12.993365 | 11.663933 | 8.169447 | 58.214728 | 12.295624 | 22.510610 | 11.362095 |
Max Qi | 38.143281 | 21.106772 | 15.193748 | 18.845766 | 66.240343 | 20.807026 | 32.985324 | 18.081103 |
Avg Qi | 32.046235 | 17.221918 | 13.952389 | 11.531894 | 62.399279 | 17.060044 | 26.715272 | 13.851388 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 4.016212 | 1.603004 | 1.836489 | 3.542383 | 3.901876 | 0.515102 | 4.759612 | 2.358873 |
Stream 1 | 22.162387 | 10.067834 | 15.772705 | 22.091355 | 12.974776 | 8.354196 | 19.342171 | 12.771250 |
Stream 2 | 25.647926 | 4.263008 | 11.590737 | 19.179326 | 17.899770 | 4.137031 | 15.720245 | 14.719776 |
Stream 3 | 14.511279 | 7.484608 | 20.735250 | 13.041037 | 17.139046 | 6.014141 | 16.234122 | 13.454647 |
Stream 4 | 19.297494 | 10.110707 | 10.907458 | 19.649066 | 15.206251 | 3.423503 | 11.268082 | 11.852223 |
Stream 5 | 17.445165 | 5.582309 | 15.266324 | 19.788382 | 14.245770 | 2.810949 | 16.601461 | 14.019717 |
Stream 6 | 25.115339 | 6.896503 | 11.661563 | 21.900028 | 5.520025 | 3.093050 | 15.436258 | 13.353446 |
Min Qi | 14.511279 | 4.263008 | 10.907458 | 13.041037 | 5.520025 | 2.810949 | 11.268082 | 11.852223 |
Max Qi | 25.647926 | 10.110707 | 20.735250 | 22.091355 | 17.899770 | 8.354196 | 19.342171 | 14.719776 |
Avg Qi | 20.696598 | 7.400828 | 14.322339 | 19.274866 | 13.830940 | 4.638812 | 15.767057 | 13.361843 |
Report Date | September 25, 2014 |
---|---|
Database Scale Factor | 300 |
Total Data Storage/Database Size | 258,888M |
Start of Database Load | 09/25/2014 16:38:20 |
End of Database Load | 09/25/2014 18:32:06 |
Database Load Time | 1:53:46 |
Query Streams for Throughput Test | 6 |
Virt-H Power | 417,672.0 |
Virt-H Throughput | 389,719.5 |
Virt-H Composite Query-per-Hour Metric (Qph@300GB) | 403,453.7 |
Measurement Interval in Throughput Test (Ts) | 365.902000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 09/25/2014 19:36:11 | 09/25/2014 19:37:29 | 0:01:18 |
Stream 1 | 09/25/2014 19:37:32 | 09/25/2014 19:43:13 | 0:05:41 |
Stream 2 | 09/25/2014 19:37:32 | 09/25/2014 19:43:31 | 0:05:59 |
Stream 3 | 09/25/2014 19:37:32 | 09/25/2014 19:43:37 | 0:06:05 |
Stream 4 | 09/25/2014 19:37:32 | 09/25/2014 19:43:33 | 0:06:01 |
Stream 5 | 09/25/2014 19:37:32 | 09/25/2014 19:43:32 | 0:06:00 |
Stream 6 | 09/25/2014 19:37:32 | 09/25/2014 19:43:37 | 0:06:05 |
Refresh 0 | 09/25/2014 19:36:12 | 09/25/2014 19:36:16 | 0:00:04 |
09/25/2014 19:37:29 | 09/25/2014 19:37:31 | 0:00:02 | |
Refresh 1 | 09/25/2014 19:40:02 | 09/25/2014 19:40:33 | 0:00:31 |
Refresh 2 | 09/25/2014 19:37:31 | 09/25/2014 19:38:01 | 0:00:30 |
Refresh 3 | 09/25/2014 19:38:01 | 09/25/2014 19:38:30 | 0:00:29 |
Refresh 4 | 09/25/2014 19:38:30 | 09/25/2014 19:38:58 | 0:00:28 |
Refresh 5 | 09/25/2014 19:38:58 | 09/25/2014 19:39:27 | 0:00:29 |
Refresh 6 | 09/25/2014 19:39:27 | 09/25/2014 19:40:01 | 0:00:34 |
Query | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 4.305006 | 1.083442 | 2.502758 | 0.845763 | 2.840824 | 1.346166 | 2.659511 | 2.233550 |
Stream 1 | 11.513360 | 3.732513 | 14.530428 | 3.819517 | 14.821291 | 7.561547 | 10.435082 | 8.984230 |
Stream 2 | 13.486433 | 3.373689 | 9.620363 | 3.914320 | 16.857542 | 5.837487 | 10.695443 | 17.901191 |
Stream 3 | 11.015942 | 1.780220 | 4.830412 | 9.073543 | 15.587709 | 9.661989 | 12.374931 | 15.262485 |
Stream 4 | 13.600461 | 0.820899 | 12.254226 | 7.799415 | 19.860761 | 13.145017 | 14.404345 | 11.807583 |
Stream 5 | 13.358000 | 3.885118 | 11.099935 | 4.845043 | 18.286721 | 6.424272 | 9.735255 | 15.041608 |
Stream 6 | 13.588873 | 3.789631 | 13.503399 | 5.130389 | 13.104065 | 3.517076 | 14.929079 | 19.831639 |
Min Qi | 11.015942 | 0.820899 | 4.830412 | 3.819517 | 13.104065 | 3.517076 | 9.735255 | 8.984230 |
Max Qi | 13.600461 | 3.885118 | 14.530428 | 9.073543 | 19.860761 | 13.145017 | 14.929079 | 19.831639 |
Avg Qi | 12.760511 | 2.897012 | 10.973127 | 5.763705 | 16.419681 | 7.691231 | 12.095689 | 14.804789 |
Query | Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 8.553183 | 3.215484 | 4.652364 | 1.620089 | 11.936052 | 2.916132 | 3.219969 | 2.374276 |
Stream 1 | 29.441108 | 20.348266 | 9.994556 | 14.965432 | 60.537168 | 13.302875 | 30.159402 | 10.277570 |
Stream 2 | 41.799347 | 18.197400 | 16.773638 | 6.510347 | 67.461446 | 20.362328 | 0.109929 | 9.908769 |
Stream 3 | 24.306937 | 20.555376 | 17.140758 | 16.715188 | 61.724168 | 22.469230 | 27.967206 | 13.434167 |
Stream 4 | 34.820796 | 11.795664 | 18.015120 | 7.176057 | 63.134711 | 11.427374 | 23.959842 | 16.759246 |
Stream 5 | 23.139366 | 12.655317 | 13.152401 | 7.258740 | 64.273225 | 22.854106 | 28.803059 | 12.832364 |
Stream 6 | 27.955059 | 24.633526 | 11.046285 | 5.995041 | 74.965966 | 15.636579 | 22.803890 | 13.221303 |
Min Qi | 23.139366 | 11.795664 | 9.994556 | 5.995041 | 60.537168 | 11.427374 | 0.109929 | 9.908769 |
Max Qi | 41.799347 | 24.633526 | 18.015120 | 16.715188 | 74.965966 | 22.854106 | 30.159402 | 16.759246 |
Avg Qi | 30.243769 | 18.030925 | 14.353793 | 9.770134 | 65.349447 | 17.675415 | 22.300555 | 12.738903 |
Query | Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 |
---|---|---|---|---|---|---|---|---|
Stream 0 | 4.298092 | 1.702071 | 1.894548 | 4.118591 | 3.922889 | 0.491145 | 4.519734 | 2.347913 |
Stream 1 | 16.432222 | 6.908918 | 17.749058 | 18.756674 | 11.148628 | 5.464975 | 18.300673 | 12.972871 |
Stream 2 | 20.588544 | 4.387662 | 14.527229 | 23.844364 | 15.500462 | 15.543458 | 13.666574 | 15.240662 |
Stream 3 | 14.008049 | 6.222633 | 12.833421 | 22.811602 | 16.013232 | 9.449069 | 16.486111 | 12.974515 |
Stream 4 | 16.964699 | 8.106044 | 11.207675 | 22.483826 | 17.354675 | 4.641183 | 14.583941 | 13.679087 |
Stream 5 | 25.243144 | 7.359437 | 16.986615 | 19.855391 | 17.183725 | 5.750937 | 14.759597 | 13.052316 |
Stream 6 | 12.986721 | 10.160993 | 17.496662 | 19.267026 | 17.300224 | 4.955930 | 19.267721 | 15.421241 |
Min Qi | 12.986721 | 4.387662 | 11.207675 | 18.756674 | 11.148628 | 4.641183 | 13.666574 | 12.972871 |
Max Qi | 25.243144 | 10.160993 | 17.749058 | 23.844364 | 17.354675 | 15.543458 | 19.267721 | 15.421241 |
Avg Qi | 17.703896 | 7.190948 | 15.133443 | 21.169814 | 15.750158 | 7.634259 | 16.177436 | 13.890115 |
To be continued...
In practice, things are not quite so simple. Larger data, particularly a different data-to-memory ratio, and the fact of having no shared memory, all play a role. There is also a network, so partitioned operations, which also existed in the single-server case, now have to send messages across machines, not across threads. For data loading and refreshes, there is generally no shared file system, so data distribution and parallelism have to be considered.
As an initial pass, we look at 100G and 1000G scales on the same test system as before. This is two machines, each with dual Xeon E5-2630, 192 GB RAM, 2 x 512 GB SSD, and QDR InfiniBand. We will also try other platforms, but if nothing else is said, this is the test system.
As of this writing, there is a working implementation, but it is not guaranteed to be optimal as yet. We will adjust it as we go through the workload. One outcome of the experiment will be a precise determination of the data-volume-to-RAM ratio that still gives good performance.
A priori, we know of the following things that complicate life with clusters:
Distributed memory — The working set must be in memory for a run to have a competitive score. A cluster can have a lot of memory, and the data is such that it partitions very evenly, so at first this appears not to be a problem. The difficulty comes with query memory: If each machine has 1/16th of the total RAM and a hash table would be 1/64th of the working set, on a single server it is no problem just building the hash table. On a scale-out system, the hash table, if replicated on each node, would be 1/4 of that node's share of the working set (the whole 1/64th on a node holding 1/16th of the data), which will not fit, especially if there are many such hash tables at the same time. Two main approaches exist: The hash table can be partitioned, but this will force the probe to go cross-partition, which takes time. The other possibility is to build the hash table many times, each time with a fraction of the data, and to run the probe side many times. Since hash tables often have Bloom filters, it is sometimes possible to replicate the Bloom filter and partition the hash table. One has also heard of hash tables that go to secondary storage, but should this happen, the race is already lost; so, we do not go there.
We must evaluate different combinations of these techniques and have a cost model that accurately predicts the performance of each variant. Adding realism to the model is always safe, but moderately difficult to do.
NUMA — Most servers are NUMA (non-uniform memory architecture), where each CPU socket has its own local memory. For single-server cases, we use all the memory for the process. Some implementations have special logic for memory affinity between threads. With scale-out there is the choice of having a server process per-NUMA-node or per-physical-machine. If per-NUMA-node, we are guaranteed only local memory accesses. This is a tradeoff to be evaluated.
Network and Scheduling — Execution on a cluster is always vectored, for the simple reason that sending single-tuple messages is unfeasible in terms of performance. With an otherwise vectored architecture, the message batching required on a cluster comes naturally. However, the larger the cluster, the more partitions there are, which rapidly gets into shorter messages. Increasing the vector size is possible and messages become longer, but indefinite increase in vector size has drawbacks for cache locality and takes memory. To run well, each thread must stay on core. There are two ways of being taken off core ahead of time: Blocking for a mutex, and blocking for network. Lots of short messages run into scheduling overhead, since the recipient must decide what to do with each, which is not really possible without some sort of critical section. This is more efficient if messages are longer, as the decision time does not depend on message length. Longer messages are however liable to block on write at the sender side. So one pays in either case. This is another tradeoff to be balanced.
Flow control — A query is a pipeline of producers and consumers. Sometimes the consumer is in a different partition. The producer must not get indefinitely ahead of the consumer because this would run out of memory, but it must stay sufficiently ahead so as not to stop the consumer. In practice, there are synchronization barriers to check even progress. These will decrease platform utilization, because two threads never finish at exactly the same time. The price of not having these is having no cap on transient memory consumption.
Non-uniform performance — Identical machines do not always perform identically. This is seen especially with disk, where wear on SSDs can affect write speed, and where uncontrollable hazards of data placement lead to uneven read speeds on rotating media. Purely memory-bound performance is quite close, though. Unpredictable and uncontrollable hazards of scheduling cause network messages to arrive at different times, which introduces variation in run time on consecutive runs. Single servers have some such variation from threading, but the effects are larger with a network.
The logical side of query optimization stays the same. Pushing down predicates is always good, and all the logical tricks with moving conditions between subqueries stay the same.
Schema design stays much the same, but there is the extra question of partitioning keys. In this implementation, there are only indices on identifiers, not on dates, for example. So, for a primary key to foreign key join, if there is an index on the foreign key, the index should be partitioned the same way as the primary key. So, joining from orders to lineitem on orderkey will be co-located. Joining from customer to orders by index will be co-located for the c_custkey = o_custkey part (assuming an index on o_custkey) and cross-partition for getting the customer row on c_custkey, supposing that the query needs some property of the customer other than c_custkey itself.
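A minimal sketch of the two cases, using standard TPC-H names (the particular columns selected are illustrative only):
-- Co-located: orders and lineitem share the orderkey partitioning.
SELECT o_orderkey, COUNT(*)
  FROM orders, lineitem
 WHERE l_orderkey = o_orderkey
 GROUP BY o_orderkey;

-- The c_custkey = o_custkey part is co-located (given an index on o_custkey),
-- but fetching a customer column such as c_name means visiting whichever
-- partition holds that customer row.
SELECT c_name, COUNT(*)
  FROM orders, customer
 WHERE o_custkey = c_custkey
 GROUP BY c_name;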
A secondary question is the partition granularity. For good compression, nearby values should be consecutive, so here we leave the low 12 bits out of the partitioning. This has an effect on bulk load and refreshes, for example: a batch of 10,000 lineitems ordered on l_orderkey will go to only 2 or 3 distinct destinations, thus getting longer messages and longer insert batches, which is more efficient.
This is a quick overview of the wisdom so far. In subsequent installments, we will take a quantitative look at the tradeoffs and consider actual queries. As a conclusion, we will show a full run on a couple of different platforms, and likely provide Amazon machine images for the interested to see for themselves. Virtuoso Cluster is not open source, but the cloud will provide easy access.
To be continued...
The specializations converge. The RDBMS becomes more adaptable and less schema-first. Of course, the RDBMS also takes on new data models beside the relational: RDF and other property graph models, for instance.
The schema-last-ness is now well in evidence. For example, PostgreSQL has an hstore column type, which is a list of key-value pairs. Vertica has a feature called flex tables, where a column can be added on a row-by-row basis.
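As a rough sketch of the PostgreSQL side (the table and keys below are hypothetical, just to show the schema-last flavor):
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE product (id INTEGER PRIMARY KEY, attrs hstore);
INSERT INTO product VALUES (1, 'color => "red", voltage => "220"');
SELECT id, attrs -> 'color' FROM product WHERE attrs ? 'voltage';  -- rows that happen to carry a voltage key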
Specialized indexing for text and geometries is a well-established practice. However, dedicated IR systems, often Lucene derivatives, can offer more transparency in the IR domain for things like vector-space models and hit scoring. There is specialized faceted search support which is quite good. I do not know of an RDBMS that would do the exact same trick as Lucene for facets, but, of course, in the forever-expanding scope of the RDBMS, this is added easily enough.
JSON is all the rage in the web developer world. Phil Archer even said in his keynote, as a parody of the web developer: "I will never touch that crap of RDF or the semantic web; this is a pipe dream of reality-ignoring academics and I will not have it. I will only use JSON-LD."
XML and JSON are much the same thing. While most databases have had XML support for over a decade, there is a crop of specialized JSON systems like MongoDB. PostgreSQL also has a JSON datatype. Unsurprisingly, MarkLogic too has JSON, as this is pretty much the same thing as their core competence of XML.
Virtuoso, too, naturally has a JSON parser, and mapping this to the native XML data type is a non-issue. This should probably be done.
Stefano Bertolo of the EC, also the LOD2 project officer, used the term "Cambrian explosion" when talking about the proliferation of new database approaches in recent years.
Hadoop is a big factor in some environments. Actian Vector (née VectorWise), for example, can use this as its file system. HDFS is singularly cumbersome for this but still not impossible and riding the Hadoop bandwagon makes this adaptation likely worthwhile.
Graphs are popular in database research. We have a good deal of exposure to this via LDBC. Going back to an API for database access, as is often done in graph databases, can have its point, especially as a reaction to the opaque and sometimes hard-to-predict query optimization of declarative languages. This just keeps getting more complex, so a counter-reaction is understandable. APIs are good if crossed infrequently and bad otherwise. So, my prediction, and even my recommendation in LDBC deliverables, is that graph database APIs will develop vectoring.
So, there are diverse responses to the same evolutionary pressures. These are, of initial necessity, one-off special-purpose systems, since that keeps the time to solution manageable. Doing these things inside an RDBMS usually takes longer. The geek also likes to start from scratch. Well, not always, as there have been some cases of grafting entirely non-MySQL-like functionality (e.g., Infobright and Kickfire) onto MySQL.
From the Virtuoso angle, adding new data and control structures has been done many times. There is no reason why this cannot continue. The next instances will consist of some graph processing (BSP, or Bulk Synchronous Parallel) in the query languages. Another recent example is an interface for pluggable specialized content indices. One can make chemical structure indices, use alternate full-text indices, etc., with this.
Most of this diversification has to do with physical design. The common logical side is a demand for more flexibility in schema and sometimes in scaling, e.g., various forms of elasticity in growing scale-out clusters, especially with the big web players.
The diversification is a fact, but the results tend to migrate into the RDBMS given enough time.
On the other hand, when a new species like the RDF store emerges, with products that do this and no other thing and are numerous enough to form a market, the RDBMS functionality seeps in. Bigdata has a sort of multicolumn table feature, if I am not mistaken. We just heard about the wish for strict schema, views, and triggers. By all means.
From the Virtuoso angle, with structure awareness, the difference between SQL and RDF gradually fades, and any advance can be exploited to equal effect on either side.
Right now, I would say we have convergence when all the experimental streams feel many of the same necessities.
Of course you cannot have a semantic tech conference without the matter of the public SPARQL end point coming up. The answer is very simple: If you have operational need for SPARQL accessible data, you must have your own infrastructure. No public end points. Public end points are for lookups and discovery; sort of a dataset demo. If operational data is in all other instances the responsibility of the one running the operation, why should it be otherwise here? Outsourcing is of course possible, either for platform (cloud) or software (SaaS). To outsource something with a service level, the service level must be specifiable. A service level cannot be specified in terms of throughput with arbitrary queries but in terms of well defined transactions; hence the services world runs via APIs, as in the case of Open PHACTS. For arbitrary queries (i.e., analytics on demand), with the huge variation in performance dependent on query plans and configuration of schema, the best is to try these things with platform on demand in a cloud. Like this, there can be a clear understanding of performance, which cannot be had with an entirely uncontrolled concurrent utilization. For systems in constant operation, having one's own equipment is cheaper, but still might be impossible to procure due to governance.
Having clarified this, the incentives for operators also become clearer. A public end point is a free evaluation; a SaaS deal or product sale is the commercial offering.
Anyway, common datasets like DBpedia are available preconfigured on AWS with a Virtuoso server. For larger data, there is a point to making ready-to-run cluster configurations available for evaluation, now that AWS has suitable equipment (e.g., dual E5 2670 with 240 GB RAM and SSD for USD 2.8 an hour). According to Amazon, up to five of these are available at a time without special request. We will try this during the fall and make the images available.
After the talk, my answer was that naturally the existence of something that expressed the same sort of thing as SQL DDL, with W3C backing, can only be a good thing and will give the structure awareness work by OpenLink in Virtuoso and probably others a more official seal of approval. Quite importantly, this will be a facilitator of interoperability and will raise this from a product specific optimization trick to a respectable, generally-approved piece of functionality.
This is the general gist of the matter and can hardly be otherwise. But underneath is a whole world of details, which we discussed at the reception.
Phil noted that there was controversy around whether a lightweight OWL-style representation or SPIN should function as the basis for data shapes.
Phil stated in the keynote that the W3C considered the RDF series of standards as good and complete, but would still have working groups for filling in gaps as these came up. This is what I had understood from my previous talks with him at the Linking Geospatial Data workshop in London earlier this year.
So, against this backdrop, as well as what I had discussed with Ralph Hodgson of Top Quadrant at a previous LDBC TUC meeting in Amsterdam, SPIN seems to me a good fit.
Now, it turns out that we are talking about two different use cases. Phil said that the RDF Data Shapes use case was about making explicit what applications required of data. For example, all products should have a unit price, and this should have one value that is a number.
The SPIN proposition on the other hand, as Ralph himself put it in the LDBC meeting, is providing to the linked data space functionality that roughly corresponds to SQL views. Well, this is one major point, but SPIN involves more than this.
So, is it DDL or views? These are quite different. I proposed to Phil that there was in fact little point in fighting over this; best to just have two profiles.
To be quite exact, even SQL DDL equivalence is tricky, since enforcing this requires a DBMS; consider, for instance, foreign key and check constraints. At the reception, Phil stressed that SPIN was certainly good but since it could not be conceived without a SPARQL implementation, it was too heavy to use as a filter for an application that, for example, just processed a stream of triples.
The point, as I see it, is that there is a wish to have data shape enforcement, at least to a level, in a form that can apply to a stream without random access capability or general purpose query language. This can make sense for some big data style applications, like an ETL-stage pre-cooking of data before the application. Applications mostly run against a DBMS, but in some cases, this could be a specialized map-reduce or graph analytics job also, so no low cost random access.
My own take is that views are quite necessary, especially for complex query; this is why Virtuoso has the SPARQL macro extension. This will do, by query expansion, a large part of what general purpose inference will do, except for complex recursive cases. Simple recursive cases come down to transitivity and still fit the profile. SPIN is a more generic thing, but has a large intersection with SPARQL macro functionality.
My other take is that structure awareness needs a way of talking about structure. This is a use case that is clearly distinct from views.
A favorite example of mine is the business rule that a good customer is one that has ordered more than 5 times in the last year, for a total of more than so much, and has no returns or complaints. This can be stated as a macro or SPIN rule with some aggregates and existences. This cannot be stated in any of the OWL profiles. When presented with this, Phil said that this was not the use case. Fair enough. I would not want to describe what amounts to SQL DDL in these terms either.
A related topic that has come up in other conversations is the equivalent of the trigger. One use case of this is enforcement of business rules and complex access rights for updates. So, we see that the whole RDBMS repertoire is getting recreated.
Now, talking from the viewpoint of the structure-aware RDF store, or the triple-stream application for that matter, I will outline some of what data shapes should do. The triggers and views matter is left out, here.
The commonality of bulk load, ETL, and stream processing is that they should not rely on arbitrary database access; this would slow them down. Still, they must check various sorts of things.
All these checks depend on previous triples about the subject; for example, these checks may be conditional on the subject having a certain RDF type. In a data model with a join per attribute, some joining cannot be excluded. Checking conditions that can be resolved one triple at a time is probably not enough, at least not for the structure-aware RDF store case.
But, to avoid arbitrary joins which would require a DBMS, we have to introduce a processing window. The triples in the window must be cross-checkable within the window. With RDF set semantics, some reference data may be replicated among processing windows (e.g., files) with no ill effect.
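To make the idea concrete, here is a minimal C++ sketch of such a windowed check. The triple layout, the window being one file held in memory, and the particular rule (every subject typed as a product must carry a unit price) are illustrative assumptions on my part, not part of any data shapes proposal.

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// One parsed triple; a real loader would use IRI/literal IDs, not strings.
struct Triple { std::string s, p, o; };

// Check, within one processing window (e.g., one input file), that every
// subject declared to be an ex:Product also has an ex:unitPrice. Only the
// triples inside the window are consulted; there is no database access.
std::vector<std::string> check_window(const std::vector<Triple>& window) {
    std::unordered_map<std::string, bool> is_product, has_price;
    for (const Triple& t : window) {
        if (t.p == "rdf:type" && t.o == "ex:Product") is_product[t.s] = true;
        if (t.p == "ex:unitPrice") has_price[t.s] = true;
    }
    std::vector<std::string> violations;
    for (const auto& it : is_product)
        if (!has_price.count(it.first)) violations.push_back(it.first);
    return violations;
}

int main() {
    std::vector<Triple> window = {
        {"ex:p1", "rdf:type", "ex:Product"},
        {"ex:p1", "ex:unitPrice", "\"9.90\""},
        {"ex:p2", "rdf:type", "ex:Product"}   // no unit price in this window
    };
    for (const std::string& s : check_window(window))
        std::cout << s << " fails the shape\n";
}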
A version of foreign key declarations is useful. To fit within a processing window, complete enforcement may not be possible but the declaration should still be possible, a little like in SQL where one can turn off checking.
In SQL, it is conventional to name columns by prefixing them with an abbreviation of the table name. All the TPC schemas are like that, for example. Generally in coding, it is good to prefix names with data type or subsystem abbreviation. In RDF, this is not the practice. For reuse of vocabularies, where a property may occur in anything, the namespace or other prefix denotes where the property comes from, not where it occurs.
So, in TPC-H, l_partkey and ps_partkey are both foreign keys that refer to part, and l_partkey is also part of a composite foreign key to partsupp. By RDF practices, these would both be called rdfh:hasPart. So, depending on which subject type we have, rdfh:hasPart is 30:1 or 4:1 (distinct subjects : distinct objects). Due to this usage, the property's features are not dependent only on the property, but on the property plus the subject/object where it occurs.
In the relational model, when there is a parent and a child item (one to many), the child item usually has a composite key prefixed with the parent's key, with a distinguishing column appended, e.g., l_orderkey, l_linenumber. In RDF, this is rdfh:hasOrder as a property of the lineitem subject. In SQL, there is no single-part lineitem identifier at all, but in RDF one must be made, since everything must be referenceable with a single value. This does not have to matter very much, as long as it is possible to declare that lineitems will be primarily accessed via their order. It is either this or a scan of all lineitems.
Sometimes a group of lineitems is accessed by the composite foreign key of l_partkey, l_suppkey. There could be a composite index on these. Furthermore, for each l_partkey, l_suppkey in lineitem there exists a partsupp. In an RDF translation, the rdfh:hasPart and rdfh:hasSupplier, when they occur in a lineitem subject, specify exactly one subject of type partsupp. When they occur in a partsupp subject, they are unique as a pair. Again, because names are not explicit as to where they occur and what role they play, the referential properties do not depend only on the name, but on the name plus the enclosing data shape. Declaring and checking all this is conventional in the mainstream, and actually useful for query optimization also.
Take the other example of a social network where the foaf:knows edge is qualified by a date when this edge was created. This may be by reification, or more usually by an "entitized" relationship where the foaf:knows is made into a subject, with the persons who know each other and the date of acquaintance as properties. In a SQL schema, this is a key person1, person2 -> date. In RDF, there are two join steps to go from person1 to person2; in SQL, one. This is eliminated by saying that the foaf:knows entity is usually referenced by the person1 Object or the person2 Object, not by the Subject identifier of the foaf:knows.
This allows making the physical storage be O, S, G -> O2, O3, …. A secondary index with S, G, O still allows access by the mandatory subject identifier. In SQL, a structure like this is called a clustered table; in other words, the rows are stored contiguously in the order of a key that is not necessarily the primary key.
So, identifying a clustering key in RDF can be important.
Identifying whether there are value-based accesses on a given Object without making the Object a clustering key is also important; this is equivalent to creating a secondary index in SQL. In the tradition of homogeneous access by anything, such indexing may be on by default, except if the property is explicitly declared to be of low cardinality. For example, an index on gender makes no sense; the same is most often true of rdf:type. Some properties may have many distinct values (e.g., price), but are still not good for indexing, as this makes for the extreme difference in load time between SQL and the all-indexing RDF.
Identifying whether a column will be frequently updated is another useful thing. This will turn off indexing and use an easy-to-update physical representation. Plus, properties which are frequently updated are best put physically together. This may, for example, guide the choice between row-wise and column-wise representation. A customer's account balance and orders year-to-date would be an example of such properties.
Some short string-valued properties may be frequently returned or used as sorting keys. If the literal is stored via an ID in the dictionary table, every such use means an extra lookup. Non-string literals (numbers, dates, etc.) are always inlined, at least in most implementations, but strings are a special question. Bigdata and early versions of Virtuoso would inline short ones; later versions of Virtuoso would not. So specifying, per property/class combination, a length limit for an inlined string is very high gain and trivial to do. The BSBM explore score at large scales can get a factor of 2 gain just from inlining one label. BSBM is out of its league here, but this is still really true and yields benefits across the board. The simpler the application, the greater the win.
If there are foreign keys, then data should be loaded with the referenced entities first. This makes dimensional clustering possible at load time. If the foreign key is frequently used for accessing the referencing item (for example, if customers are often accessed by country), then loading customers so that customers of the same country end up next to each other can result in great gains. The same applies to a time dimension, which in SQL is often done as a dimension table, but rarely so in linked data. Anyhow, if date is a frequent selection criterion, physically putting items in certain date ranges together can give great gains.
The trick here is not necessarily to index on date, but rather to use zone maps (aka min/max index). If nearby values are together, then just storing a min-max value for thousands of consecutive column values is very compact and fast to check, provided that the rows have nearby values. Actian Vector's (VectorWise) prowess in TPC-H is in part from smart use of date order in this style.
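As a sketch of the mechanism, assuming a date column stored as days since some epoch and zones of a few thousand consecutive rows, the pruning test is no more than a min/max comparison per zone:

#include <cstdint>
#include <vector>

// One zone summarizes a stretch of consecutive rows of a date column,
// stored as days since some epoch: only the min and max are kept.
struct Zone { int32_t min_day; int32_t max_day; };

// Return the zones that can contain rows with lo <= day <= hi;
// every other zone is skipped without reading any of its rows.
std::vector<size_t> zones_to_scan(const std::vector<Zone>& zones,
                                  int32_t lo, int32_t hi) {
    std::vector<size_t> hits;
    for (size_t i = 0; i < zones.size(); i++)
        if (zones[i].max_day >= lo && zones[i].min_day <= hi)
            hits.push_back(i);
    return hits;
}

If the load put nearby dates together, most zones fall entirely outside the queried range and are never touched.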
To recap, the data shapes desiderata from the viewpoint of guiding physical storage are as follows:
(I will use "data shape" to mean "characteristic set," or "set of Subjects subject to the same set of constraints." A Subject belonging to a data shape may be determined either by its rdfs:type
or by the fact of it having, within the processing window, all or some of a set of properties.)
Declaring a property to be unique (as with SQL's UNIQUE constraint) and mandatory (as with SQL's NOT NULL constraint) is good.
Declaring composite foreign keys is also useful, e.g., associating l_partkey, l_suppkey in lineitem with the matching primary key of ps_partkey, ps_suppkey in partsupp. This can be used for checking and for query optimization: looking at l_partkey and l_suppkey as independent properties, the guess would be that there hardly ever exists a partsupp, whereas one does always exist. The XML standards stack also has a notion of a composite key for random access on multiple attributes.
These things have the semantics of "hint for physical storage" and may all be ignored without effect on semantics, at least if the data is constraint-compliant to start with.
These things will have some degree of reference implementation through the evolution of Virtuoso structure awareness, though not necessarily immediately. These are, to the semanticist, surely dirty low-level disgraceful un-abstractions, some of the very abominations the early semanticists abhorred or were blissfully ignorant of when they first raised their revolutionary standard.
Still, these are well-established principles of the broader science of database. SQL does not standardize some of these, nor does it have much need to, as the use of these features is system-specific. The support varies widely and the performance impacts are diverse. However, since RDF excels as a reference model and as a data interchange format, giving these indications as hints to back-end systems cannot hurt, and can make a difference of night and day in load and query time.
As Phil Archer said, the idea of RDF Data Shapes is for an application to say that "it will barf if it gets data that is not like this." An extension is for the data to say what the intended usage pattern is so that the system may optimize for this.
All these things may be learned from static analysis and workload traces. The danger of this is over-fitting a particular profile. This enters a gray area in benchmarking. For big data, if RDF is to be used as the logical model and the race is about highest absolute performance, never mind what the physical model ends up being, all this and more is necessary. And if one is stretching the envelope for scale, the race is always about highest absolute performance. For this reason, these things will figure at the leading edge with or without standardization. I would say that the build-up of experience in the RDBMS world is sufficient for these things to be included as hints in a profile of data shapes. The compliance cost will be nil if these are ignored, so for the W3C, these will not make the implementation effort for compliance with an eventual data shapes recommendation prohibitive.
The use case is primarily the data warehouse to go. If many departments or organizations publish data for eventual use by their peers, users within the organization may compose different combinations of extractions for different purposes. Exhaustive indexing of everything by default makes the process slow and needlessly expensive, as we have seen. Much of such exploration is bounded by load time. Federated approaches for analytics are just not good, even though they may work for infrequent lookups. If datasets are a commodity to be plugged in and out, the load and query investment must be minimized without the user/DBA having to run workload analysis and manual schema optimization. Therefore, bundling guidelines such as these with data shapes in a dataset manifest can do no harm and can in cases provide 10-50x gains in load speeds and 2-4x in space consumption, not to mention unbounded gains in query time, as good and bad plans easily differ by 10-100x, especially in analytics.
So, here is the pitch:
To be continued...
The first part of the talk was under the heading of the promise and the practice. The promise we know well and find no fault with: Schema-last-ness, persistent unique identifiers, self-describing data, some but not too much inference. The applications usually involve some form of integration and often have a mix of strictly structured content with semi-structured or textual content.
These values are by now uncontroversial and embraced by many; however, most instances of this embracing do not occur in the context of RDF as such. For example, the big online systems on the web: all have some schema-last (key-value) functionality. Applications involving long-term data retention have diverse means of having persistent IDs and self description, from UUIDs to having the table name in a column so that one can tell where a CSV dump came from.
The practice involves competing with diverse alternative technologies: SQL, key-value, information retrieval (often Lucene-derived). In some instances, graph databases occur as alternatives: Young semanticist, do or die.
In this race, linked data is often the prettiest and most flexible, but gets a hit on different aspects of performance and scalability. This is a database gig, and database is a performance game; make no mistake.
After these preliminaries we come to the "RDF tax," or the more or less intrinsic overheads of describing all as triples. The word "triple" is used by habit. In fact, we nearly always talk about quads, i.e., subject-predicate-object-graph (SPOG). The next slide is provocatively titled the Bane of the Triple, and is about why having all as triples is, on the surface, much like relational, except it makes life hard, where tables make it at least manageable, if still not altogether trivial.
The very first statement on the tax slide reads "90% of bad performance comes from non-optimal query plans." If one does triples in the customary way (i.e., a table of quads plus dictionary tables to map URIs and literal strings to internal IDs), one incurs certain fixed costs.
These costs are deemed acceptable by users who deploy linked data. If these costs were not acceptable, the proof of concept would have already disqualified linked data.
The support cases that come my way are nearly always about things taking too much time. Much less frequently are these about something unambiguously not working. Database has well-defined semantics, so whether something works or not is clear cut.
So, support cases are overwhelmingly about query optimization. The problems fall in two categories:
Getting no plan at all or getting a clearly wrong result is much less frequent.
If the RDF overheads incurred with a good query plan were show stoppers, the show would have already stopped.
So, let's look at this in more detail; then we will talk about the fixed overheads.
The join selectivity of triple patterns is correlated. Some properties occur together all the time; some occur rarely; some not at all. Some property values can be correlated, e.g., order number and order date. Capturing these by sampling in a multicolumn table is easy; capturing this in triples would require doing the join in the cost model, which is not done, since it would further extend compilation times. When everything is a join, selectivity estimation errors build up fast. When everything is a join, the space of possible query plans explodes as compared to tables; thus, while the full plan space can be covered with 7 tables (7! = 5,040 join orders), it cannot be covered with 18 triple patterns (18! is over 6 × 10^15). And that is just the factorial number of permutations; with the different join types (index/hash) and the different compositions of the hash build side, it is much worse, in some nameless outer space fringe of non-polynomiality.
TPC-H can be run with success because the cost model hits the right plan every time. The primary reason for this is the fact that the schema and queries unambiguously suggest the structure, even without foreign key declarations. The other reason is that with a handful of tables, all plans can be reviewed, and the cost model reliably tells how many rows will result from each sequence of operations.
Try this with triples; you will know what I mean.
Now, some people have suggested purely rule-based models of SPARQL query compilation. These are arguably faster to run and more predictable. But the thing that must be done, yet will not be done with these, is the right trade-off between index and hash. This is the crux of the matter, and without this, one can forget about anything but lookups. The choice depends on reliable estimation of cardinality (number of rows, number of distinct keys) on either side of the join. Quantity, not pattern matching.
Well, many linked data applications are lookups. The graph database API world is sometimes attractive because it gives manual control. Map reduce in the analytical space is sometimes attractive for the same reason.
On the other hand, query languages also give manual control, but then this depends on system specific hints and cheats. People are often black and white: Either all declarative or all imperative. We stand for declarative, but still allow physical control of plan, like most DBMS.
To round off, I will give a concrete example:
{ ?thing rdfs:label ?lbl .
?thing dc:title ?title .
?lbl bif:contains "gizmo" .
?title bif:contains "widget" .
?thing a xx:Document .
?thing dc:date ?dt .
FILTER ( ?dt > "2014-01-01"^^xsd:date )
}
There are two full text conditions, one date, and one class, all on the same subject. How do you do this? Most selective text first, then get the data and check, then check the second full text given the literal and the condition, then check the class? Wrong. If widgets and gizmos are both frequent and most documents new, this is very bad, because using a text index to check for a specific ID having a specific string is not easily vectorable. So, the right plan is: Take the more selective text expression, then check the date and class for the results, and put the ?things in a hash table. Then do the less selective text condition, and drop the ones that are not in the hash table. Easily 10x better. Simple? In the end yes, but you do not know this unless you know the quantities.
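In code terms, the winning plan is a hash-filtered intersection of the two text-index result streams. The sketch below uses made-up stand-in functions for the text index and for the date/class check; it shows the shape of the plan, not an actual engine API.

#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

// Stand-ins for the text index and for the date/class checks of the query
// above; a real engine would read these from its full-text index and quads.
std::vector<int64_t> text_hits_gizmo()  { return {1, 5, 9}; }        // selective word
std::vector<int64_t> text_hits_widget() { return {2, 5, 7, 9, 12}; } // frequent word
bool passes_date_and_class(int64_t)     { return true; }

int main() {
    // Build side: the more selective text condition, restricted by date and class.
    std::unordered_set<int64_t> build;
    for (int64_t t : text_hits_gizmo())
        if (passes_date_and_class(t)) build.insert(t);

    // Probe side: stream the less selective text hits and keep only the
    // ?things already in the hash table; no per-ID text-index probes.
    for (int64_t t : text_hits_widget())
        if (build.count(t)) std::cout << "?thing " << t << "\n";
}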
This gives the general flavor of the problem. Doing this with TPC-H in RDF is way harder, but you catch my drift.
Each individual instance is do-able. Having closer and closer alignment between reality and prediction will improve the situation indefinitely, but since the space is as good as infinite there cannot be a guarantee of optimality except for toy cases.
The Gordian Knot shall not be defeated with pincers but by the sword.
We will come to this in a bit.
Now, let us talk of the fixed overheads. The embarrassments are in the query optimization domain; the daily grind, relative cost, and provisioning are in this one.
The overheads come from indexing everything, from the dictionary of URIs and literals, and from the self-join per attribute.
These all fall under the category of having little to no physical design room.
In the indexing everything department, we load 100 GB TPC-H in 15 minutes in SQL with ordering only on primary keys and almost no other indexing. The equivalent with triples is around 12 hours. This data can be found on this blog (TPC-H series and Meeting the Challenges of Linked Data in the Enterprise). This is on the order of confusing a screwdriver with a hammer. If the nail is not too big, the wood not too hard, and you hit it just right — the nail might still go in. The RDF bulk load is close to the fastest possible given the general constraints of what it does. The same logic is used for the record-breaking 15 minutes of TPC-H bulk load, so the code is good. But indexing everything is just silly.
The second, namely the dictionary of URIs and literals, is a dual edge. I talked to Bryan Thompson of SYSTAP (Bigdata RDF store) in D.C. at the ICDE there. He said that they do short strings inline and long ones via dictionary. I said we used to do the same but stopped in the interest of better compression. What is best depends on workload and working-set-to-memory ratio. But if you must make the choice once and for all, or at least as a database-wide global setting, you are between a rock and a hard place. Physical vs. logical design, again.
The other aspect of this is the applications that do regexps on URI strings or literals. Doing this is like driving a Formula 1 race in reverse gear. Use a text index. Always. This is why most implementations have one even though SPARQL itself makes no provisions for this. If you really need regexps, and on supposedly opaque URIs at that, tokenize them and put them in a text index as a text literal. Or if an inverted-file-word index is really not what you need, use a trigram one. So far, nobody has wanted one hard enough for us to offer this, even though this is easy enough. But special indices for special data types (e.g., chemical structure) are sometimes wanted, and we have a generic solution for all this, to be introduced shortly on this blog. Again, physical design.
I deliberately name the self-join-per-attribute point last, even though this is often the first and only intrinsic overhead that is named. True, if the physical model is triples, each attribute is a join against the triple table. Vectored execution and right use of hash-join help, though. The Star Schema Benchmark SQL to SPARQL gap is only 2.5x, as documented last year on this blog. This makes SPARQL win by 100+x against MySQL and lose by only 0.8x against column store pioneer MonetDB. Let it be said that this is so far the best case and that the gap is wider in pretty much all other cases. This gap is well and truly due to the self-join matter, even after the self-joins are done vectored, local, ordered; in one word, right. The literal and URI translation matter plays no role here. The needless indexing hurts at load but has no effect at query time, since none of the bloat participates in the running. Again, physical design.
Triples are done right, so?
In the summer of 2013, after the Star Schema results, it became clear that maybe further gains could be had and query optimization made smoother and more predictable, but that these would be paths of certain progress but with diminishing returns per effort. No, not the pincers; give me the sword. So, between fall 2013 and spring 2014, aside from doing diverse maintenance, I did the TPC-H series. This is the proficiency run for big league databases; the America's Cup, not a regatta on the semantic lake.
Even if the audience is principally Linked Data, the baseline must be that of the senior science of SQL.
It stands to reason and has been demonstrated by extensive experimentation at CWI that RDF data, by and large, has structure. This structure will carry linked data through the last mile to being a real runner against the alternative technologies (SQL, IR, key value) mentioned earlier.
The operative principles have been mentioned earlier and are set forth on the slides. In forthcoming articles I will display some results.
One important proposal for structure awareness was by Thomas Neumann in an RDF3X paper introducing characteristic sets. There, the application was creation of more predictable cost estimates. Neumann correctly saw this as possibly the greatest barrier to predictable RDF performance. Peter Boncz and I discussed the use of this for physical optimization once when driving back to Amsterdam from a LOD2 review in Luxembourg. Pham Minh Duc of CWI did much of the schema discovery research, documented in the now published LOD2 book (Linked Open Data -- Creating Knowledge Out of Interlinked Data). The initial Virtuoso implementation had to wait for the TPC-H and general squeezing of the quads model to be near complete. It will likely turn out that the greatest gain of all with structure awareness will be bringing optimization predictability to SQL levels. This will open the whole bag of tricks known to data warehousing to safe deployment for linked data. Of course, much of this has to do with exploiting physical layout; hence it also needs the physical model to be adapted. Many of these techniques have high negative impact if used in the wrong place; hence the cost model must guess right. But they work in SQL and, as per Thomas Neumann's initial vision, there is no reason why these would not do so in a schema-less model if adapted in a smart enough manner.
All this gives rise to some sociological or psychological observations. Jens Lehmann asked me why now, why not earlier; after all, over the years many people have suggested property tables and other structured representations. This is now because there are no further breakthroughs to be had within an undifferentiated physical model.
For completeness, we must here mention other approaches to alternative, if still undifferentiated, physical models. A number of research papers mention memory-only, pointer-based (i.e., no index, no hash-join) implementations of triples or quads. Some of these are on graph processing frameworks, some stand-alone. Yarc Data is a commercial implementation that falls in this category. These may have higher top speeds than column stores, even after all vectoring and related optimizations. However the space utilization is perforce larger than with optimum column compression and this plus the requirement of 100% in memory makes these more expensive to scale. The linked data proposition is usually about integration, and this implies initially large data even if not all ends up being used.
The graph analytics, pointer-based item will be especially good for a per-application extraction, as suggested by Oracle in their paper at GRADES 13. No doubt this will come under discussion at LDBC, where Oracle Labs is now a participant.
But back to physical model. What we have in mind is relational column store — multicolumn-ordered column-wise compressed tables — a bit like Vertica and Virtuoso in SQL mode for the regular parts and quads for the rest. What is big is regular, since a big thing perforce comes from something that happens a lot, like click streams, commercial transactions, instrument readings. For the 8-lane-motorway of regular data, you get the F1 racer with the hardcore best in column store tech. When the autobahn ends and turns into the mountain trail, the engine morphs into a dirt bike.
This is complex enough, and until all the easy gains have been extracted from quads, there is little incentive. Plus this has the prerequisite of quads done right, plus the need for top of the line relational capability for not falling on your face once the speedway begins.
Steve Buxton of MarkLogic gave a talk right before mine. Coming from a document-centric world, it stands to reason that MarkLogic would have a whole continuum of different mixes between SPARQL and document oriented queries. Steve correctly observed that some users found this great; others found this a near blasphemy, an unholy heterodoxy of confusing distinct principles.
This is our experience as well, since usage of XML fragments in SPARQL with XPath and such things in Virtuoso is possible but very seldom practiced. This is not the same as MarkLogic, though, as MarkLogic is about triples-in-documents, and the Virtuoso take is more like documents-in-triples. Not to mention that use of SQL and stored procedures in Virtuoso is rare among the SPARQL users.
The whole thing about the absence of physical design in RDF is a related, but broader instance of such purism.
In my talk, I had a slide titled The Cycle of Adventure, generally philosophizing on the dynamics of innovation. All progress begins with an irritation with the status quo; to mention a few examples: the No-SQL rebellion; the rejection of parallel SQL database in favor of key-value and map-reduce; the admission that central schema authority at web scale is impossible; the anti-ACID stance when having wide-area geographies to deal with. The stage of radicalism tends to discard the baby with the bathwater. But when the purists have their own enclave, free of the noxious corruption of the rejected world, they find that life is hard and defects of human character persist, even when all subscribe to the same religion. Of course, here we may have further splinter groups. After this, the dogma adapts to reality: the truly valuable insights of the original rebellion gain in appreciation, and the extremism becomes more moderate. Finally there is integration with mainstream, which becomes enriched by new content.
By the time the term Linked Data came to broad use, the RDF enterprise had its break-away colonies that started to shed some of the initial zeal. By now, we have the last phase of reconciliation in its early stages.
This process is in principle complete when linked data is no longer a radical bet, but a technology to be routinely applied to data when the nature of the data fits the profile. The structure awareness and other technology discussed here will mostly eliminate the differential in deployment cost.
The spreading perception of an expertise gap in this domain will even-out the cost in terms of personnel. The flexibility gains that were the initial drive for the movement will be more widely enjoyed when these factors fuel broader adoption.
To help this along, we have LDBC, the Linked Data Benchmark Council, with the agenda of creating industry consensus on measuring progress across the linked data and graph DB frontiers. I duly invited MarkLogic to join.
There were many other interesting conversations at the conference; I will comment on these later.
To be continued...
The kernel settings have now been adjusted to allow more mmaps.
At this point, we notice that the dataset is missing the implied types of products; i.e., the most specific type is given but its superclasses are not directly associated with the product. We have always run this with this unique inference materialized, which is also how the data generator makes the data, with the right switch. But the switch was not used. So a further 10 Gt (Giga-triples) are added, by running a SQL script to make the superclasses explicit.
At this point, we run BSBM explore for the first time. To what degree does the 37.5 Gt predict the 500 Gt behavior? First, there is an overflow that causes a query plan cost to come out negative if the default graph is specified. This is a bona fide software bug you don't get unless a sample is quite large. Also, we note that starting the databases takes a few minutes due to disk. Further, the first query takes a long time to compile, again because of sampling the database for overall statistics.
The statistics are therefore gathered by running a few queries, and then saved. Subsequent runs will reload the stats from the file system, saving some minutes of start time. There are functions for this, stat_import and stat_export; these are used for a similar purpose by some users.
On day 10, Wednesday August 20, we have some results of BSBM explore.
Then, we get into BSBM updates. The BSBM generator makes an update dataset, but it cannot be made large enough. The BSBM test driver suite is by now hated and feared in equal measure. Is it bad in and of itself? Depends. It was certainly not made for large data. Anyway, no fix will be attempted this time. Instead, a couple of SQL procedures are made to drive a random update workload. These can run long enough to get a steady state with warm cache, which is what any OLTP measurement needs.
On day 12, some updates are measured, with a one hour ramp-up to steady state, but these are not quite the right mix, since these are products only, and the mix needs to contain offers and reviews also. The first steady-state rate was 109 Kt/s, a full 50x less than the bulk load, but then this was very badly bound by latency. So, the updates are adjusted to have more variety. The final measurement was on day 17. Now the steady-state rate is 256 Kt/s, which is better but still quite bound by network. By adding diversity to the dataset, we get slammed by a sharp rise in warm-up time (now 2 hours to be at 230 Kt/s), at which point we launch the explore mix to be timed during update. Time is short, and we do not want to find out exactly how long it takes to get the plateau in insert rate. As it happens, the explore mix is hardly slowed down by the updates, but the updates get hit worse, so that the rate goes to about 1/3 of what it was, then comes back up when the explore is finished. Finally, half an hour after this, there is a steady state of 263 Kt/s update rate.
Of course, the main object of the festivities is still the business intelligence (BI) mix. This is our (specifically, Orri's) own invention from years back, subsequently formulated in SPARQL by FU Berlin (Andreas Schultz). Well, it is already something to do big joins with 150 Gt, all on index and vectored random access, as was done in January 2013, the last time results were published on the CWI cluster. You may remember that there was an aborted attempt in January 2014. So now, with the LOD2 end date under two weeks away, we will take the BI racer out for a spin with 500 Gt. This is now a very different proposition from Jan 2013, as we have by now done the whole TPC-H work documented on this blog. This serves to show, inter alia, that we can run with the best in the much bigger and harder mainstream database sports. The full benefits of this will be realized for the semantic data public still this year, so this is more than personal vanity.
So we will see. The BI mix is not exactly TPC-H, but what is good for one is good for the other. Checking that the plans are good on the 37 Gt scale model is done around day 12. On day 13, we try this on the larger cluster. You never know — pushing the envelope, even when you know what you are doing and have written the whole thing, is still a dive in the fog. Claiming otherwise would be a lie lacking credibility. The iceberg which first emerges is overflow and partition skew. Well, there can be a lot of messages if all messages go via the same path. So we make the data structure different and retry, and now die from out of memory. On the scale model, this looks like a little imbalance you don't bother to notice; at 13x scale, this kills. So, as is the case with most database problems, the query plan is bad. Instead of using a PSOG index, it uses a POSG index, and there is a constant for O. Partitioning is on either S or O, whichever is first. Not hard to fix, but still needs a cost-model adjustment to penalize low-cardinality partition columns. This is something you don't get with TPC-H, where there are hardly any indices. Once this is fixed there are other problems, such as Q5, which we ended up leaving out. The scale model is good; the large one does not produce a plan, because some search-space corner is visited that is not visited in the scale model, due to different ratios of things in the cost model. Could be a couple of days to track; this is complex stuff. So we dropped it. It is not a big part of the metric, and its omission is immaterial to the broader claim of handling 500 Gt in all safety and comfort. The moral is: never get stuck; only do what is predictable, insofar as anything in this shadowy frontier is such.
So, on days 15 and 16, the BI mix that is reported was run. The multiuser score was negatively impacted by memory skew, so some swapping on one of the nodes, but the run finished in about 2 hours anyway. The peak of transient memory consumption is another thing that you cannot forecast with exact precision. There is no model for that; the query streams are in random order, and you just have to try. And it is a few hours per iteration, so you don't want to be stuck doing that either. A rerun would get a higher multiuser BI score; maybe one will be made but not before all the rest is wrapped up.
Now we are talking 2 hours, versus 9 hours with the 150 Gt set back in January 2013. So 3.3x the data, 4.5x less time, 1.5x the gear. This comes out at one order of magnitude. With a better score from better memory balance and some other fixes, a 15x improvement for BSBM BI is in the cards.
The final explore runs were made on day 18, while writing the report to be published at the LOD2 deliverables repository. The report contains in depth discussion on the query plans and diverse database tricks and their effectiveness.
The overall moral of this trip into these uncharted spaces is this: Expect things to break. You have to be the designer and author of the system to take it past its limits. You will cut it or you won't, and nobody can do anything about it, not with the best intentions, nor even with the best expertise, which both were present. This is true of the last minute daredevil stuff like this; if you have a year full time instead of the last 20 days of a project, all is quite different, and these things are more leisurely. This might then become a committee affair, though, which has different problems. In the end, the Virtuoso DBMS has never thrown anything at us we could not handle. The uncertainty in trips of this sort is with the hardware platform, of which we had to replace 2 units to get on the way, and with how fast you can locate and fix a software problem. So you pick the quickest ones and leave the uncertain aside. There is another category of rare events like network failures that in theory cannot happen. Yet they do. So, to program a cluster, you have to have some recovery things for these. We saw a couple of these along the way. Duplication of these can take days, and whether this correlates with specific links or is a bona fide software thing is time consuming to prove, and getting into this is a sure way to lose the race. These seem to be load peaks outside of steady-state; steady-state is in fact very steady once it is there. Except at the start, network glitches were not a big factor in these experiments. The bulk of these went away after replacing a machine. After this we twice witnessed something that cannot exist but knew better than to get stuck with that. Neither incident happened again. This is days of running at a cross sectional 1 GB/s of traffic. These are the truly unpredictable, and, in a crash course like this, can sink the whole gig no matter how good you are.
Thanks are due to CWI and especially Peter Boncz for providing the race track as well as advice and support.
In the next installments of this series, we will look at how schema and characteristic sets will deliver the promise of RDF without its cost. All the experiments so far were done with a quads table, as always before. So we could say that the present level is close to the limit of the achievable within this physical model. The future lies beyond the misconception of triples/quads as primary physical model.
To be continued...
Now, from last time, we know to generate the data without 10 GB of namespace prefixes per file and with many short files. So we have 1.5 TB of gzipped data in 40,000 files, spread over 12 machines. The data generator has again been modified; this time, generation took about 4 days. Also from last time, we know to treat small integers specially when they occur as partition keys: 1 and 2 are very common values, and skew becomes severe if they all go to the same partition. Hence, consecutive small INTs each go to a different partition, but for larger values the low 8 bits are ignored when picking the partition, which is good for compression: consecutive values then fall in consecutive places, just not for small INTs. Another uniquely brain-dead feature of the BSBM generator has also been rectified: when generating multiple files, the program would put things in files in a round-robin manner, instead of putting consecutive numbers in consecutive places, which is how every other data generator or exporter does it. This impacts bulk load locality, and as you, dear reader, ought to know by now, performance comes from (1) locality and (2) parallelism.
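A minimal sketch of such a partition function follows; the threshold for a "small" integer and the exact fields used are illustrative, not Virtuoso's actual constants.

#include <cstdint>

// Pick a partition for an integer partition-key value. Small values (1 and 2
// are very common) are spread one value per partition; for larger values the
// low 8 bits are dropped, so runs of consecutive values stay in one partition,
// which keeps consecutive values in consecutive places and compresses well.
uint32_t int_partition(int64_t key, uint32_t n_partitions) {
    const int64_t small_limit = 256;                  // illustrative threshold
    uint64_t h = (key >= 0 && key < small_limit)
                     ? (uint64_t)key                  // consecutive smalls spread out
                     : ((uint64_t)key >> 8);          // drop low 8 bits otherwise
    return (uint32_t)(h % n_partitions);
}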
The machines are similar to last time: each a dual E5 2650 v2 with 256 GB RAM and QDR InfiniBand (IB). No SSD this time, but a slightly higher clock than last time; anyway, a different set of machines.
The first experiment is with triples, so no characteristic sets, no schema.
So, first day (Monday), we notice that one cannot allocate more than 9 GB of memory. Then we figure out that it cannot be done with malloc, whether in small or large pieces, but it can with mmap. Ain't seen that before. One day shot. Then, towards the end of day 2, load begins. But it does not run for more than 15 minutes before a network error causes the whole thing to abort. All subsequent tries die within 15 minutes. Then, in the morning of day 3, we switch from IB to Gigabit Ethernet (GigE). For loading this is all the same; the maximal aggregate throughput is 800 MB/s, which is around 40% of the nominal bidirectional capacity of 12 GigE's. So, it works better, for 30 minutes, and one can even stop the load and do a checkpoint. But after resuming, one box just dies; does not even respond to ping. We change this to another. After this, still running on GigE, there are no more network errors. So, at the end of day 3, maybe 10% of the data are in. But now it takes 2h21min to make a checkpoint, i.e., make the loaded data durable on disk. One of the boxes manages to write 2 MB/s to a RAID-0 of three 2 TB drives. Bad disk, seen such before. The data can however be read back once the write is finally done.
Well, this is a non-starter. So, by mid-day of day 4, another machine has been replaced. Now writing to disk is possible within expected delays.
In the afternoon of day 4, the load rate is about 4.3 Mega-triples (Mt) per second, all going in RAM.
In the evening of day 4, adding more files to load in parallel increases the load rate to between 4.9 and 5.2 Mt/s. This is about as fast as this will go, since the load is not exactly even. This comes from the RDF stupidity of keeping an index on everything, so even object values where an index is useless get indexed, leading to some load peaks. For example, there is an index on POSG for triples where the predicate is rdf:type and the object is a common type. Use of characteristic sets will stop this nonsense.
But let us not get ahead of the facts: At 9:10 PM of day 4, the whole cluster goes unreachable. No, this is not a software crash or swapping; this also affects boxes on which nothing of the experiment was running. A whole night of running is shot.
A previous scale model experiment of loading 37.5 Gt in 192 GB of RAM, paging to a pair of 2 TB disks, has been done a week before. This finishes in time, keeping a load rate of above 400 Kt/s on a 12-core box.
At 10AM on day 5 (Friday), the cluster is rebooted; a whole night's run missed. The cluster starts and takes about 30 minutes to get to its former 5 Mt/s load rate. We now try switching the network back to InfiniBand. The whole ethernet network seemed to have crashed at 9PM on day 4. This is of course unexplained but the experiment had been driving the ethernet at about half its cross-sectional throughput, so maybe a switch crashed. We will never know. We will now try IB rather than risk this happening again, especially since if it did repeat, the whole weekend would be shot, as we would have to wait for the admin to reboot the lot on Monday (day 8).
So, at noon on day 5, the cluster is restarted with IB. The cruising speed is now 6.2 Mt/s, thanks to the faster network. The cross sectional throughput is about 960 MB/s, up from 720 MB/s, which accounts for the difference. CPU load is correspondingly up. This is still not full platform since there is load unbalance as noted above.
At 9PM on day 5, the rate is around 5.7 Mt/s, with the peak node at 1500% CPU out of a possible 1600%. The next one is under 800%, which is just to show what it means to index everything. Specifically, the node that has the highest CPU is the one in whose partition the bsbm:offer class falls, so that there is a local peak, since one of every 9 or so triples says that something is an offer. The stupidity of the triple store is to index garbage like this to begin with. The reason why the performance is still good is that a POSG index where P and O are fixed and the S is densely ascending is very good, with everything but the S represented as run lengths and the S as bitmaps. Still, no representation at all is better for performance than even the most efficient representation.
The journey consists of 3 different parts. At 10PM, the 3rd and last part is started. This data has a different shape, with more literals, but the load is more even. The cruising speed is 4.3 Mt/s, down from 6.2.
The last stretch of the data is about reviews. This stretch of the data has less skew. So we increase parallelism, running 8 x 24 files at a time. The load rate goes above 6.3 Mt/s.
At 6:45 in the morning of day 6, the data is all loaded. The count of triples is 490.0 billion. If the load were done in a single stretch without stops and reconfiguration, it would likely go in under 24h. The average rate for a 4 hour sample between midnight and 4AM of day 6 is 6.8 Mt/s. The resulting database files add up to 10.9 TB, with about 20% of the volume in unallocated pages.
At this time, noon of day 6, we find that some cross-partition joins need more distinct pieces of memory than the default kernel settings allow per process. A large number of partitions makes a large number of sometimes long messages which makes many mmaps. So we will wait until morning of day 8 (Monday) for the administrator to set these. In the meantime, we analyze the behavior of the workload on the 37 Gt scale model cluster on my desktop.
To be continued...
In a nutshell, LOD2 went like this:
Triples were done right, taking the best of the column store world and adapting it to RDF. This is now in widespread use.
SQL was done right, as I have described in detail in the TPC-H series. This is generally available as open source in v7fasttrack. SQL is the senior science and a runner-up like sem-tech will not carry the day without mastering this.
RDF is now breaking free of the triple store. RDF is a very general, minimalistic way of talking about things. It is not a prescription on how to do database. Confusing these two things has given rise to RDF’s relative cost against alternatives. To cap off LOD2, we will have the flexibility of triples with the speed of the best SQL.
In this post we will look at accomplishments so far and outline what is to follow during August. We will also look at what in fact constitutes the RDF overhead, why this is presently so, and why this does not have to stay thus.
This series will be of special interest to anybody concerned with RDF efficiency and scalability.
At the beginning of LOD2, I wrote a blog post discussing the RDF technology and its planned revolution in terms of the legend of Perseus. The classics give us exemplars and archetypes, but actual histories seldom follow them one-to-one; rather, events may have a fractal nature where subplots reproduce the overall scheme of the containing story.
So it is also with LOD2: The Promethean pattern of fetching the fire (state of the art of the column store) from the gods (the DB world) and bringing it to fuel the campfires of the primitive semantic tribes is one phase, but it is not the totality. This is successfully concluded, and Virtuoso 7 is widely used at present. Space efficiency gains are about 3x over the previous, with performance gains anywhere from 3 to 100x. As pointed out in the Star Schema Benchmark series (part 1 and part 2), in the good case one can run circles in SPARQL around anything but the best SQL analytics databases.
In the larger scheme of things, this is just preparation. In the classical pattern, there is the call or the crisis: Presently this is that having done triples about as right as they can be done, the mediocre in SQL can be vanquished, but the best cannot. Then there is the actual preparation: Perseus talking to Athena and receiving the shield of polished brass and the winged sandals. In the present case, this is my second pilgrimage to Mount Database, consisting of the TPC-H series. Now, the incense has been burned and libations offered at each of the 22 stations. This is not reading papers, but personally making one of the best-ever implementations of this foundational workload. This establishes Virtuoso as one of the top-of-the-line SQL analytics engines. The RDF public, which is anyway the principal Virtuoso constituency today, may ask what this does for them.
Well, without this step, the LOD2 goal of performance parity with SQL would be both meaningless and unattainable. The goal of parity is worth something only if you compare the RDF contestant to the very best SQL. And the comparison cannot possibly be successful unless it incorporates the very same hard core of down-to-the-metal competence the SQL world has been pursuing now for over forty years.
It is now time to cut the Gorgon’s head. The knowledge and prerequisite conditions exist.
The epic story is mostly about principles. If it is about personal combat, the persons stand for values and principles rather than for individuals. Here the enemy is actually an illusion, an error of perception, that has kept RDF in chains all this time. Yes, RDF is defined as a data model with triples in named graphs, i.e., quads. If nothing else is said, an RDF Store is a thing that can take arbitrary triples and retrieve them with SPARQL. The naïve implementation is to store things as rows in a quad table, indexed in any number of ways. There have been other approaches suggested, such as property tables or materialized views of some joins, but these tend to flush the baby with the bathwater: If RDF is used in the first place, it is used for its schema-less-ness and for having global identifiers. In some cases, there is also some inference, but the matter of schema-less-ness and identifiers predominates.
We need to go beyond a triple table and a dictionary of URI names while maintaining the present semantics and flexibility. Nobody said that physical structure needs to follow this. Everybody just implements things this way because this is the minimum that will in any case be required. Combining this with a SQL database for some other part of the data/workload hits basically insoluble problems of impedance mismatch between the SQL and SPARQL type systems, maybe using multiple servers for different parts of a query, etc. But if you own one of the hottest SQL racers in DB city and can make it do anything you want, most of these problems fall away.
The idea is simple: Put the de facto rectangular part of RDF data into tables; do not naively index everything in places where an index gives no benefit; keep the irregular or sparse part of the data as quads. Optimize queries according to the table-like structure, as that is where the volume is and where getting the best plan is a make or break matter, as we saw in the TPC-H series. Then, execute in a way where the details of the physical plan track the data; i.e., sometimes the operator is on a table, sometimes on triples, for the long tail of exceptions.
In the next articles we will look at how this works and what the gains are.
These experiments will for the first time showcase the adaptive schema features of the Virtuoso RDF store. Some of these features will be commercial only, but the interested will be able to reproduce the single server experiments themselves using the v7fasttrack open source preview. This will be updated around the second week of September to give a preview of this with BSBM and possibly some other datasets, e.g., Uniprot. Performance gains for regular datasets will be very large.
To be continued...
Query optimization is hard. It is a set of mutually interacting tricks and special cases. Execution is also hard, but there the tricks do not interact quite as much or as unpredictably. So, if there is a few percent of score to be had from optimization of either execution or query, I will take execution first. It is less likely to break things and will probably benefit a larger set of use cases.
As we see from the profile in the previous article, hash join is the main piece of execution in TPC-H. So between the article on late projection and the first result preview, I changed the hash table used in HASH JOIN and GROUP BY from cuckoo to linear.
Let's see how the hash tables work: Cuckoo hash is a scheme where an entry can be in one of two possible places in the table. If a new entry is inserted and either of the possible places is unoccupied, it goes there. If both are occupied, it could be that one contains an entry whose other possible location is free -- and then that entry may be relocated. Thus an insert may push the previous occupant of the place somewhere else, which in turn may push another, and so on. It may happen that an insert is still not possible, in which case the entry to insert goes into an exceptions list.
To look up an entry, you get a hash number, and use different fields of it to pick the two places. Look in one, then the other, then the exceptions. If there is no match and the table is reasonably close to capacity, you will have looked in at least 3 widely separated places to determine the absence of a match. In practice, the hash table consists of a prime number of distinct arrays of a fixed size (partitions), and each partition has its own exception list. A modulo of the hash number picks the array, then two further modulos of different parts of the number pick the places in the array.
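To make the lookup sequence concrete, here is a minimal sketch of a cuckoo-style probe over one partition, with two candidate slots picked from different fields of the hash number and an exception list as the third stop. The structure and field choices are illustrative, not the actual Virtuoso data layout.

#include <stdint.h>
#include <stddef.h>

/* Illustrative cuckoo-style lookup: two candidate slots derived from
   different fields of the hash number, plus an exception list for the
   partition.  Not the actual Virtuoso structures. */
typedef struct exc_s { int64_t key; struct exc_s *next; } exc_t;

typedef struct
{
  int64_t *slots;      /* slot array of one partition, 0 = empty */
  size_t   n_slots;
  exc_t   *exceptions; /* entries that could not be placed */
} cuckoo_part_t;

static int
cuckoo_find (cuckoo_part_t *p, int64_t key, uint64_t h)
{
  size_t place1 = (h & 0xffffffffull) % p->n_slots;  /* low field of hash */
  size_t place2 = (h >> 32) % p->n_slots;            /* high field of hash */
  if (p->slots[place1] == key || p->slots[place2] == key)
    return 1;
  for (exc_t *e = p->exceptions; e; e = e->next)     /* third place to look */
    if (e->key == key)
      return 1;
  return 0;  /* up to 3 widely separated reads to prove absence */
}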
In most cases in TPC-H, the hash joins are selective; i.e., most items on the probe side find no match in the hash table.
So, quite often you have 3 cache misses to show that there is no hit. This is, at least in theory, quite bad.
There are Bloom filters before the hash table. The Bloom filter will prune most of the probes that would miss. A Bloom filter is an array of bits. Given a hash number, the Bloom filter will very efficiently tell you whether the entry is sure not to be in the hash table. If the Bloom filter says it can be in the hash table, you must look.
In the Virtuoso case, for each entry in the hash table, the Bloom filter has 8 bits. The Bloom check uses a field of the hash number to pick a 64-bit word from the Bloom filter. Then different fields of the hash number are used to set up to 4 bits in a 64-bit bit-mask. When building the hash table, the masks are OR-ed into the Bloom filter. When probing, before looking in the hash table, the system checks whether the bits corresponding to the hash number are all on in the appropriate word. If they are not, the hash lookup is sure to miss.
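A sketch of the check just described, assuming hypothetical field extraction; the real BF_WORD and BF_MASK macros may pick bits differently, but the shape is the same: one filter word fetched per probe, up to 4 bits tested.

#include <stdint.h>

/* Sketch of an 8-bits-per-entry Bloom check: one field of the hash number
   picks a 64-bit word of the filter, four other fields pick up to 4 bit
   positions in that word.  Field boundaries are illustrative. */
static inline uint64_t
bloom_mask (uint64_t h)
{
  return (1ull << (h & 63))
       | (1ull << ((h >> 6) & 63))
       | (1ull << ((h >> 12) & 63))
       | (1ull << ((h >> 18) & 63));
}

static inline int
bloom_maybe_contains (const uint64_t *bf, uint64_t bf_words, uint64_t h)
{
  uint64_t word = bf[(h >> 24) % bf_words];  /* one cache line touched */
  uint64_t mask = bloom_mask (h);
  return (word & mask) == mask;              /* 0 means a certain miss */
}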
Most expositions of Bloom filters talk about setting two bits for every value. With two bits set, we found 8 bits-per-value to work best: more bits make a larger filter that misses the cache more; fewer bits make too many collisions, and the Bloom filter produces too many false positives. A significant finding is that with 8 bits-per-value, setting 4 bits instead of 2 makes the filter twice as selective. This simple trick cuts the number of hash lookups for items that passed the Bloom filter in half in many selective hash joins. Examples are the many joins of lineitem with part or supplier where there is a condition on the smaller table.
Still, even with Bloom filters, a cuckoo hash will make too many cache misses.
So, enter linear hash. The idea is simple: The hash number picks a place in an array. Either the entry being sought is in the vicinity, or it is not in the hash table. If the vicinity is full of other entries, the entry can still be in an exception list.
With this and cuckoo alike, there are 3 different variants of hash table:
A set of single unique integers
A single-integer key with 0 or more dependent values, possibly with a next link if the key is not unique
A key of n arbitrary values, 0 or more dependent values, optional next link if the key is not unique
In the first case, the hash table is an array of values; in the two other cases, it is an array of pointers. But since a pointer is 64 bits, of which the high 16 are not in the address space of x86_64, these high bits can be used to keep a part of the hash number. It will be necessary to dereference the pointer only if the high bits match the hash number. This means that nearly all lookups that do not find a match are handled with a single cache miss.
Each cache miss brings in a cache line of 8 words. The lookup starts at a point given by the hash number and wraps around at the end of the cache line. Only in the case that all 8 words are occupied but do not match does one need to look at the exceptions. There is one exception list for each partition of the hash table, like in the cuckoo scheme.
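The following sketch shows the idea under some simplifying assumptions (a slot array whose size is a multiple of 8, a hypothetical two-word entry, the hash tag kept in the high 16 bits of the pointer); it is not the Virtuoso code, but it illustrates why most misses cost a single cache line.

#include <stdint.h>
#include <stddef.h>

typedef struct { int64_t key; int64_t value; } entry_t;  /* illustrative entry */

#define TAG(x)   ((uint64_t)(x) >> 48)                    /* high 16 bits */
#define PTR(x)   ((entry_t *) ((x) & 0x0000ffffffffffffull))

/* Probe stays within the 8-word cache line containing the home slot.
   Assumes n_slots is a multiple of 8. */
static entry_t *
linear_find (const uint64_t *slots, size_t n_slots, int64_t key, uint64_t h)
{
  size_t pos  = h % n_slots;
  size_t line = pos & ~(size_t) 7;            /* start of the cache line */
  for (int i = 0; i < 8; i++)
    {
      uint64_t slot = slots[line + ((pos + i) & 7)];  /* wrap inside the line */
      if (!slot)
        return NULL;                          /* empty slot: a sure miss */
      if (TAG (slot) == TAG (h))              /* dereference only on tag match */
        {
          entry_t *e = PTR (slot);
          if (e->key == key)
            return e;
        }
    }
  return NULL;  /* line full of non-matches: consult the exception list */
}

With the tag kept in the pointer, a probe that misses usually touches only the slot array's cache line; the entry itself is read only when the 16-bit tag already agrees.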
A hash lookup is always done on a vector of keys; the loop that takes most of the time is in fact the Bloom filter check. It goes as follows:
#define CHB_VAR(n) \
  uint64 h##n, w##n, mask##n;

#define CHB_INIT(n, i) \
  MHASH_STEP_1 (h##n, i); \
  w##n = bf[BF_WORD (h##n, sz)]; \
  mask##n = BF_MASK (h##n);

#define CHB_CK(n) \
  { matches[mfill] = inx + n; \
    mfill += (w##n & mask##n) == mask##n; }

  for (inx = inx; inx < last; inx ++)
    {
      CHB_VAR (0);
      CHB_INIT (0, REF ((ce_first + sizeof (ELT_T) * inx)));
      CHB_CK (0);
    }
This is the perfect loop for out-of-order execution. Now, I have tried every variation you can imagine, and this does not get better. The loop calculates a hash number, fetches the corresponding word from the Bloom filter, calculates a mask, stores the index of the key in a results array, and increments the results counter if all the bits were set. There is no control dependency anywhere, just a data dependency between successive iterations; i.e., to know where the result must go, you must know if the previous was a hit.
You can unroll this loop very easily, so, for example, take 4 keys, do the numbers, fetch the words, and then check them one after the other. One would think this would have more misses in flight at any one time, which it does. But it does not run any faster.
Maybe the loop is too long. Circumstantial evidence suggests that short loops are better for instruction prefetching. So, one can also make a loop that gets any number of words of the Bloom filter and puts them in one local array and the hash numbers in another array. A subsequent loop then reads the hash number, calculates the mask, and checks if there is a hit. In this way one can generate as many misses as one wants and check them as late as one wants. It so happens that doing 8 misses and then checking them is better than either 4 or 16. But 8 is still marginally worse than the loop first mentioned.
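For illustration, here is roughly what that batched variant looks like: one loop issues the misses by fetching the filter words for a batch of 8 keys, a second loop checks the bits. hash64() and bloom_mask() are stand-ins for MHASH_STEP_1 and BF_MASK, not the real macros.

#include <stdint.h>

#define BATCH 8

static inline uint64_t
hash64 (int64_t k)                /* stand-in mixer, not MHASH_STEP_1 */
{
  uint64_t h = (uint64_t) k * 0x9e3779b97f4a7c15ull;
  return h ^ (h >> 29);
}

static inline uint64_t
bloom_mask (uint64_t h)           /* up to 4 bits, as in the earlier sketch */
{
  return (1ull << (h & 63)) | (1ull << ((h >> 6) & 63))
       | (1ull << ((h >> 12) & 63)) | (1ull << ((h >> 18) & 63));
}

static int
bloom_check_batched (const uint64_t *bf, uint64_t bf_words,
                     const int64_t *keys, int n, int *matches)
{
  uint64_t h[BATCH], w[BATCH];
  int mfill = 0;
  for (int base = 0; base < n; base += BATCH)
    {
      int cnt = n - base < BATCH ? n - base : BATCH;
      for (int i = 0; i < cnt; i++)
        {                                     /* start the cache misses */
          h[i] = hash64 (keys[base + i]);
          w[i] = bf[(h[i] >> 24) % bf_words];
        }
      for (int i = 0; i < cnt; i++)
        {                                     /* check them afterwards */
          uint64_t mask = bloom_mask (h[i]);
          matches[mfill] = base + i;
          mfill += (w[i] & mask) == mask;     /* predicated, no branch */
        }
    }
  return mfill;
}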
One can also vary the test. Instead of adding a truth value to the result counter, one can have
if ((word & mask) == mask) result[fill++] = inx;
There is no clear difference between predication (incrementing the fill by truth value) and a conditional jump. The theory of out-of-order execution would predict predication to be better, but the difference is lost in measurement noise. This is true on both Intel Nehalem (Xeon 55xx) and Sandy Bridge (E5 26xx), but could be different on other architectures.
The multicore scalability of the test will give some information about platform utilization.
This is the ultimately simplified selective hash join:
SELECT COUNT (*)
FROM lineitem,
part
WHERE l_partkey = p_partkey
AND p_size < 15
;
The next query is the second simplest hash join but misses the cache much more: since it now has a key and a dependent part in the hash table, there is an extra pointer to follow, and the hash entry is two words plus the pointer to them in the hash table array.
SELECT SUM (p_retailprice)
FROM lineitem,
part
WHERE l_partkey = p_partkey
AND p_size < 15
;
By adjusting the number of parts selected, we can vary the Bloom filter selectivity and the size of the hash table. Below, we show the times for the two queries with single-thread and 24-thread execution, with different percentages of the part table on the build side of the hash join. All runs are against warm 100G TPC-H on the same test system as in the rest of the TPC-H series (dual Xeon E5-2630).

The table compares the performance of the linear and cuckoo implementations on the above queries (COUNT vs. SUM) on either 24 threads or 1 thread. Four data points are given for different sizes of hash table, given as the percentage of the part table (400K to 20M entries) on the build side. The rightmost column, which represents the case where the entire part table is on the build side, does not have a Bloom filter; the other cases do. The Bloom bits are 8/4 for linear and 8/2 for cuckoo. Times are in milliseconds, with a comma as the thousands separator.
| Hash type | Query type | Threads | 2% (ms) | 10% (ms) | 30% (ms) | 100% (ms) |
|---|---|---|---|---|---|---|
| Linear | COUNT | 24 | 1,204 | 1,683 | 3,100 | 6,214 |
| Linear | SUM | 24 | 1,261 | 2,447 | 5,059 | 13,086 |
| Linear | COUNT | 1 | 15,286 | 22,451 | 38,863 | 66,722 |
| Linear | SUM | 1 | 17,575 | 33,664 | 81,927 | 179,013 |
| Cuckoo | COUNT | 24 | 1,849 | 2,840 | 4,105 | 6,203 |
| Cuckoo | SUM | 24 | 2,833 | 4,903 | 9,446 | 19,652 |
| Cuckoo | COUNT | 1 | 25,146 | 39,064 | 57,383 | 85,105 |
| Cuckoo | SUM | 1 | 33,647 | 67,089 | 121,989 | 240,941 |
We clearly see cache effects on the first two lines, where SUM and COUNT run in almost the same time on a small hash table but differ by 2x on the larger one. The instruction path length is not very different for SUM and COUNT, but the memory footprint has a 3x difference.

We note that the SMP scalability of linear is slightly better, comparing the ratio of 24-thread SUM to single-thread SUM. Both ratios are over 12x, indicating a net benefit from core multithreading. (The test system has 12 physical cores.) The linear hash systematically outperforms cuckoo, understandably, since it makes a smaller net number of cache misses. The overall effect on the TPC-H score is noticeable, at around 15-20K units of composite score at 100G.
In conclusion, the Virtuoso hash join implementation is certainly on the level, with only small gains to be expected from further vectoring and prefetching. These results may be reproduced using the v7fasttrack Virtuoso Open Source releases from GitHub; develop/7 for cuckoo and feature/analytics for linear hash.
To be continued...
In this article we look at what the server actually does. The execution profiles for all the queries are available for download. To experiment with parallelism, you may download the software and run it locally. An Amazon image may be provided later.
Below is the top of the oprofile output for a run of the 22 queries with qualification parameters against the 100G database. The operation in TPC-H terms is given under each function name.
CPU: Intel Sandy Bridge microarchitecture, speed 2299.98 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000

samples  %       symbol name
1406009  9.5117  ce_vec_int_range_hash
-- Bloom filter before a selective hash join where the hash join is the best or only condition on a scan, e.g., lineitem scan where l_partkey is checked against a Bloom filter
730935   4.9448  ce_vec_int_sets_hash
-- Bloom filter check where another condition is applied first, e.g., lineitem scan with a condition on l_shipdate, then Bloom filter check on l_partkey
617091   4.1746  hash_source_chash_input_1i_n
-- Q13 right outer hash join with probe from orders, build from customer; NOT EXISTS test in Q16
586938   3.9706  cs_decode
-- Generic reading of a column, all queries
536273   3.6279  ce_intd_range_ltgt
-- Date range comparison, most queries, e.g., Q1, 3, 4, 5, 6, 7, 8, 20
479898   3.2465  cha_cmp_1i
-- Q13 GROUP BY on c_custkey, Q15 GROUP BY on s_suppkey; indicates missing the cache on a high cardinality GROUP BY with a single INT grouping key
473721   3.2047  cha_inline_1i_n_int
-- Selective hash join after prefiltering with Bloom filter; checks only that the key is in the hash table, no dependent part, e.g., Q8, Q9, Q17, Q20, with lineitem filtered by part
463723   3.1371  ce_dict_generic_range_filter
-- Range condition on a low cardinality column (dictionary encoded), e.g., l_quantity, l_discount
425149   2.8761  cha_inline_1i_int
-- Hash join check that a single INT key with a dependent part is in a hash table, fetching the dependent part for hits
365040   2.4695  setp_chash_run
-- GROUP BY, e.g., GROUP BY on c_custkey in Q13
359645   2.4330  clrg_partition_dc
-- Partitioning a vector of values, occurs in all high cardinality GROUP BYs, e.g., Q13, Q15
349473   2.3642  gb_aggregate
-- Updating aggregates after the grouping keys are resolved, e.g., Q1
331926   2.2455  ce_dict_any_sets_decode
-- Fetching non-contiguous dictionary encoded strings, e.g., l_returnflag, l_linestatus in Q1
316731   2.1427  cha_insert_1i
-- Building a hash join hash table from a single INT key to a dependent part, e.g., Q14 from l_partkey to lineitems in a time window with a given l_partkey
286865   1.9406  ce_search_rld
-- Index lookup for run length delta compressed keys, e.g., l_orderkey, ps_partkey
231390   1.5654  cha_insert
-- Build of a hash join hash table for multipart keys, e.g., hash join with partsupp (Q9) or lineitem (Q17, Q20) on the build side
224070   1.5158  ce_dict_int64_sets_decode
-- Fetching a non-contiguous set of double column values from dictionary encoding, e.g., l_discount, l_quantity
218506   1.4782  ce_intd_any_sets_decode
-- Fetching a non-contiguous set of date column values, e.g., l_shipdate in Q9
200686   1.3576  page_wait_access
-- Translating page numbers into buffers in the buffer pool for column access
198854   1.3452  itc_col_seg
-- Generic part of table scan/index access
197645   1.3371  cha_insert_1i_n
-- Hash join build for hash tables with a single INT key and no dependent part, e.g., lineitem to part join in Q9, Q17, Q20
195300   1.3212  cha_bloom_unroll_a
-- Bloom filter for hash-based IN predicate
192309   1.3010  hash_source_chash_input
-- Hash join probe with a multi-part key, e.g., Q9 against partsupp
191313   1.2942  strstr_sse42
-- SSE 4.2 substring match, e.g., NOT LIKE condition in Q13
186325   1.2605  itc_fetch_col_vec
-- Translating page numbers into buffers in the buffer pool for column access
159198   1.0770  ce_vec_int64_sets_decode
-- Fetching non-contiguous 64-bit values from an array-represented column, e.g., l_extendedprice
So what does TPC-H do? It is all selective hash joins. Then it is table scans with a condition on a DATE column; the DATE column part is because this implementation does not order the big tables (lineitem, orders) on their DATE columns. Third, TPC-H does big GROUP BYs, with Q13 representing most of this; all other GROUP BYs have at least 10x fewer groups. Then it just extracts column values, most often after selecting the rows on some condition; quite often the condition is a foreign key column of the row finding a hit in a hash table. This last pattern is called an invisible hash join.
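As a rough sketch of the invisible hash join pattern (reusing the illustrative helpers from the hash join article earlier in this series: hash64, bloom_maybe_contains, linear_find; none of these are actual Virtuoso functions): the foreign key column is filtered against the Bloom filter and the hash table as part of the scan itself, and only surviving row numbers reach the operators that fetch further columns.

/* Sketch of an invisible hash join: while scanning a vector of foreign
   keys, each key goes through the Bloom filter and then a single-key
   hash lookup; only qualifying row numbers survive to later operators. */
static int
scan_with_invisible_hash_join (const int64_t *fk_col, int n_rows,
                               const uint64_t *bf, uint64_t bf_words,
                               const uint64_t *hash_slots, size_t n_slots,
                               int *out_rows)
{
  int fill = 0;
  for (int r = 0; r < n_rows; r++)
    {
      uint64_t h = hash64 (fk_col[r]);
      if (!bloom_maybe_contains (bf, bf_words, h))
        continue;                 /* most non-matching rows stop here */
      if (linear_find (hash_slots, n_slots, fk_col[r], h))
        out_rows[fill++] = r;     /* other columns are fetched only for these */
    }
  return fill;
}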
Then there are some index lookups, but there the join is usually a merge pattern, like reading orders in order of o_orderkey and then getting l_orderkey matches from lineitem (e.g., Q21). There are also quite often partitioning operations, i.e., many threads produce tuples, and each tuple goes to a consumer thread selected based on a partitioning column of the tuple. This is also called exchange. It occurs in all the high cardinality GROUP BYs and in the RIGHT OUTER JOIN of Q13.
Whether the implementation is RAM-only or paging from disk makes next to no difference. Virtuoso and Actian Vector both have a buffer pool, with Virtuoso using substantially smaller pages (8K vs. 256K or 512K). The page-number-to-buffer translation and the latching that goes with it are under 3% (itc_fetch_col_vec, page_wait_access). Of course, actually accessing secondary storage would kill the score, but checking that something is in memory is cheap as long as the check always hits.
So, once the query plans are right, the problem resolves into a few different loops. The bulk of the effort in making a TPC-H implementation is in query optimization so that the right loops run in the right order. I will further explain what the loops should contain in the next article.
Many of the functions seen in the profile are instantiations of a template for a specific data type. There are also different templates for variants of a data structure, like a hash table with different keys.
Compilation is a way of generating exactly the right loop for any set of data types. We do not find much need for this here, though. The type-specific operations that are in fact needed are anticipatable and can be predefined based on templates.
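As an illustration of the template approach (the names and the operation chosen here are made up, not the profiled functions), the same loop body can be stamped out per element type with a macro, giving the compiler a tight, type-specific loop without runtime dispatch:

#include <stdint.h>

/* One template, several type-specific instantiations.  The function names
   and the operation (a range filter) are illustrative only. */
#define DEF_RANGE_FILTER(NAME, TYPE)                                   \
  static int NAME (const TYPE *col, int n, TYPE lo, TYPE hi,           \
                   int *matches)                                       \
  {                                                                    \
    int fill = 0;                                                      \
    for (int i = 0; i < n; i++)                                        \
      {                                                                \
        matches[fill] = i;                                             \
        fill += (col[i] >= lo) & (col[i] <= hi);  /* predicated */     \
      }                                                                \
    return fill;                                                       \
  }

DEF_RANGE_FILTER (range_filter_int32, int32_t)
DEF_RANGE_FILTER (range_filter_int64, int64_t)
DEF_RANGE_FILTER (range_filter_double, double)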
There are possible gains in the following domains:
Query optimization - Q15 and Q17 do some work in duplicate, so reuse will roughly cut their time in half. In addition to reuse, there is a possibility to convert a scalar subquery into a derived table with a GROUP BY (decorrelation). Decorrelation also applies in Q20. Reuse will give some 10-12K more score; decorrelation is probably below measurement noise.
Group join - Merging the GROUP BY and the hash probe in Q13 will save 2% of time on the throughput test, maybe 5K more score.
Better parallelism in power test - A good power test takes 42s, and 5 times the work takes 165s, so the platform utilization of the power test is roughly 4/5. This could be slightly better.
NUMA - Up to 20% performance gains have been seen on scale-out configurations from binding a process to one CPU socket. This gain will be readily seen in a scale-out setting when running one process per physical CPU. Gains in the NUMA department are rather fickle and fragile and are worth less in real life than in benchmarks. So I would say that work in single-server NUMA optimization is not immediately needed since scale-out configurations will get these gains as soon as one sets affinities for processes. But whether binding processes to CPUs makes sense depends on how even the workload is. TPC-H is even, but reality is less so.
In principle, a composite score of a good run could go from 250K to 280K by diverse incremental improvements. There is some noise in run times, so two consecutive runs can deviate by a few percent, which is already significant when talking of small improvements.
For a 100 GB run with 5 throughput streams, the peak query memory consumption is 3.5 GB. This includes all the GROUP BY and hash join tables that are made. Q13 is the most expensive in terms of space and allocates a peak of 1 GB for a single execution. This is trivial in comparison with the working set. But at 100x greater scale (10,000 GB, or 10 TB) this becomes about 630 GB (3.5 GB x 100, scaled by 9/5), as there will be a minimum of 9 concurrent streams instead of 5.
Now 10 TB is clearly a scale-out size, so the 630 GB transient memory is not on a single machine.
Still, going to large scale-out will change the nature of some queries and introduce significant data transfer. We will know soon enough.
A small-scale TPC-H, e.g., 100 GB, starts to have features of a lookup workload. This means that there is high variability between consecutive executions, and that the pre-execution state of the system has a large effect on a measurement.
The rules say that power test follows bulk load with maybe some checking of correctness of load in between. The bulk load is basically unregulated and usually will include statistics gathering.
The first power test shows significant variation, anything from 220 K to 240 K, while the second power test is steadily around 255 K. Since the reported score is the lower of the two, the biggest return to the implementor is in making sure the first power test is good.
The second throughput test is usually 5-10K higher than the first; the throughput test is less sensitive. The difference does not come from I/O, but from system time for memory allocation: the memory, in the same quantities and block sizes, is reused by the second run.
A good power run is 42s from 100% warm RAM. A power run that gets the data from the OS disk buffers is 70s or so. A power run that gets data from SSD is worse, maybe 120s.
To cite an example, increasing the buffer pool size from 64 GB to 72 GB gets the first post-load power test from 120-150K to 230-240K while having no appreciable effect on subsequent tests. The effect is exacerbated by the fact that the power score is based on a geometric mean of run times. Very short queries (e.g., Q2) vary between consecutive in-memory executions from 120ms to 220ms. A similar variation occurs in Q13, which lands on either side of 6s. Due to the geometric mean, the same absolute variability has very different impact depending on which query it hits. A Q2 that reads data from out of process can take 2s instead of the expected under-200ms. This kills the score even if a delay of 1.8s as such did not matter. So increasing the buffer pool in the example just serves to make sure the small supplier table is in memory. Fetching it from the OS is simply not an option in the first Q2, even if it were an option in a longer-running query. Remember, the lower of the two scores is reported, and the first power test will be bad unless it is somehow primed by some trick like bulk load order.
When differences between implementations are small, variation between consecutive runs becomes important. This is why OLTP benchmarks are required to run for a relatively long time and only measure the steady state portion. This would also be appropriate for small TPC-H runs.
Query performance reduces to a few loops when all the conditions are right. Getting the conditions right depends on query optimization and the right architecture choices. These results would be unthinkable without vectored execution and a compressed column store design, but these in and of themselves guarantee nothing unless all the plans are right. A previous Virtuoso 7 takes over 10x longer because of bad plans and because it misses some key execution techniques, like the RIGHT OUTER JOIN in Q13 and the partitioned GROUP BY.
Last summer, we did runs of the much simpler Star Schema Benchmark. The results were very good, because the engine had the much smaller set of tricks and the right plans. Repeating these tests now would show some gains from a still better hash join but nothing dramatic.
In the next article we will look at the finer points of hash join. After this we move to larger scales and clusters.
To be continued...
All the code is in the feature/analytics branch of the v7fasttrack git repository on GitHub.
Start by checking out and compiling Virtuoso Open Source (VOS).
git clone https://github.com/v7fasttrack/virtuoso-opensource
cd virtuoso-opensource
git checkout feature/analytics
./autogen.sh
export CFLAGS="-msse4.2 -DSSE42"
./configure
make -j 24
make install
The system should be an x86_64, Intel Core i7 or later, with SSE 4.2 support. (Running without SSE 4.2 is possible, but for a better score you need to define it before running configure, as above.) The gcc may be any version that supports SSE 4.2.
To have a good result, the system should have at least 96 GB of RAM and SSD.
To get a good load time, both the database files and the CSV files made by dbgen should be on SSD.
Copy the binsrc/tests/tpc-h directory from the check-out to an SSD, if it is not already on one. Set $HOME to point to the root directory of the check-out. Rename the virtuoso.ini-100G in the tpc-h directory to virtuoso.ini. Edit the database file paths in the virtuoso.ini. Where it says --

Segment1 = 1024, /1s1/dbs/tpch100cp-1.db = q1, /1s2/dbs/tpch100cp-2.db = q2

-- change the file names, and add or remove files as appropriate, each file with a different = qn, until you have one file per independent device. If this is a RAID, one file per distinct device in the RAID usually brings improvement. Edit the TransactionFile entry, and replace /1s2/dbs/ with a suitable path.
Edit ThreadsPerQuery to be the number of threads on the machine. For i7, this is double the number of cores; your environment may vary. AsyncQueueMaxThreads should be set to double ThreadsPerQuery.
The memory settings (NumberOfBuffers and MaxDirtyBuffers) are OK for 100G scale. For larger scales, make the memory settings correspondingly larger, not to exceed 75% of system memory. Count 8.5 KB per buffer. If you have less memory, you can decrease these. If so, the first power test will be hit the worst, so the scores will not be as good.
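For example, a quick way to derive the two settings from available RAM, using the 8.5 KB-per-buffer figure and the 75% ceiling mentioned above (the 3/4 ratio of MaxDirtyBuffers to NumberOfBuffers here is an assumption, not a documented rule):

#include <stdio.h>

int
main (void)
{
  double ram_gb = 192.0;                            /* system RAM in GB */
  double budget = ram_gb * 0.75 * 1073741824.0;     /* at most 75% of RAM, in bytes */
  long  n_buffers = (long) (budget / 8704.0);       /* 8.5 KB per buffer */
  long  max_dirty = n_buffers * 3 / 4;              /* assumed ratio */
  printf ("NumberOfBuffers = %ld\nMaxDirtyBuffers = %ld\n", n_buffers, max_dirty);
  return 0;
}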
The default BIOS settings are usually OK. Disabling prefetch of adjacent cache line does not help, and turning off core threads does not help either.
For 100G, the data files for loading will take 100 GB, and the database files will take 88 GB divided among however many files. Be sure there is enough space before you start.
In the tpc-h directory (copy of binsrc/tests/tpc-h) --
./gen.sh 100 5 2
The first parameter is the scale factor; the second is the number of streams to use in the throughput test; the last is the number of consecutive test runs. The minimum number of streams is 5 for 100G; each successive scale adds one more. A larger number of streams is allowed, but will not make a better result in this case. A test always consists of 2 runs; you could specify more, but the extra tests will not influence the score.
Making the data files takes the longest time. You may run dbgen multithreaded to make the dataset in parallel, but then the load scripts will have to be changed to match.
Start Virtuoso.
./load.sh 100
Looking at iostat, you will see a read rate of about 140 MB/s from the source files.
./run.sh 100 5 2
The parameters have the same meaning as in gen.sh, and the same values must be specified.
The run produces two main files, report1.txt and report2.txt. These are the numerical quantity summaries for the first and second run. Additionally, there are output files for each query stream. The suppfiles.sh script can be used to collect the supporting-files archive for a TPC-H full disclosure report.
The database is left running. To reuse the loaded data for another experiment, kill the Virtuoso process, delete the transaction log, and restart. This will have the data in the post-load state. To get a warm cache, use the warm.sql script in the tpc-h directory.
On the test system used in this series, 12 core E5 at 2.3GHz, we expect 240K for the first run and 250K for the second. With a top-of-the-line E5, we expect around 400K. For an 8-core 2.26GHz Nehalem, we expect 150K.
If you get scores that are significantly different, something is broken; we would like to know about this.
If you have this audited according to the TPC rules, you will be allowed to call this a TPC-H result. Without such audit, the result should be labeled Virt-H.
To be continued...
The test consists of a bulk load followed by two runs. Each run consists of a single user power test and a multi-user throughput test. The number of users in the throughput test is up to the test sponsor but must be at least 5 for the 100 GB scale. The reported score is the lower of the two scores.
| Scale Factor | 100 GB |
|---|---|
| dbgen | version 2.15 |
| Load time | 0:15:02 |
| Composite qph | 241,482.3 |
| System Availability Date | 2014-04-22 |
The price/performance is left open. The hardware costs about 5000 euros, and the software is open source, so the cost per performance would be a minimum of 0.02 euros per qph at 100G. This is not compliant with the TPC pricing rules, though; these require 3-year maintenance contracts for all parts.
The configuration did not use RAID. Otherwise, the setup would be auditable, to the best of my knowledge. The hardware would have to come from Dell, HP, or another large brand to satisfy the TPC pricing rule.
Report Date | 2014-04-21 |
---|---|
Database Scale Factor | 100 |
Total Data Storage/Database Size | 1 TB / 87,496 MB |
Start of Database Load | 2014-04-21 21:02:43 |
End of Database Load | 2014-04-21 21:17:45 |
Database Load Time | 0:15:02 |
Query Streams for Throughput Test | 5 |
Virt-H Power | 239,785.1 |
Virt-H Throughput | 243,191.4 |
Virt-H Composite Query-per-Hour Metric (Qph@100GB) | 241,482.3 |
Measurement Interval in Throughput Test (Ts) | 162.935000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 2014-04-21 21:17:46 | 2014-04-21 21:18:33 | 0:00:47 |
Stream 1 | 2014-04-21 21:18:33 | 2014-04-21 21:21:13 | 0:02:40 |
Stream 2 | 2014-04-21 21:18:33 | 2014-04-21 21:21:13 | 0:02:40 |
Stream 3 | 2014-04-21 21:18:33 | 2014-04-21 21:21:06 | 0:02:33 |
Stream 4 | 2014-04-21 21:18:33 | 2014-04-21 21:21:10 | 0:02:37 |
Stream 5 | 2014-04-21 21:18:33 | 2014-04-21 21:21:16 | 0:02:43 |
Refresh 0 | 2014-04-21 21:17:46 | 2014-04-21 21:17:49 | 0:00:03 |
 | 2014-04-21 21:17:50 | 2014-04-21 21:17:51 | 0:00:01 |
Refresh 1 | 2014-04-21 21:19:25 | 2014-04-21 21:19:38 | 0:00:13 |
Refresh 2 | 2014-04-21 21:18:33 | 2014-04-21 21:18:48 | 0:00:15 |
Refresh 3 | 2014-04-21 21:18:49 | 2014-04-21 21:19:01 | 0:00:12 |
Refresh 4 | 2014-04-21 21:19:01 | 2014-04-21 21:19:13 | 0:00:12 |
Refresh 5 | 2014-04-21 21:19:13 | 2014-04-21 21:19:25 | 0:00:12 |
Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | |
---|---|---|---|---|---|---|---|---|
Stream 0 | 2.311882 | 0.383459 | 1.143286 | 0.439926 | 1.594027 | 0.736482 | 1.440826 | 1.198925 |
Stream 1 | 5.192341 | 0.952574 | 6.184940 | 1.194804 | 6.998207 | 5.122059 | 5.962717 | 6.773401 |
Stream 2 | 7.354001 | 1.191604 | 4.238262 | 1.770639 | 5.782669 | 1.357578 | 4.034697 | 6.354747 |
Stream 3 | 6.489788 | 1.585291 | 4.645022 | 3.358926 | 7.904636 | 3.220767 | 5.694622 | 7.431067 |
Stream 4 | 5.609555 | 1.066582 | 6.740518 | 2.503038 | 9.439980 | 3.424101 | 4.404849 | 4.256317 |
Stream 5 | 10.346825 | 1.787459 | 4.391000 | 3.151059 | 4.974037 | 2.932079 | 6.191782 | 3.619255 |
Min Qi | 5.192341 | 0.952574 | 4.238262 | 1.194804 | 4.974037 | 1.357578 | 4.034697 | 3.619255 |
Max Qi | 10.346825 | 1.787459 | 6.740518 | 3.358926 | 9.439980 | 5.122059 | 6.191782 | 7.431067 |
Avg Qi | 6.998502 | 1.316702 | 5.239948 | 2.395693 | 7.019906 | 3.211317 | 5.257733 | 5.686957 |
Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 | |
Stream 0 | 4.476940 | 2.004782 | 2.070967 | 1.015134 | 7.995799 | 2.142581 | 1.989357 | 1.581758 |
Stream 1 | 11.351299 | 6.657059 | 7.719765 | 5.157236 | 25.156379 | 8.566067 | 7.028898 | 8.146883 |
Stream 2 | 13.954105 | 8.341359 | 10.265949 | 3.289724 | 25.249435 | 6.370577 | 11.262650 | 7.684574 |
Stream 3 | 13.597277 | 5.783821 | 5.944240 | 5.214661 | 24.253991 | 8.742896 | 7.701709 | 5.801641 |
Stream 4 | 15.612070 | 6.126494 | 4.533748 | 5.733828 | 23.021583 | 6.423207 | 8.358223 | 6.866477 |
Stream 5 | 8.421209 | 9.040726 | 7.799425 | 3.908758 | 23.342975 | 9.934672 | 11.455598 | 8.258504 |
Min Qi | 8.421209 | 5.783821 | 4.533748 | 3.289724 | 23.021583 | 6.370577 | 7.028898 | 5.801641 |
Max Qi | 15.612070 | 9.040726 | 10.265949 | 5.733828 | 25.249435 | 9.934672 | 11.455598 | 8.258504 |
Avg Qi | 12.587192 | 7.189892 | 7.252625 | 4.660841 | 24.204873 | 8.007484 | 9.161416 | 7.351616 |
Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 | |
Stream 0 | 2.258070 | 0.981896 | 1.161602 | 1.933124 | 2.203497 | 1.042949 | 3.349407 | 1.296630 |
Stream 1 | 8.213340 | 4.070175 | 5.662723 | 12.260503 | 7.792825 | 3.323136 | 9.296430 | 3.939927 |
Stream 2 | 16.754827 | 3.895688 | 4.413773 | 7.529466 | 6.288539 | 2.717479 | 11.222082 | 4.135510 |
Stream 3 | 8.486809 | 2.615640 | 7.426936 | 7.274289 | 6.706145 | 3.402654 | 8.278881 | 4.260483 |
Stream 4 | 12.604905 | 7.735042 | 5.627039 | 6.343302 | 7.242370 | 3.492640 | 6.503095 | 3.698821 |
Stream 5 | 8.221733 | 2.670036 | 5.866626 | 13.108081 | 9.428098 | 4.282014 | 8.213320 | 4.088321 |
Min Qi | 8.213340 | 2.615640 | 4.413773 | 6.343302 | 6.288539 | 2.717479 | 6.503095 | 3.698821 |
Max Qi | 16.754827 | 7.735042 | 7.426936 | 13.108081 | 9.428098 | 4.282014 | 11.222082 | 4.260483 |
Avg Qi | 10.856323 | 4.197316 | 5.799419 | 9.303128 | 7.491595 | 3.443585 | 8.702762 | 4.024612 |
Report Date | 2014-04-21 |
---|---|
Database Scale Factor | 100 |
Total Data Storage/Database Size | 1 TB / 87,496 MB |
Start of Database Load | 2014-04-21 21:02:43 |
End of Database Load | 2014-04-21 21:17:45 |
Database Load Time | 0:15:02 |
Query Streams for Throughput Test | 5 |
Virt-H Power | 257,944.7 |
Virt-H Throughput | 240,998.0 |
Virt-H Composite Query-per-Hour Metric (Qph@100GB) | 249,327.4 |
Measurement Interval in Throughput Test (Ts) | 164.417000 seconds |
Start Date/Time | End Date/Time | Duration | |
---|---|---|---|
Stream 0 | 2014-04-21 21:21:20 | 2014-04-21 21:22:01 | 0:00:41 |
Stream 1 | 2014-04-21 21:22:02 | 2014-04-21 21:24:41 | 0:02:39 |
Stream 2 | 2014-04-21 21:22:02 | 2014-04-21 21:24:41 | 0:02:39 |
Stream 3 | 2014-04-21 21:22:02 | 2014-04-21 21:24:41 | 0:02:39 |
Stream 4 | 2014-04-21 21:22:02 | 2014-04-21 21:24:44 | 0:02:42 |
Stream 5 | 2014-04-21 21:22:02 | 2014-04-21 21:24:46 | 0:02:44 |
Refresh 0 | 2014-04-21 21:21:20 | 2014-04-21 21:21:22 | 0:00:02 |
 | 2014-04-21 21:21:22 | 2014-04-21 21:21:23 | 0:00:01 |
Refresh 1 | 2014-04-21 21:22:49 | 2014-04-21 21:23:04 | 0:00:15 |
Refresh 2 | 2014-04-21 21:22:01 | 2014-04-21 21:22:14 | 0:00:13 |
Refresh 3 | 2014-04-21 21:22:14 | 2014-04-21 21:22:27 | 0:00:13 |
Refresh 4 | 2014-04-21 21:22:26 | 2014-04-21 21:22:39 | 0:00:13 |
Refresh 5 | 2014-04-21 21:22:39 | 2014-04-21 21:22:49 | 0:00:10 |
Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | |
---|---|---|---|---|---|---|---|---|
Stream 0 | 2.437262 | 0.227516 | 1.172620 | 0.541201 | 1.542084 | 0.743255 | 1.459368 | 1.183166 |
Stream 1 | 5.205225 | 0.499833 | 4.854558 | 4.818087 | 5.920773 | 3.347414 | 5.446411 | 3.723247 |
Stream 2 | 5.833803 | 0.659051 | 6.023266 | 3.123523 | 4.358200 | 3.371315 | 6.772453 | 4.978415 |
Stream 3 | 6.308935 | 0.662744 | 7.573807 | 5.000859 | 5.282467 | 4.391930 | 5.280472 | 7.852718 |
Stream 4 | 5.791856 | 0.421592 | 5.953592 | 4.688037 | 9.949038 | 3.098282 | 4.153124 | 4.824209 |
Stream 5 | 13.537098 | 1.760386 | 3.308982 | 2.299178 | 4.882695 | 2.652497 | 5.383128 | 10.178447 |
Min Qi | 5.205225 | 0.421592 | 3.308982 | 2.299178 | 4.358200 | 2.652497 | 4.153124 | 3.723247 |
Max Qi | 13.537098 | 1.760386 | 7.573807 | 5.000859 | 9.949038 | 4.391930 | 6.772453 | 10.178447 |
Avg Qi | 7.335383 | 0.800721 | 5.542841 | 3.985937 | 6.078635 | 3.372288 | 5.407118 | 6.311407 |
Q9 | Q10 | Q11 | Q12 | Q13 | Q14 | Q15 | Q16 | |
Stream 0 | 4.441940 | 1.948770 | 2.154384 | 1.148494 | 6.014453 | 1.647725 | 1.437587 | 1.585284 |
Stream 1 | 14.127674 | 7.824844 | 7.100679 | 3.586457 | 28.216115 | 7.587547 | 9.859152 | 5.829869 |
Stream 2 | 16.102880 | 7.676986 | 5.887327 | 2.796729 | 24.847035 | 7.146757 | 11.408922 | 7.641239 |
Stream 3 | 15.678701 | 5.786427 | 9.221883 | 2.692321 | 28.434916 | 6.657457 | 8.219745 | 7.706585 |
Stream 4 | 11.985421 | 10.182807 | 5.667618 | 6.875264 | 27.547492 | 7.438075 | 9.065924 | 8.895070 |
Stream 5 | 6.913707 | 7.662703 | 8.657333 | 3.282895 | 24.126612 | 10.963691 | 12.138564 | 7.962654 |
Min Qi | 6.913707 | 5.786427 | 5.667618 | 2.692321 | 24.126612 | 6.657457 | 8.219745 | 5.829869 |
Max Qi | 16.102880 | 10.182807 | 9.221883 | 6.875264 | 28.434916 | 10.963691 | 12.138564 | 8.895070 |
Avg Qi | 12.961677 | 7.826753 | 7.306968 | 3.846733 | 26.634434 | 7.958705 | 10.138461 | 7.607083 |
Q17 | Q18 | Q19 | Q20 | Q21 | Q22 | RF1 | RF2 | |
Stream 0 | 2.275267 | 1.139390 | 1.165591 | 2.073658 | 2.261869 | 0.703055 | 2.327755 | 1.146501 |
Stream 1 | 13.720792 | 4.428528 | 3.651645 | 9.841610 | 6.710473 | 2.595879 | 9.783844 | 3.800103 |
Stream 2 | 12.532257 | 2.312755 | 6.182661 | 8.666967 | 9.383983 | 1.414853 | 7.570509 | 4.539598 |
Stream 3 | 7.578779 | 3.342352 | 8.155356 | 4.925493 | 6.590047 | 2.612912 | 8.497542 | 4.638512 |
Stream 4 | 10.967178 | 2.173935 | 6.382803 | 5.082562 | 8.744671 | 3.074768 | 7.577794 | 4.435140 |
Stream 5 | 9.438581 | 2.551124 | 8.375607 | 8.339441 | 8.201650 | 1.982935 | 7.334306 | 3.404017 |
Min Qi | 7.578779 | 2.173935 | 3.651645 | 4.925493 | 6.590047 | 1.414853 | 7.334306 | 3.404017 |
Max Qi | 13.720792 | 4.428528 | 8.375607 | 9.841610 | 9.383983 | 3.074768 | 9.783844 | 4.638512 |
Avg Qi | 10.847517 | 2.961739 | 6.549614 | 7.371215 | 7.926165 | 2.336269 | 8.152799 | 4.163474 |
| Hardware | |
|---|---|
| Chassis | Supermicro 2U |
| Motherboard | Supermicro X9DR3-LN4F+ |
| CPU | 2 x Intel Xeon E5-2630 @ 2.3 GHz (6 cores, 12 threads each; total 12 cores, 24 threads) |
| RAM | 192 GB DDR3 (24 x 8 GB, 1066 MHz) |
| Storage | 2 x Crucial 512 GB SSD |

| Software | |
|---|---|
| DBMS | Virtuoso Open Source 7.11.3209 (feature/analytics on v7fasttrack on GitHub) |
| OS | CentOS 6.2 |
This experiment places Virtuoso in the ballpark with Actian Vector (formerly branded Vectorwise), which has dominated the TPC-H scoreboard in recent years. The published Vector results are on more cores and/or a faster clock; one would have to run on the exact same platform to make precise comparisons.
Virtuoso ups the ante by providing this level of performance in open source. For a comparison with EXASolution and Actian Matrix (formerly ParAccel), we will have to go to the Virtuoso scale-out configuration, to follow shortly.
The next articles will provide a detailed analysis of performance and instructions for reproducing the results. The run outputs and scripts are available for download.
To be continued...
Analytical queries typically cut their results down to size with a TOP k operator (i.e., show only the 10 best by some metric) and/or by grouping and aggregation (i.e., for a set of items, show some attributes of these items and a sum, count, or other aggregate of dependent items for each).
In this installment we will look at late projection, also sometimes known as late materialization. If many attributes are returned and there is a cutoff of some sort, then the query does not need to be concerned about attributes on which there are no conditions, except for fetching them at the last moment, only for the entities which in fact will be returned to the user.
We look at TPC-H Q2 and Q10.
SELECT TOP 100
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
FROM part,
supplier,
partsupp,
nation,
region
WHERE p_partkey = ps_partkey
AND s_suppkey = ps_suppkey
AND p_size = 15
AND p_type LIKE '%BRASS'
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'EUROPE'
AND ps_supplycost =
( SELECT MIN(ps_supplycost)
FROM partsupp,
supplier,
nation,
region
WHERE p_partkey = ps_partkey
AND s_suppkey = ps_suppkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'EUROPE'
)
ORDER BY s_acctbal DESC,
n_name,
s_name,
p_partkey
The intent is to return information about parts and suppliers, such that the part is available from a supplier in Europe, and the supplier has the lowest price for the part among all European suppliers.
SELECT TOP 20
c_custkey,
c_name,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
FROM customer,
orders,
lineitem,
nation
WHERE c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate >= CAST ('1993-10-01' AS DATE)
AND o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
AND l_returnflag = 'R'
AND c_nationkey = n_nationkey
GROUP BY c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
ORDER BY revenue DESC
The intent is to list the customers who cause the greatest loss of revenue in a given quarter by returning items ordered in said quarter.
We notice that both queries return many columns on which there are no conditions, and that both have a cap on returned rows. The difference is that in Q2 the major ORDER BY is on a grouping column, while in Q10 it is on the aggregate of the GROUP BY. Thus the TOP k trick discussed in the previous article applies to Q2 but not to Q10.
The profile for Q2 follows:
{
time 6.1e-05% fanout 1 input 1 rows
time 1.1% fanout 1 input 1 rows
{ hash filler
Subquery 27
{
time 0.0012% fanout 1 input 1 rows
REGION 1 rows(t10.R_REGIONKEY)
R_NAME = <c EUROPE>
time 0.00045% fanout 5 input 1 rows
NATION 5 rows(t9.N_NATIONKEY)
N_REGIONKEY = t10.R_REGIONKEY
time 1.6% fanout 40107 input 5 rows
SUPPLIER 4.2e+04 rows(t8.S_SUPPKEY)
S_NATIONKEY = t9.N_NATIONKEY
After code:
0: t8.S_SUPPKEY := := artm t8.S_SUPPKEY
4: BReturn 0
time 0.1% fanout 0 input 200535 rows
Sort hf 49 (t8.S_SUPPKEY)
}
}
time 0.0004% fanout 1 input 1 rows
{ fork
time 21% fanout 79591 input 1 rows
PART 8e+04 rows(.P_PARTKEY)
P_TYPE LIKE <c %BRASS> LIKE <c > , P_SIZE = 15
time 44% fanout 0.591889 input 79591 rows
Precode:
0: {
time 0.083% fanout 1 input 79591 rows
time 0.13% fanout 1 input 79591 rows
{ fork
time 24% fanout 0.801912 input 79591 rows
PARTSUPP 3.5 rows(.PS_SUPPKEY, .PS_SUPPLYCOST)
inlined PS_PARTKEY = k_.P_PARTKEY
hash partition+bloom by 62 (tmp)hash join merged always card 0.2 -> ()
time 1.3% fanout 0 input 63825 rows
Hash source 49 merged into ts not partitionable 0.2 rows(.PS_SUPPKEY) -> ()
After code:
0: min min.PS_SUPPLYCOSTset no set_ctr
5: BReturn 0
}
After code:
0: aggregate := := artm min
4: BReturn 0
time 0.19% fanout 0 input 79591 rows
Subquery Select(aggregate)
}
8: BReturn 0
PARTSUPP 5e-08 rows(.PS_SUPPKEY)
inlined PS_PARTKEY = k_.P_PARTKEY PS_SUPPLYCOST = k_scalar
time 5.9% fanout 0.247023 input 47109 rows
SUPPLIER unq 0.9 rows (.S_ACCTBAL, .S_NATIONKEY, .S_NAME, .S_SUPPKEY)
inlined S_SUPPKEY = .PS_SUPPKEY
top k on S_ACCTBAL
time 0.077% fanout 1 input 11637 rows
NATION unq 1 rows (.N_REGIONKEY, .N_NAME)
inlined N_NATIONKEY = .S_NATIONKEY
time 0.051% fanout 1 input 11637 rows
REGION unq 0.2 rows ()
inlined R_REGIONKEY = .N_REGIONKEY R_NAME = <c EUROPE>
time 0.42% fanout 0 input 11637 rows
Sort (.S_ACCTBAL, .N_NAME, .S_NAME, .P_PARTKEY) -> (.S_SUPPKEY)
}
time 0.0016% fanout 100 input 1 rows
top order by read (.S_SUPPKEY, .P_PARTKEY, .N_NAME, .S_NAME, .S_ACCTBAL)
time 0.02% fanout 1 input 100 rows
PART unq 0.95 rows (.P_MFGR)
inlined P_PARTKEY = .P_PARTKEY
time 0.054% fanout 1 input 100 rows
SUPPLIER unq 1 rows (.S_PHONE, .S_ADDRESS, .S_COMMENT)
inlined S_SUPPKEY = k_.S_SUPPKEY
time 6.7e-05% fanout 0 input 100 rows
Select (.S_ACCTBAL, .S_NAME, .N_NAME, .P_PARTKEY, .P_MFGR, .S_ADDRESS, .S_PHONE, .S_COMMENT)
}
128 msec 1007% cpu, 196992 rnd 2.53367e+07 seq 50.4135% same seg 45.3574% same pg
The query starts with a scan looking for the qualifying parts. It then looks for the best price for each part from a European supplier. All the European suppliers have previously been put in a hash table by the hash filler subquery at the start of the plan. Thus, to find the minimum price, the query takes the partsupp rows for the part by index, and then eliminates all non-European suppliers by a selective hash join. After this, there is a second index lookup on partsupp where we look for the part and the price equal to the minimum price found earlier. These operations could in principle be merged, as the minimum-price partsupp has already been seen. The gain would not be very large, though.
Here we note that the cost model guesses that very few rows will survive the check of ps_supplycost = minimum cost. It does not know that the minimum is not just any value, but one of the values that do occur in the ps_supplycost column for the part. Because of this, the remainder of the plan is carried out by index, which is just as well. The point is that if very few rows of input are expected, it is not worthwhile to build a hash table for a hash join. The hash table made for the European suppliers could be reused here, maybe with some small gain; it would however need more columns, which might make it not worthwhile. We note that the major order with the TOP k is on the supplier s_acctbal, hence as soon as 100 suppliers have been found, one can add a restriction on s_acctbal for subsequent candidates.
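A minimal sketch of that TOP k restriction, with a simple sorted array standing in for the real top-k structure: once 100 values have been collected, the lowest retained s_acctbal becomes a cutoff that later candidates must beat before any further work is done on them.

#include <float.h>

#define K 100

typedef struct
{
  double vals[K];   /* retained s_acctbal values, descending */
  int    fill;
} topk_t;

/* Candidates below this threshold cannot make the top 100. */
static double
topk_threshold (const topk_t *t)
{
  return t->fill < K ? -DBL_MAX : t->vals[K - 1];
}

static void
topk_offer (topk_t *t, double v)
{
  if (v <= topk_threshold (t))
    return;                                   /* cheap reject, no insert */
  int i = t->fill < K ? t->fill : K - 1;      /* drop the old minimum if full */
  while (i > 0 && t->vals[i - 1] < v)
    {
      t->vals[i] = t->vals[i - 1];            /* shift smaller values down */
      i--;
    }
  t->vals[i] = v;
  if (t->fill < K)
    t->fill++;
}

The scan can consult topk_threshold() before doing any per-row work on a candidate, which is exactly the restriction on s_acctbal mentioned above.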
At the end of the plan, after the TOP k ORDER BY and the reading of the results, we have a separate index-based lookup that fetches only the columns to be returned. We note that this is done on 100 rows, whereas the previous operations were done on tens of thousands of rows. The TOP k restriction produces some benefit, but it comes relatively late in the plan, and not many operations follow it.
The plan is easily good enough, with only small space for improvement. Q2 is one of the fastest queries of the set.
Let us now consider the execution of Q10:
{
time 1.1e-06% fanout 1 input 1 rows
time 4.4e-05% fanout 1 input 1 rows
{ hash filler
time 1.6e-05% fanout 25 input 1 rows
NATION 25 rows(.N_NATIONKEY, .N_NAME)
time 6.7e-06% fanout 0 input 25 rows
Sort hf 35 (.N_NATIONKEY) -> (.N_NAME)
}
time 1.5e-06% fanout 1 input 1 rows
{ fork
time 2.4e-06% fanout 1 input 1 rows
{ fork
time 13% fanout 5.73038e+06 input 1 rows
ORDERS 5.1e+06 rows(.O_ORDERKEY, .O_CUSTKEY)
O_ORDERDATE >= <c 1993-10-01> < <c 1994-01-01>
time 4.8% fanout 2.00042 input 5.73038e+06 rows
LINEITEM 1.1 rows(.L_EXTENDEDPRICE, .L_DISCOUNT)
inlined L_ORDERKEY = .O_ORDERKEY L_RETURNFLAG = <c R>
time 25% fanout 1 input 1.14632e+07 rows
Precode:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
CUSTOMER unq 1 rows (.C_NATIONKEY, .C_CUSTKEY)
inlined C_CUSTKEY = k_.O_CUSTKEY
hash partition+bloom by 39 (tmp)hash join merged always card 1 -> (.N_NAME)
time 0.0023% fanout 1 input 1.14632e+07 rows
Hash source 35 merged into ts 1 rows(.C_NATIONKEY) -> (.N_NAME)
time 2.3% fanout 1 input 1.14632e+07 rows
Stage 2
time 3.6% fanout 0 input 1.14632e+07 rows
Sort (q_.C_CUSTKEY, .N_NAME) -> (temp)
}
time 0.6% fanout 3.88422e+06 input 1 rows
group by read node
(.C_CUSTKEY, .N_NAME, revenue)in each partition slice
time 0.57% fanout 0 input 3.88422e+06 rows
Sort (revenue) -> (.N_NAME, .C_CUSTKEY)
}
time 6.9e-06% fanout 20 input 1 rows
top order by read (.N_NAME, revenue, .C_CUSTKEY)
time 0.00036% fanout 1 input 20 rows
CUSTOMER unq 1 rows (.C_PHONE, .C_NAME, .C_ACCTBAL, .C_ADDRESS, .C_COMMENT)
inlined C_CUSTKEY = .C_CUSTKEY
time 1.1e-06% fanout 0 input 20 rows
Select (.C_CUSTKEY, .C_NAME, revenue, .C_ACCTBAL, .N_NAME, .C_ADDRESS, .C_PHONE, .C_COMMENT)
}
2153 msec 2457% cpu, 1.71845e+07 rnd 1.67177e+08 seq 76.3221% same seg 21.1204% same pg
The plan is by index, except for the lookup of the nation name for the customer. The most selective condition is on order date, followed by the returnflag on lineitem. Getting the customer by index turns out to be better than by hash, even though almost all customers are hit; see the input cardinality above the first customer entry in the plan -- over 10M. The key point here is that only c_custkey and c_nationkey get fetched, which saves a lot of time. In fact, the c_custkey is needless since it is in any case equal to the o_custkey, but this makes little difference.
One could argue that customer should be between lineitem and orders in join order. Doing this would lose the ORDER BY on orders and lineitem, but would prevent some customer rows from being hit twice for a single order. The difference would not be large, though. For a scale-out setting, one definitely wants to have orders and lineitem without customer in between, if the former are partitioned on the same key.
The c_nationkey is next translated into an n_name by hash, and there is a partitioned GROUP BY on c_custkey. The GROUP BY is partitioned because there are many distinct c_custkey values (15M at 100G scale).
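A sketch of the partitioning ("exchange") step, with made-up buffer sizes and hash: each producer routes a tuple to a consumer partition chosen by a hash of the grouping key, so every partition aggregates a disjoint set of keys without synchronization. flush_to_consumer() is hypothetical.

#include <stdint.h>

#define N_PARTS 24          /* e.g., one partition per core thread */
#define BATCH_ROWS 1024

typedef struct
{
  int64_t keys[BATCH_ROWS];
  double  vals[BATCH_ROWS];
  int     fill;
} part_batch_t;

void flush_to_consumer (int part, part_batch_t *b);   /* hypothetical, defined elsewhere */

static void
route_tuple (part_batch_t parts[N_PARTS], int64_t grouping_key, double measure)
{
  uint64_t h = (uint64_t) grouping_key * 0x9e3779b97f4a7c15ull;
  int part = (int) (h % N_PARTS);       /* same key always goes to the same partition */
  part_batch_t *b = &parts[part];
  b->keys[b->fill] = grouping_key;
  b->vals[b->fill] = measure;
  if (++b->fill == BATCH_ROWS)
    {
      flush_to_consumer (part, b);      /* hand a full batch to the consumer thread */
      b->fill = 0;
    }
}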
The most important trick is fetching the many dependent columns of c_custkey only after the TOP k ORDER BY. The last access to customer in the plan does this and is executed on only 20 rows.
Without the TOP k trick, the plan is identical, except that the dependent columns are fetched for nearly all customers. If this is done, the run time is 16s, which is bad enough to sink the whole score.
There is another approach to the challenge of this query: if foreign keys are declared and enforced, the system knows that every order has an actually existing customer and that every customer has a country. If so, the whole GROUP BY and TOP k can be done without any reference to customer, which is a notch better still, at least for this query. In this implementation, we do not declare foreign keys, so the database must check that the customer and its country in fact exist before doing the GROUP BY. This makes the late projection trick mandatory, but it saves the expense of checking foreign keys on updates. In both cases, the optimizer must recognize that the columns to be fetched at the end (late projected) are functionally dependent on a grouping key (c_custkey).
The late projection trick is generally useful, since almost all applications aside from bulk data export have some sort of limit on result set size. A column store especially benefits from this, since some columns of a row can be read without even coming near the other ones. A row store can also benefit, in the form of decreased intermediate result size. This is especially good when returning long columns, such as text fields or blobs, on which there are most often no search conditions. If there are conditions on such columns, these will most often be implemented via a special text index and not a scan.
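In code terms, the pattern is simply that only the ordering key and a row id travel through the TOP k, and the wide columns are fetched afterwards for the handful of survivors; fetch_wide_columns() below is a hypothetical stand-in for the per-row column access.

#include <stdint.h>

typedef struct { double sort_key; int64_t row_id; } narrow_row_t;

void fetch_wide_columns (int64_t row_id);   /* hypothetical; reads the returned columns */

/* After the TOP k ORDER BY, only k rows (e.g., 20 or 100) are left, so the
   wide columns are read k times instead of millions of times. */
static void
late_project (const narrow_row_t *topk, int k)
{
  for (int i = 0; i < k; i++)
    fetch_wide_columns (topk[i].row_id);
}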
* * * * *
In the next installment we will have a look at the overall 100G single server performance. After this we will recap the tricks so far. Then it will be time to look at implications of scale out for performance and run at larger scales. After the relational ground has been covered, we can look at implications of schema-lastness, i.e., triples for this type of workload.
So, while the most salient tricks have been at least briefly mentioned, we are far from having exhausted this most foundational of database topics.
To be continued...
Paul Groth gave a talk about the stellar success of the initial term of Open PHACTS.
"The reincarnation of Steve Jobs," commented someone from the audience. "Except I am a nice guy," retorted Paul.
Commented one attendee, "The semantic web…., I just was in Boston at a semantic web meeting – so nerdy, something to make you walk out of the room… so it is a definite victory for Open PHACTS and why not also semantic web, that something based on these principles actually works."
It is a win anyhow, so I did not say anything at the meeting. So I will say something here, where I have more space as the message bears repeating.
We share part of the perception, so we hardly ever say "semantic web." The word is "linked data," and it means flexible schema and global identifiers. Flexible schema means that everything does not have to be modeled upfront. Global identifiers means that data, when transferred out of its silo of origin, remains interpretable and self-describing, so you can mix it with other data without things getting confused. "Desiloization" is a wonderful new word for describing this.
This ties right into FAIRport and FAIR data: Findable, Accessible, Interoperable, Reusable. Barend Mons talked a lot about this: open just means downloadable; fair means something you can do science with. Barend’s take is that RDF with a URI for everything is the super wire format for exchanging data. When you process it, you will diversely cook it, so an RDF store is one destination but not the only possibility. It has been said before: there is a range of choices between storing triples verbatim, and making application specific extractions, including ones with a schema, whether graph DB or relational.
Nanopublications are also moving ahead. Christine Chichester told me about pending publications involving Open PHACTS nanopublications about post-translational modification of proteins and their expression in different tissues. So there are nanopublications out there, and they can be joined, just as intended. A victory for e-science and data integration.
The Open PHACTS project is now officially extended for another two-year term, bringing the total duration to five years. The Open PHACTS Foundation exists as a legal entity and has its first members. This is meant to be a non-profit industry association for sharing of pre-competitive data and services around these between players in the pharma space, in industry as well as academia. There are press releases to follow in due time.
I am looking forward to more Open PHACTS. From the OpenLink and Virtuoso side, there are directly relevant developments that will enter production in the next few months, including query caching discussed earlier on this blog, as well as running on the TPC-H tuned analytics branch for overall better query optimization. Adaptive schema is something of evident value to Open PHACTS, as much of the integrated data comes from relational sources, so is regular enough. Therefore taking advantage of this for storage cannot hurt. We will see this still within the scope of the project extension.
Otherwise, more cooperation in formulating the queries for the business questions will also help.
All in all, Open PHACTS is the celebrated beauty queen of the whole Innovative Medicines Initiative, it would seem: superbly connected, an unparalleled logo cloud, actually working and useful data integration, delivering on time on all of the in fact very complex business questions.
Once you go to hash join, one side of the join will be materialized, which takes space, which ipso facto is bad. So the predicate games are about moving conditions so that the hash table made for the hash join will be as small as possible. Only items that may in fact be retrieved should be put in the hash table. If you know that the query deals with shipments of green parts, putting lineitems of parts that are not green in a hash table makes no sense, since only green ones are being looked for.
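As a sketch of the idea (purely illustrative, with a hypothetical is-green flag standing in for the query's restriction, and hash_insert standing in for the build step): only build-side rows that can possibly be probed are inserted, so the hash table stays proportional to the selective condition rather than to the whole table.

#include <stdint.h>
#include <stddef.h>

typedef struct { int64_t partkey; int green; } part_row_t;

/* Hypothetical single-key hash insert, in the spirit of the earlier sketches. */
void hash_insert (uint64_t *slots, size_t n_slots, int64_t key);

/* Build the hash table only from rows that satisfy the condition the
   probe side will be looking for; everything else can never match. */
static void
build_restricted (uint64_t *slots, size_t n_slots,
                  const part_row_t *parts, int n_parts)
{
  for (int i = 0; i < n_parts; i++)
    if (parts[i].green)                    /* the query only looks for green parts */
      hash_insert (slots, n_slots, parts[i].partkey);
}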
So, let's consider Q9. The query is:
SELECT nation,
o_year,
SUM(amount) AS sum_profit
FROM ( SELECT
n_name AS nation,
EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
FROM
part,
supplier,
lineitem,
partsupp,
orders,
nation
WHERE s_suppkey = l_suppkey
AND ps_suppkey = l_suppkey
AND ps_partkey = l_partkey
AND p_partkey = l_partkey
AND o_orderkey = l_orderkey
AND s_nationkey = n_nationkey
AND p_name like '%green%'
) AS profit
GROUP BY nation,
o_year
ORDER BY nation,
o_year DESC
;
The intent is to calculate profit from the sale of a type of part, broken down by year and supplier nation. All orders, lineitems, partsupps, and suppliers involving the parts of interest are visited. This is one of the longest running of the queries. The query is restricted by part only, and the condition selects 1/17 of all parts.
The execution plan is below. First the plan builds hash tables of all nations and suppliers. We expect to do frequent lookups, thus making a hash table is faster than using the index. Partsupp is the 3rd largest table in the database. It has a primary key of ps_partkey, ps_suppkey, referenced by the compound foreign key l_partkey, l_suppkey in lineitem. This could be accessed by index, but we expect to hit each partsupp row multiple times, hence hash is better. We further note that only partsupp rows whose part satisfies the condition will contribute to the result. Thus we import the join with part into the hash build. The ps_partkey is not directly joined to p_partkey, but rather the system must understand that this follows from l_partkey = ps_partkey and l_partkey = p_partkey. In this way, the hash table is 1/17th of the size it would otherwise be, which is a crucial gain.
Looking further into the plan, we note a scan of lineitem followed by a hash join with part. Restricting the build of the partsupp hash would have the same effect, hence part is here used twice while it occurs only once in the query. This is deliberate, since the selective hash join with part restricts lineitem faster than the more complex hash join with a 2-part key (l_partkey, l_suppkey). Both joins perform the identical restriction, but doing the part join first is faster since it becomes a single-key, invisible hash join, merged into the lineitem scan, done before even accessing l_suppkey and the other columns.
{
time 3.9e-06% fanout 1 input 1 rows
time 4.7e-05% fanout 1 input 1 rows
{ hash filler
time 3.6e-05% fanout 25 input 1 rows
NATION 25 rows(.N_NATIONKEY, nation)
time 8.8e-06% fanout 0 input 25 rows
Sort hf 35 (.N_NATIONKEY) -> (nation)
}
time 0.16% fanout 1 input 1 rows
{ hash filler
time 0.011% fanout 1e+06 input 1 rows
SUPPLIER 1e+06 rows(.S_SUPPKEY, .S_NATIONKEY)
time 0.03% fanout 0 input 1e+06 rows
Sort hf 49 (.S_SUPPKEY) -> (.S_NATIONKEY)
}
time 0.57% fanout 1 input 1 rows
{ hash filler
Subquery 58
{
time 1.6% fanout 1.17076e+06 input 1 rows
PART 1.2e+06 rows(t1.P_PARTKEY)
P_NAME LIKE <c %green%> LIKE <c >
time 1.1% fanout 4 input 1.17076e+06 rows
PARTSUPP 3.9 rows(t4.PS_SUPPKEY, t4.PS_PARTKEY, t4.PS_SUPPLYCOST)
inlined PS_PARTKEY = t1.P_PARTKEY
After code:
0: t4.PS_SUPPKEY := := artm t4.PS_SUPPKEY
4: t4.PS_PARTKEY := := artm t4.PS_PARTKEY
8: t1.P_PARTKEY := := artm t1.P_PARTKEY
12: t4.PS_SUPPLYCOST := := artm t4.PS_SUPPLYCOST
16: BReturn 0
time 0.33% fanout 0 input 4.68305e+06 rows
Sort hf 82 (t4.PS_SUPPKEY, t4.PS_PARTKEY) -> (t1.P_PARTKEY, t4.PS_SUPPLYCOST)
}
}
time 0.18% fanout 1 input 1 rows
{ hash filler
time 1.6% fanout 1.17076e+06 input 1 rows
PART 1.2e+06 rows(.P_PARTKEY)
P_NAME LIKE <c %green%> LIKE <c >
time 0.017% fanout 0 input 1.17076e+06 rows
Sort hf 101 (.P_PARTKEY)
}
time 5.1e-06% fanout 1 input 1 rows
{ fork
time 4.1e-06% fanout 1 input 1 rows
{ fork
time 59% fanout 3.51125e+07 input 1 rows
LINEITEM 6e+08 rows(.L_PARTKEY, .L_ORDERKEY, .L_SUPPKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_QUANTITY)
hash partition+bloom by 108 (tmp)hash join merged always card 0.058 -> ()
hash partition+bloom by 56 (tmp)hash join merged always card 1 -> (.S_NATIONKEY)
time 0.18% fanout 1 input 3.51125e+07 rows
Precode:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
Hash source 101 merged into ts 0.058 rows(.L_PARTKEY) -> ()
time 17% fanout 1 input 3.51125e+07 rows
Hash source 82 0.057 rows(.L_SUPPKEY, .L_PARTKEY) -> ( <none> , .PS_SUPPLYCOST)
time 6.2% fanout 1 input 3.51125e+07 rows
Precode:
0: temp := artm .PS_SUPPLYCOST * .L_QUANTITY
4: temp := artm temp - temp
8: BReturn 0
ORDERS unq 1 rows (.O_ORDERDATE)
inlined O_ORDERKEY = k_.L_ORDERKEY
time 0.0055% fanout 1 input 3.51125e+07 rows
Hash source 49 merged into ts 1 rows(k_.L_SUPPKEY) -> (.S_NATIONKEY)
time 3.5% fanout 1 input 3.51125e+07 rows
Hash source 35 1 rows(k_.S_NATIONKEY) -> (nation)
time 8.8% fanout 0 input 3.51125e+07 rows
Precode:
0: o_year := Call year (.O_ORDERDATE)
5: BReturn 0
Sort (nation, o_year) -> (temp)
}
time 4.7e-05% fanout 175 input 1 rows
group by read node
(nation, o_year, sum_profit)
time 0.00028% fanout 0 input 175 rows
Sort (nation, o_year) -> (sum_profit)
}
time 2.2e-05% fanout 175 input 1 rows
Key from temp (nation, o_year, sum_profit)
time 1.6e-06% fanout 0 input 175 rows
Select (nation, o_year, sum_profit)
}
6114 msec 1855% cpu, 3.62624e+07 rnd 6.44384e+08 seq 99.6068% same seg 0.357328% same pg
6.1s is a good score for this query. When executing the same in 5 parallel invocations, the fastest ends in 13.7s and the slowest in 27.6s. For five concurrent executions, the peak transient memory utilization is 4.7 GB for the hash tables, which is very reasonable.
* * * * *
Let us next consider Q17.
SELECT
SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
lineitem,
part
WHERE
p_partkey = l_partkey
AND p_brand = 'Brand#23'
AND p_container = 'MED BOX'
AND l_quantity
< (
SELECT
2e-1 * AVG(l_quantity)
FROM
lineitem
WHERE
l_partkey = p_partkey
)
Deceptively simple? This calculates the total value of small orders (below 1/5 of average quantity for the part) for all parts of a given brand with a specific container.

If there is an index on l_partkey, the plan is easy enough: take the parts, look up the average quantity for each, then recheck lineitem and add up the small lineitems. This takes about 1s. But we do not want indices for this workload.

If we made a hash from l_partkey to l_quantity for all lineitems, we could run out of space, and this would take so long the race would be automatically lost on this point alone. The trick is to import the restriction on l_partkey into the hash build. This gives us a plan that does a scan of lineitem twice, doing a very selective hash join (few parts). There is a lookup for the average for each lineitem with the part. The average is calculated potentially several times.
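A sketch of what the restricted hash build amounts to: only the l_partkey and l_quantity of lineitems whose part matches the brand and container condition go into the hash. This is illustrative, not the literal internal form.

SELECT
 l_partkey,
 l_quantity
FROM
 lineitem,
 part
WHERE
 l_partkey = p_partkey
 AND p_brand = 'Brand#23'
 AND p_container = 'MED BOX'

In the plan below, this is the subquery inside the second hash filler: it scans lineitem, probes the part hash, and only then feeds the surviving rows into the hash table.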
The below plan is workable but better is possible: We notice that the very selective join need be done just once; it is cheaper to remember the result than to do it twice, and the result is not large. The other trick is that the correlated subquery can be rewritten as
SELECT
...
FROM
lineitem,
part,
( SELECT
l_partkey,
0.2 * AVG (l_quantity) AS qty
FROM
lineitem,
part
...
) f
WHERE
l_partkey = f.l_partkey
...
In this form, one can put the entire derived table f on the build side of a hash join. In this way, the average is never done more than once per part.
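Filling in the parts the ellipses leave out, the rewritten query could read as below. This is an illustrative decorrelation only; it is not necessarily the exact text the optimizer produces, and the plan that follows still corresponds to the workable form discussed above.

SELECT
 SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
 lineitem,
 part,
 ( SELECT
     l_partkey,
     0.2 * AVG (l_quantity) AS qty
   FROM
     lineitem,
     part
   WHERE
     l_partkey = p_partkey
     AND p_brand = 'Brand#23'
     AND p_container = 'MED BOX'
   GROUP BY l_partkey
 ) f
WHERE
 p_partkey = l_partkey
 AND p_brand = 'Brand#23'
 AND p_container = 'MED BOX'
 AND l_partkey = f.l_partkey
 AND l_quantity < f.qty

With this, the derived table f is built once as a hash table keyed on l_partkey, and each lineitem of a qualifying part probes it exactly once.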
{
time 7.9e-06% fanout 1 input 1 rows
time 0.0031% fanout 1 input 1 rows
{ hash filler
time 0.27% fanout 20031 input 1 rows
PART 2e+04 rows(.P_PARTKEY)
P_BRAND = <c Brand#23> , P_CONTAINER = <c MED BOX>
time 0.00047% fanout 0 input 20031 rows
Sort hf 34 (.P_PARTKEY)
}
time 0.1% fanout 1 input 1 rows
{ hash filler
Subquery 40
{
time 46% fanout 600982 input 1 rows
LINEITEM 6e+08 rows(t4.L_PARTKEY, t4.L_QUANTITY)
hash partition+bloom by 38 (tmp)hash join merged always card 0.001 -> ()
time 0.0042% fanout 1 input 600982 rows
Hash source 34 merged into ts not partitionable 0.001 rows(t4.L_PARTKEY) -> ()
After code:
0: t4.L_PARTKEY := := artm t4.L_PARTKEY
4: t4.L_QUANTITY := := artm t4.L_QUANTITY
8: BReturn 0
time 0.059% fanout 0 input 600982 rows
Sort hf 62 (t4.L_PARTKEY) -> (t4.L_QUANTITY)
}
}
time 6.8e-05% fanout 1 input 1 rows
{ fork
time 46% fanout 600982 input 1 rows
LINEITEM 6e+08 rows(.L_PARTKEY, .L_QUANTITY, .L_EXTENDEDPRICE)
hash partition+bloom by 38 (tmp)hash join merged always card 0.00052 -> ()
time 0.00021% fanout 1 input 600982 rows
Hash source 34 merged into ts 0.00052 rows(.L_PARTKEY) -> ()
Precode:
0: .P_PARTKEY := := artm .L_PARTKEY
4: BReturn 0
END Node
After test:
0: {
time 0.038% fanout 1 input 600982 rows
time 0.17% fanout 1 input 600982 rows
{ fork
time 6.8% fanout 0 input 600982 rows
Hash source 62 not partitionable 0.03 rows(k_.P_PARTKEY) -> (.L_QUANTITY)
After code:
0: sum sum.L_QUANTITYset no set_ctr
5: sum count 1 set no set_ctr
10: BReturn 0
}
After code:
0: temp := artm sum / count
4: temp := artm 0.2 * temp
8: aggregate := := artm temp
12: BReturn 0
time 0.042% fanout 0 input 600982 rows
Subquery Select(aggregate)
}
8: if (.L_QUANTITY < scalar) then 12 else 13 unkn 13
12: BReturn 1
13: BReturn 0
After code:
0: sum sum.L_EXTENDEDPRICE
5: BReturn 0
}
After code:
0: avg_yearly := artm sum / 7
4: BReturn 0
time 4.6e-06% fanout 0 input 1 rows
Select (avg_yearly)
}
2695 msec 1996% cpu, 3 rnd 1.18242e+09 seq 0% same seg 0% same pg
2.7s is tolerable, but if this drags down the overall score by too much, we know that a 2+x improvement is readily available. Playing the rest of the tricks would result in the hash plan almost catching up with the 1s execution time of the index-based plan.
* * * * *
Q20 is not very long-running, but it is perhaps the hardest of the lot to optimize. As usual, failure to recognize its most salient traps will automatically lose the race, so pay attention.
SELECT TOP 100
s_name,
s_address
FROM
supplier,
nation
WHERE
s_suppkey IN
( SELECT
ps_suppkey
FROM
partsupp
WHERE
ps_partkey IN
( SELECT
p_partkey
FROM
part
WHERE
p_name LIKE 'forest%'
)
AND ps_availqty >
( SELECT
0.5 * SUM(l_quantity)
FROM
lineitem
WHERE
l_partkey = ps_partkey
AND l_suppkey = ps_suppkey
AND l_shipdate >= CAST ('1994-01-01' AS DATE)
AND l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
)
)
AND s_nationkey = n_nationkey
AND n_name = 'CANADA'
ORDER BY s_name
This identifies suppliers that have parts in stock in excess of half a year's shipments of said part.

The use of IN to denote a join is the first catch. The second is joining to lineitem by hash without building an overly large hash table. We know that IN becomes EXISTS, which in turn can become a join, as follows:
SELECT
l_suppkey
FROM
lineitem
WHERE
l_partkey IN
( SELECT
p_partkey
FROM
part
WHERE
p_name LIKE 'forest%'
)
;
-- is --
SELECT
l_suppkey
FROM
lineitem
WHERE EXISTS
( SELECT
p_partkey
FROM
part
WHERE
p_partkey = l_partkey
AND p_name LIKE 'forest%')
;
-- is --
SELECT
l_suppkey
FROM
lineitem,
( SELECT
DISTINCT p_partkey
FROM
part
WHERE
p_name LIKE 'forest%') f
WHERE
l_partkey = f.p_partkey
;
But since p_partkey is unique, the DISTINCT drops off, and we have:
SELECT
 l_suppkey
FROM
 lineitem,
 part
WHERE
 p_name LIKE 'forest%'
 AND l_partkey = p_partkey
;
You see, the innermost IN with the ps_partkey goes through all these changes and just becomes a join. The outermost IN stays as a distinct derived table, since ps_suppkey is not unique, and the meaning of IN is not to return a given supplier more than once.

The derived table is flattened and the DISTINCT is done partitioned; hence the stage node in front of the distinct. A DISTINCT can be multithreaded if each thread gets a specific subset of all the keys. The stage node is an exchange of tuples between several threads. Each thread then does a TOP k sort. The TOP k trick we saw in Q18 is used, but does not contribute much here.
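Putting these rewrites together, the shape that gets executed is roughly the following; this is an illustrative sketch, not the literal internal form. The inner IN has become a join with part, and the outer IN a DISTINCT derived table.

SELECT TOP 100
 s_name,
 s_address
FROM
 supplier,
 nation,
 ( SELECT DISTINCT
     ps_suppkey
   FROM
     partsupp,
     part
   WHERE
     ps_partkey = p_partkey
     AND p_name LIKE 'forest%'
     AND ps_availqty >
       ( SELECT
           0.5 * SUM(l_quantity)
         FROM
           lineitem
         WHERE
           l_partkey = ps_partkey
           AND l_suppkey = ps_suppkey
           AND l_shipdate >= CAST ('1994-01-01' AS DATE)
           AND l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
       )
 ) f
WHERE
 s_suppkey = f.ps_suppkey
 AND s_nationkey = n_nationkey
 AND n_name = 'CANADA'
ORDER BY s_name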
{
time 8.2e-06% fanout 1 input 1 rows
time 0.00017% fanout 1 input 1 rows
{ hash filler
time 6.1e-05% fanout 1 input 1 rows
NATION 1 rows(.N_NATIONKEY)
N_NAME = <c CANADA>
time 1.2e-05% fanout 0 input 1 rows
Sort hf 34 (.N_NATIONKEY)
}
time 0.073% fanout 1 input 1 rows
{ hash filler
time 4.1% fanout 240672 input 1 rows
PART 2.4e+05 rows(t74.P_PARTKEY)
P_NAME LIKE <c forest%> LIKE <c >
time 0.011% fanout 0 input 240672 rows
Sort hf 47 (t74.P_PARTKEY)
}
time 0.69% fanout 1 input 1 rows
{ hash filler
Subquery 56
{
time 42% fanout 1.09657e+06 input 1 rows
LINEITEM 9.1e+07 rows(t76.L_PARTKEY, t76.L_SUPPKEY, t76.L_QUANTITY)
L_SHIPDATE >= <c 1994-01-01> < <c 1995-01-01>
hash partition+bloom by 54 (tmp)hash join merged always card 0.012 -> ()
time 0.022% fanout 1 input 1.09657e+06 rows
Hash source 47 merged into ts not partitionable 0.012 rows(t76.L_PARTKEY) -> ()
After code:
0: t76.L_PARTKEY := := artm t76.L_PARTKEY
4: t76.L_SUPPKEY := := artm t76.L_SUPPKEY
8: t76.L_QUANTITY := := artm t76.L_QUANTITY
12: BReturn 0
time 0.22% fanout 0 input 1.09657e+06 rows
Sort hf 80 (t76.L_PARTKEY, t76.L_SUPPKEY) -> (t76.L_QUANTITY)
}
}
time 2.1e-05% fanout 1 input 1 rows
time 3.2e-05% fanout 1 input 1 rows
{ fork
time 5.3% fanout 240672 input 1 rows
PART 2.4e+05 rows(t6.P_PARTKEY)
P_NAME LIKE <c forest%> LIKE <c >
time 1.9% fanout 4 input 240672 rows
PARTSUPP 1.2 rows(t4.PS_AVAILQTY, t4.PS_PARTKEY, t4.PS_SUPPKEY)
inlined PS_PARTKEY = t6.P_PARTKEY
time 16% fanout 0.680447 input 962688 rows
END Node
After test:
0: {
time 0.08% fanout 1 input 962688 rows
time 9.4% fanout 1 input 962688 rows
{ fork
time 3.6% fanout 0 input 962688 rows
Hash source 80 0.013 rows(k_t4.PS_PARTKEY, k_t4.PS_SUPPKEY) -> (t8.L_QUANTITY)
After code:
0: sum sumt8.L_QUANTITYset no set_ctr
5: BReturn 0
}
After code:
0: temp := artm 0.5 * sum
4: aggregate := := artm temp
8: BReturn 0
time 0.85% fanout 0 input 962688 rows
Subquery Select(aggregate)
}
8: if (t4.PS_AVAILQTY > scalar) then 12 else 13 unkn 13
12: BReturn 1
13: BReturn 0
time 1% fanout 1 input 655058 rows
Stage 2
time 0.071% fanout 1 input 655058 rows
Distinct (q_t4.PS_SUPPKEY)
After code:
0: PS_SUPPKEY := := artm t4.PS_SUPPKEY
4: BReturn 0
time 0.016% fanout 1 input 655058 rows
Subquery Select(PS_SUPPKEY)
time 3.2% fanout 0.0112845 input 655058 rows
SUPPLIER unq 0.075 rows (.S_NAME, .S_NATIONKEY, .S_ADDRESS)
inlined S_SUPPKEY = PS_SUPPKEY
hash partition+bloom by 38 (tmp)hash join merged always card 0.04 -> ()
top k on S_NAME
time 0.0012% fanout 1 input 7392 rows
Hash source 34 merged into ts 0.04 rows(.S_NATIONKEY) -> ()
time 0.074% fanout 0 input 7392 rows
Sort (.S_NAME) -> (.S_ADDRESS)
}
time 0.00013% fanout 100 input 1 rows
top order by read (.S_NAME, .S_ADDRESS)
time 5e-06% fanout 0 input 100 rows
Select (.S_NAME, .S_ADDRESS)
}
1777 msec 1355% cpu, 894483 rnd 6.39422e+08 seq 79.1214% same seg 19.3093% same pg
1.8s is sufficient, and in the ballpark with VectorWise. Some further gain is possible, as the lineitem hash table can also be restricted by supplier; after all, only 1/25 of all suppliers are in the end considered. Further simplifications are possible. Another 20% of time could be saved. The tricks are however quite complex and specific, and there are easier gains to be had -- for example, in reusing intermediates in Q17 and Q15.
The next installment will discuss late projection and some miscellaneous tricks not mentioned so far. After this, we are ready to take an initial look at the performance of the system as a whole.
To be continued...
The clear take-home from London and Brussels alike is that these events have full days and 4 or more talks an hour. It is not quite TV commercial spots yet but it is going in this direction.
If you say something complex, little will get across unless the audience already knows what you will be saying.
I had a set of slides from Jens Lehmann, the GeoKnow project coordinator, for whom I was standing in. Now these are a fine rendition of the description of work. What is wrong with partners, work packages, objectives, etc? Nothing, except everybody has them.
I recall the old story about the journalist and the Zen master: The Zen master repeatedly advises the reporter to cut the story in half. We get the same from PR professionals, "If it is short, they have at least thought about what should go in there," said one recently, talking of pitches and messages. The other advice was to use pictures. And to have a personal dimension to it.
Enter "Ms. Globe" and "Mr. Cube". Frans Knibbe of Geodan gave the Linked Geospatial Data 2014 workshop's most memorable talk entitled "Linked Data and Geoinformatics - a love story" (pdf) about the excitement and the pitfalls of the burgeoning courtship of Ms. Globe (geoinformatics) and Mr. Cube (semantic technology). They get to talking, later Ms. Globe thinks to herself... "Desiloisazation, explicit semantics, integrated metadata..." Mr. Cube, young upstart now approaching a more experienced and sophisticated lady, dreams of finally making an entry into adult society, "critical mass, global scope, relevant applications..." There is a vibration in the air.
So, with Frans Knibbe's gracious permission I borrowed the storyline and some of the pictures.
Mr. Cube is not Ms. Globe's first lover, though; there is also rich and worldly Mr. Table. How will Mr. Cube prove himself? The eternal question... Well, not by moping around, not by wise-cracking about semantics, no. By boldly setting out upon a journey to fetch the Golden Fleece from beyond the crashing rocks. "Column store, vectored execution, scale out, data clustering, adaptive schema..." he affirms, with growing confidence.
This is where the story stands, right now. Virtuoso runs circles around PostGIS doing aggregations and lookups on geometries in a map-scrolling scenario (GeoKnow's GeoBenchLab). Virtuoso SPARQL outperforms PostGIS SQL against planet-scale OpenStreetMap; Virtuoso SQL goes 5-10x faster still.
Mr. Cube is fast on the draw, but still some corners can be smoothed out.
Later in GeoKnow, there will be still more speed but also near parity between SQL and SPARQL via taking advantage of data regularity in guiding physical storage. If it is big, it is bound to have repeating structure.
The love story grows more real by the day. To be consummated still within GeoKnow.
Talking of databases has the great advantage that this has been a performance game from the start. There are few people who need convincing about the desirability of performance, as this also makes for lower cost and more flexibility on the application side.
But this is not all there is to it.
In Brussels, the audience was E-science-oriented (Earth observation). In science, it is understood that qualitative aspects can be even more crucial. I told the story about an E-science-oriented workshop I attended in America years ago. The practitioners, from high energy physics to life sciences to climate, had invariably come across the need for self-description of data and for schema-last. This need was essentially never met with RDF, except in some life science cases. Rather, we had one-off schemes, ranging from key-value pairs to putting the table name in a column of the same table to preserve the origin across data export.
Explicit semantics and integrated metadata are important, Ms. Globe knows, but she cannot sacrifice operational capacity for this. So it is more than a DBMS or even data model choice -- there must be a solid tool chain for data integration and visualization. GeoKnow provides many tools in this space.
Some of these, such as the LIMES entity matching framework (pdf) are probably close to the best there is. For other parts, the SQL-based products with hundreds of person years invested in user interaction are simply unbeatable.
In these cases, the world can continue to talk SQL. If the regular part of the data is in fact tables already, so much the better. You connect to Virtuoso via SQL, just like to PostGIS or Oracle Spatial, and talk SQL MM. The triples, in the sense of flexible annotation and integrated metadata, stay there; you just do not see them if you do not want them.
There are possibilities all right. In the coming months I will showcase some of the progress, starting with a detailed look at the OpenStreetMap experiments we have made in GeoKnow.
Now, OKFN is a party in the LOD2 FP7 project, so I have over the years met people from there on and off. In LOD2, OKFN is praised to the skies for its visibility and influence and outreach and sometimes, in passing, critiqued for not publishing enough RDF, let alone five star linked data.
As it happens, CSV rules, and even the W3C will, it appears, undertake to standardize a CSV-to-RDF mapping. As far as I am concerned, as long as there is no alignment of identifiers or vocabulary, whether a thing is CSV or exactly equivalent RDF, there is no great difference, except that CSV is smaller and loads into Excel.
For OKFN, which has a mission of opening data, insisting on any particular format would just hinder the cause.
What do we learn from this? OKFN is praised not only for government relations but also for developer friendliness. Lobbying for open data is something I can understand, but how do you do developer relations? This is not like talking to customers, where the customer wants to do something and it is usually possible to give some kind of advice or recommendation on how they can use our technology for the purpose.
Are JSON and Mongo DB the key? A well renowned database guy once said that to be with the times, JSON is your data model, Hadoop your file system, Mongo DB your database, and JavaScript your language, and failing this, you are an old fart, a legacy suit, well, some uncool fossil.
The key is not limited to JSON. More generally, it is zero time to some result and no learning curve. Some people will sacrifice almost anything for this, such as the possibility of doing arbitrary joins. People will even write code, even lots of it, if it only happens to be in their framework of choice.
Phil again deplored the early fiasco of RDF messaging. "Triples are not so difficult. It is not true that RDF has a very steep learning curve." I would have to agree. The earlier gaffes of the RDF/XML syntax and the infamous semantic web layer cake diagram now lie buried and unlamented; let them be.
Generating user experience from data or schema is an old mirage that has never really worked out. The imagined gain from eliminating application writing has however continued to fascinate IT minds and attempts in this direction have never really ceased. The lesson of history seems to be that coding is not to be eliminated, but that it should have fast turnaround time and immediately visible results.
And since this is the age of data, databases should follow this lead. Schema-last is a good point, maybe adding JSON alongside XML as an object type in RDF might not be so bad. There are already XML functions, so why not the analog for JSON? Just don't mention XML to the JSON folks...
How does this relate to OKFN? Well, in the first instance this is the cultural impression I received from the meetup, but in a broader sense these factors are critical to realizing the full potential of OKFN's successes so far. OKFN is a data opening advocacy group; it is not a domain-specific think tank or special interest group. The data owners and their consultants will do analytics and even data integration if they see enough benefit in this, all in the established ways. However, the widespread opening of data does create possibilities that did not exist before. Actual benefits depend in great part on constant lowering of access barriers, and on a commitment by publishers to keep the data up to date, so that developers can build more than just a one-off mashup.
True, there are government users of open data, since there is a productivity gain in already having the neighboring department's data opened to a point; one no longer has to go through red tape to gain access to it.
For an application ecosystem to keep growing on the base of tens of thousands of very heterogeneous datasets coming into the open, continuing to lower barriers is key. This is a very different task from making faster and faster databases or of optimizing a particular business process, and it demands different thinking.
Andy was later on a panel where Phil Archer asked him whether SPARQL was slow by nature or whether this was a matter of bad implementations. Andy answered approximately as follows: "If you allow for arbitrary ad hoc structure, you will always pay something for this. However, if you tell the engine what your data is like, it is no different from executing SQL." This is essentially the gist of our conversation. Most likely we will make this happen via adaptive schema for the regular part and exceptions as quads.
Later I talked with Phil about the "SPARQL is slow" meme. The fact is that Virtuoso SPARQL will outperform or match PostGIS SQL for Geospatial lookups against the OpenStreetMap dataset. Virtuoso SQL will win by a factor of 5 to 10. Still, the SPARQL is slow meme is not entirely without a basis in fact. I would say that the really blatant cases that give SPARQL a bad name are query optimization problems. With 50 triple patterns in a query there are 50-factorial ways of getting a bad plan. This is where the catastrophic failures of 100+ times worse than SQL come from. The regular penalty of doing triples vs tables is somewhere between 2.5 (Star Schema Benchmark) and 10 (lookups with many literals), quite acceptable for many applications. Some really bad cases can occur with regular expressions on URI strings or literals, but then, if this is the core of the application, it should use a different data model or an n-gram index.
The solutions, including more dependable query plan choice, will flow from adaptive schema which essentially reduces RDF back into relational, however without forcing schema first and with accommodation for exceptions in the data.
Phil noted here that there already exist many (so far, proprietary) ways of describing the shape of a graph. He said there would be a W3C activity for converging these. If so, a vocabulary that can express relationships, the types of related entities, their cardinalities, etc., comes close to a SQL schema and its statistics. Such a thing can be the output of data analysis, or the input to a query optimizer or storage engine, for using a schema where one in fact exists. Like this, there is no reason why things would be less predictable than with SQL. The idea of a re-convergence of data models is definitely in the air; this is in no sense limited to us.
Reporting on each talk and the many highly diverse topics addressed is beyond the scope of this article; for this you can go to the program and the slides that will be online. Instead, I will talk about questions that to me seemed to be in the air, and about some conversations I had with the relevant people.
The trend in events like this is towards shorter and shorter talks and more and more interaction. In this workshop, talks were given in series of three, with all questions at the end and all the presenters on stage. This is not a bad idea, since we get a panel-like effect where many presenters can address the same question. If the subject matter allows, a panel is my preferred format.
Geospatial data tends to be exposed via web services, e.g., WFS (Web Feature Service). This allows item retrieval on a lookup basis and some predefined filtering, transformation, and content negotiation. Capabilities vary; OGC now has WFS 2.0, and there are open source implementations that do a fair job of providing the functionality.
Of course, a real query language is much more expressive, but a service API is more scalable, as people say. What they mean is that an API is more predictable. For pretty much any complex data task, a query language is near-infinitely more efficient than going back-and-forth, often on a wide area network, via an API. So, as Andreas Harth put it: for data publishers, make an API; an open SPARQL endpoint is too "brave" [Andreas' word, with the meaning of foolhardy]. When you analyze, he continued, you load the data into an endpoint, but you use your own. Any quality-of-service terms must be formulated with respect to a fixed workload; this is not meaningful with ad hoc queries in an expressive language. Things like anytime semantics (return whatever is found within a time limit) are only good for a first interactive look, not for applications.
Should the application go to the data or the reverse? Some data is big and moving it is not self-evident. A culture of datasets being hosted on a cloud may be forming. Of course some linked data like DBpedia has for a long time been available as Amazon images. Recently, SindiceTech has made a similar packaging of Freebase. The data of interest here is larger and its target audience is more specific, on the e-science side.
How should geometries be modeled? I met GeoSPARQL, and the SQL MM on which it is based, with a sense of relief, as these are reasonable things that can be efficiently implemented. There are proposals where points have URIs, linestrings are ordered sets of points, and collections are actual trees with RDF subjects as nodes. As a standard, such a thing is beyond horrible, as it hits all the RDF penalties and overheads full force, and promises easily 10x worse space consumption and 100x worse run times compared to the sweetly reasonable GeoSPARQL. One presenter said that cases of actually hanging attributes off points of complex geometries had been heard of but were, in his words, anecdotal. He posed a question to the audience about use cases where points in fact needed separately addressable identities. Several cases did emerge, involving, for example, different measurement certainties for different points on a trajectory trace obtained by radar. Applications that need data of this sort will perforce be very domain specific. OpenStreetMap (OSM) itself is a bit like this, but there the points that have individual identity also have predominantly non-geometry attributes and stand for actually-distinct entities. OSM being a practical project, these are then again collapsed into linestrings for cases where this is more efficient. The OGC data types themselves have up to 4 dimensions, of which the 4th could be used as an identifier of a point in the event this really were needed. If so, this would likely be empty for most points and would compress away if the data representation were done right.
For data publishing, Andreas proposed to give OGC geometries URIs, i.e., the borders of a country can be more or less precisely modeled, and the large polygon may have different versions and provenances. This is reasonable enough, as long as the geometries are big. For applications, one will then collapse the 1:n between entity and its geometry into a 1:1. In the end, when you make an application, even an RDF one, you do not just throw all the data in a bucket and write queries against that. Some alignment and transformation is generally involved.
From the TPC-H specification:
SELECT TOP 100
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice,
SUM ( l_quantity )
FROM customer,
orders,
lineitem
WHERE o_orderkey
IN
(
SELECT l_orderkey
FROM lineitem
GROUP BY l_orderkey
HAVING
SUM ( l_quantity ) > 312
)
AND c_custkey = o_custkey
AND o_orderkey = l_orderkey
GROUP BY c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice
ORDER BY o_totalprice DESC,
o_orderdate
The intent of the query is to return order and customer information for cases where an order involves a large quantity of items, with highest-value orders first.

We note that the only restriction in the query is the one on the SUM of l_quantity in the IN subquery. Everything else is a full scan or a JOIN on a foreign key.
Now, the first query optimization rule of thumb could be summarized as start from the small. Small here means something that is restricted; it does not mean small table. Smallest is the one from which the highest percentage is dropped via a condition that does not depend on other tables.
The next rule of thumb is to try starting from the large, if the large has a restricting join; for example, scan all the lineitems and hash join to parts that are green and of a given brand. In this case, the idea is to make a hash table from the small side and sequentially scan the large side, dropping everything that does not match something in the hash table.
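As a concrete illustration of this rule of thumb (the literal values are just examples, not from any particular TPC-H query):

SELECT
 SUM ( l_extendedprice )
FROM
 lineitem,
 part
WHERE
 l_partkey = p_partkey
 AND p_name LIKE '%green%'
 AND p_brand = 'Brand#23'

Here part is the small, restricted side that becomes the hash build; lineitem is scanned once and probes the hash, dropping everything that does not match.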
In Q18, the only restriction is on orders, via a join on lineitem. So, the IN subquery can be flattened, so as to read like --
SELECT ...
FROM ( SELECT l_orderkey,
SUM ( l_quantity )
FROM lineitem
GROUP BY l_orderkey
HAVING
SUM ( l_quantity ) > 312
) f,
orders,
customer,
lineitem
WHERE f.l_orderkey = o_orderkey ....
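Spelled out, the flattened form might read as follows. This is an illustrative expansion of the skeleton above, not necessarily the exact text the optimizer uses internally.

SELECT TOP 100
 c_name,
 c_custkey,
 o_orderkey,
 o_orderdate,
 o_totalprice,
 SUM ( l_quantity )
FROM ( SELECT l_orderkey
       FROM lineitem
       GROUP BY l_orderkey
       HAVING SUM ( l_quantity ) > 312
     ) f,
     orders,
     customer,
     lineitem
WHERE f.l_orderkey = o_orderkey
 AND c_custkey = o_custkey
 AND o_orderkey = l_orderkey
GROUP BY c_name,
 c_custkey,
 o_orderkey,
 o_orderdate,
 o_totalprice
ORDER BY o_totalprice DESC,
 o_orderdate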
The above (left to right) is the best JOIN order for this type of plan. We start from the restriction, and for all the rest the JOIN is foreign key to primary key, sometimes n:1 (orders to customer), sometimes 1:n (orders to lineitem). A 1:n is usually best by index; an n:1 can be better by hash if there are enough tuples on the n side to make it worthwhile to build the hash table.
We note that the first GROUP BY makes a very large number of groups, e.g., 150M at 100 GB scale. We also note that if lineitem is ordered so that the lineitems of a single order are together, the GROUP BY is ordered. In other words, once you have seen a specific value of l_orderkey change to the next, you will not see the old value again. In this way, the groups do not have to be remembered for all time. The GROUP BY produces a stream of results as the scan of lineitem proceeds.

Considering vectored execution, the GROUP BY does remember a bunch of groups, up to a vector size's worth, so that output from the GROUP BY is done in large enough batches, not a tuple at a time.

Considering parallelization, the scan of lineitem must be split in such a way that all lineitems with the same l_orderkey get processed by the same thread. If this is the case, all threads will produce an independent stream of results that is guaranteed to need no merge with the output of another thread.
So, we can try this:
{
time 6e-06% fanout 1 input 1 rows
time 4.5% fanout 1 input 1 rows
{ hash filler
-- Make a hash table from c_custkey to c_name
time 0.99% fanout 1.5e+07 input 1 rows
CUSTOMER 1.5e+07 rows(.C_CUSTKEY, .C_NAME)
time 0.81% fanout 0 input 1.5e+07 rows
Sort hf 35 (.C_CUSTKEY) -> (.C_NAME)
}
time 2.2e-05% fanout 1 input 1 rows
time 1.6e-05% fanout 1 input 1 rows
{ fork
time 5.2e-06% fanout 1 input 1 rows
{ fork
-- Scan lineitem
time 10% fanout 6.00038e+08 input 1 rows
LINEITEM 6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)
-- Ordered GROUP BY (streaming with duplicates)
time 73% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)
-- The ordered aggregation above emits a batch of results every so often, having accumulated 20K or so groups (DISTINCT l_orderkeys).
-- The operator below reads the batch and sends it onward, clearing the GROUP BY hash table for the next batch.
time 10% fanout 21231.4 input 7065 rows
group by read node
(t5.L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
After code:
0: L_ORDERKEY := := artm t5.L_ORDERKEY
4: BReturn 0
-- This marks the end of the flattened IN subquery. 1063 out of 150M groups survive the test on the SUM of l_quantity.
-- The main difficulty of Q18 is guessing that this condition is this selective.
time 0.0013% fanout 1 input 1063 rows
Subquery Select(L_ORDERKEY)
time 0.058% fanout 1 input 1063 rows
ORDERS unq 0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
inlined O_ORDERKEY = L_ORDERKEY
hash partition+bloom by 42 (tmp)hash join merged always card 0.99 -> (.C_NAME)
time 0.0029% fanout 1 input 1063 rows
Hash source 35 merged into ts 0.99 rows(.O_CUSTKEY) -> (.C_NAME)
After code:
0: .C_CUSTKEY := := artm .O_CUSTKEY
4: BReturn 0
time 0.018% fanout 7 input 1063 rows
LINEITEM 4.3 rows(.L_QUANTITY)
inlined L_ORDERKEY = .O_ORDERKEY
time 0.011% fanout 0 input 7441 rows
Sort (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
}
time 0.00026% fanout 1063 input 1 rows
group by read node
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.00061% fanout 0 input 1063 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
}
time 1.7e-05% fanout 100 input 1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time 1.2e-06% fanout 0 input 100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}
6351 msec 1470% cpu, 2151 rnd 6.14898e+08 seq 0.185874% same seg 1.57993% same pg
What is wrong with this? The result is not bad, in the ballpark with VectorWise published results (4.9s on a slightly faster box), but better is possible. We note that there is a hash join from orders to customer. Only 1K customers of 15M get hit. The whole hash table of 15M entries is built in vain. Let's cheat and declare the join to be by index. Cheats like this are not allowed in an official run, but here we are just looking. So we change the mention of the customer table in the FROM clause from FROM ... customer, ... to FROM ... customer TABLE OPTION (loop), ...
{
time 1.4e-06% fanout 1 input 1 rows
time 9e-07% fanout 1 input 1 rows
-- Here was the hash build in the previous plan; now we start directly with the scan of lineitem.
time 2.2e-06% fanout 1 input 1 rows
{ fork
time 2.3e-06% fanout 1 input 1 rows
{ fork
time 11% fanout 6.00038e+08 input 1 rows
LINEITEM 6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)
time 78% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)
time 11% fanout 21231.4 input 7065 rows
group by read node
(t5.L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
After code:
0: L_ORDERKEY := := artm t5.L_ORDERKEY
4: BReturn 0
time 0.0014% fanout 1 input 1063 rows
Subquery Select(L_ORDERKEY)
time 0.051% fanout 1 input 1063 rows
ORDERS unq 0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
inlined O_ORDERKEY = L_ORDERKEY
-- We note that getting the 1063 customers by index takes no time, and there is no hash table to build
time 0.023% fanout 1 input 1063 rows
CUSTOMER unq 0.99 rows (.C_CUSTKEY, .C_NAME)
inlined C_CUSTKEY = .O_CUSTKEY
time 0.021% fanout 7 input 1063 rows
LINEITEM 4.3 rows(.L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
-- The rest is identical to the previous plan, cut for brevity
3852 msec 2311% cpu, 3213 rnd 5.99907e+08 seq 0.124456% same seg 1.08899% same pg
Compilation: 1 msec 0 reads 0% read 0 messages 0% clw
We save over 2s of real time. But the problem is how to know that very few customers will be hit. One could make a calculation that l_quantity is between 1 and 50, and that an order has an average of 4 lineitems with a maximum of 7. For the SUM to be over 312, only orders with 7 lineitems are eligible, and even so the l_quantities must all be high. Assuming flat distributions, which here happens to be the case, one could estimate that the condition selects very few orders. The problem is that real data with this kind of regularity is almost never seen, so such a trick, while allowed, would only work for benchmarks.
* * * * *
As it happens, there is a better way. We also note that the query selects the TOP 100 orders with the highest o_totalprice. This is a very common pattern; there is almost always a TOP k clause in analytics queries unless they GROUP BY something that is known to be of low cardinality, like nation or year.

If the ordering falls on a grouping column, as soon as there are enough groups generated to fill a TOP 100, one can take the lowest o_totalprice as a limit and add this into the query as an extra restriction. Every time the TOP 100 changes, the condition becomes more selective, as the 100th highest o_totalprice increases.

Sometimes the ordering falls on the aggregation result, which is not known until the aggregation is finished. However, in lookup-style queries, it is common to take the latest-so-many events or just the TOP k items by some metric. In these cases, pushing the TOP k restriction down into the selection always works.
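A simple example of the pattern where the pushdown always applies: the ordering column is a plain column of the scanned table, so once 100 rows are in hand, their lowest o_totalprice bounds every further candidate. The query below is only an illustration, not part of the benchmark.

SELECT TOP 100
 o_orderkey,
 o_orderdate,
 o_totalprice
FROM
 orders
WHERE
 o_orderdate >= CAST ('1995-01-01' AS DATE)
ORDER BY o_totalprice DESC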
So, we try this:
{
time 4e-06% fanout 1 input 1 rows
time 6.1e-06% fanout 1 input 1 rows
{ fork
-- The plan begins with orders, as we now expect a selection on o_totalprice.
-- We see that out of 150M orders, a little over 10M survive the o_totalprice selection, which gets more restrictive as the query proceeds.
time 33% fanout 1.00628e+07 input 1 rows
ORDERS 4.3e+04 rows(.O_TOTALPRICE, .O_ORDERKEY, .O_CUSTKEY, .O_ORDERDATE)
top k on O_TOTALPRICE
time 32% fanout 3.50797e-05 input 1.00628e+07 rows
END Node
After test:
0: if ({
-- The IN subquery is here kept as a subquery, not flattened.
time 0.42% fanout 1 input 1.00628e+07 rows
time 11% fanout 4.00136 input 1.00628e+07 rows
LINEITEM 4 rows(.L_ORDERKEY, .L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
time 21% fanout 2.55806e-05 input 4.02649e+07 rows
Sort streaming with duplicates (set_ctr, .L_ORDERKEY) -> (.L_QUANTITY)
time 2.4% fanout 9769.72 input 1030 rows
group by read node
(gb_set_no, .L_ORDERKEY, aggregate)
END Node
After test:
0: if (aggregate > 312 ) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
time 0.00047% fanout 0 input 353 rows
Subquery Select( )
}
) then 4 else 5 unkn 5
4: BReturn 1
5: BReturn 0
-- Here we see that fewer customers are accessed than in the non-TOP k plans, since there is an extra cut on o_totalprice that takes effect earlier
time 0.013% fanout 1 input 353 rows
CUSTOMER unq 1 rows (.C_CUSTKEY, .C_NAME)
inlined C_CUSTKEY = k_.O_CUSTKEY
time 0.0079% fanout 7 input 353 rows
LINEITEM 4 rows(.L_QUANTITY)
inlined L_ORDERKEY = k_.O_ORDERKEY
time 0.0063% fanout 0.0477539 input 2471 rows
Sort streaming with duplicates (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.0088% fanout 2.99153 input 118 rows
group by read node
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time 0.0063% fanout 0 input 353 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
}
time 8.5e-05% fanout 100 input 1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time 2.7e-06% fanout 0 input 100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}
949 msec 2179% cpu, 1.00486e+07 rnd 4.71013e+07 seq 99.9267% same seg 0.0318055% same pg
Here we see that the time is about 4x better than with the cheat version. We note that about 10M of 1.5e8 orders get considered. After going through the first 10% or so of orders, there is a TOP 100, and a condition on o_totalprice that will drop most orders can be introduced.

If we set the condition on the SUM of quantity so that no orders match, there is no TOP k at any point, and we get a time of 6.8s, which is a little worse than the initial time with the flattened IN. But since the TOP k trick does not allocate memory, it is relatively safe even in cases where it does not help.

We can argue that the TOP k pushdown trick is more robust than guessing the selectivity of a SUM of l_quantity. Further, it applies to a broad range of lookup queries, while the SUM trick applies only to TPC-H Q18, or close enough. Thus, the TOP k trick is safer and more generic.
We are approaching the end of the TPC-H blog series, with still two families of tricks to consider, namely, moving predicates between subqueries, and late projection. After this we will look at results and the overall picture.
To be continued...
The plenary meeting was preceded by a Linked Open Data Meetup with talks from Springer, fluid Operations, and several LOD2 partners (Universität Leipzig, University of Mannheim, the Semantic Web Company, and Wolters Kluwer Deutschland GmbH (WKD)).
Wolters Kluwer Deutschland GmbH (WKD) gave a presentation on the content production pipeline of their legal publications and their experiences in incorporating LOD2 technologies for content enrichment. This is a very successful LOD2 use case and demonstrates the value of linked data for the information industry.
Springer gave a talk about their interest in linked data for enriching the Lecture Notes in Computer Science product. Also conference proceedings could be enhanced with structured metadata in RDF. I asked about nanopublications. The comment was that content authors might perceive nanopublications as an extra imposition. On the other hand, in the life sciences field there is a lot of enthusiasm for the idea. We will see; anyway, biology will likely lead the way for nanopublications. I referred Aliaksandr Birukou of Springer to the companies Euretos and its parent S&T in Delft, Netherlands, and to Barend Mons, scientific director of NBIC, the Netherlands Bioinformatics Centre. These are among the founding fathers of the Nano Republic, as they themselves put it.
Sebastian Hellman gave a talk on efforts to set up the DBpedia Foundation as a not-for-profit organization, hopefully in the next 10 days, to aid in the sustainability and growth of the DBpedia project. The Foundation would identify stakeholders, their interests, and ways to generate income to further improve DBpedia. Planned areas of improvement include the development of high-availability value-added DBpedia services with quality of service (QoS) agreements for enterprise users; additional tools in the DBpedia stack to support improved and cost-efficient data curation and internationalization; and improved documentation, tutorials, and support to speed uptake.
I had a word with Peter Haase of fluid Operations about the Optique project and their cloud management offerings. The claim is to do ontology-directed querying over thousands of terabytes of heterogeneous data. This turns out to be a full-force attempt at large scale SQL federation with ontology-directed query rewriting for covering OWL 2 QL semantics. With Ian Horrocks of Oxford leading the ontology side, the matter is in good hands. Still, the matter is not without its problems. Simple lookups can be directed to the data, but if there are terabytes of it, it is more likely that aggregations are what is desired. Federated aggregation tends to move a lot of data. So the problems are as they ever were. However, if the analytics are already done and stored in the relational space, finding these based on ontologies is a worthwhile thing for streamlining end user access to information.
The LOD2 plenary itself was structured in the usual way, covering the work packages in two parallel tracks.
LOD2 Plenary Group Photo, Mannheim, February 2014
On the database side, the final victory will be won by going to adaptive schema for RDF. We brought the RDF penalty against relational to a factor of 2.5 for common analytics style queries, e.g., Star Schema Benchmark. This is a comparison to Virtuoso SQL, which offers very high performance in this workload, over 2x the speed of column store pioneer MonetDB and 300x MySQL. So this is where matters stand. To move them significantly forward, exploitation of structure for guiding physical storage will be needed. Also the project still has to deliver the 500 Gtriple results. The experiments around Christmas at CWI support the possibility, but they are not final. Putting triples into tables when the triples in fact form table-shaped structures, which is the case most of the time, may turn out to be necessary for this. At least, this will be a significant help.
Be that as it may, using a table schema for regularly shaped data, while preserving the RDF quad flexibility, would essentially abolish the RDF tax and bring the LOD2 project to a glorious conclusion in August.
I took the poetic license to compare the data journey into RDF and back to the Egyptian myth of Osiris: The data gets shut in a silo, then cut into 14 pieces and thrown into the Nile (i.e., the LOD cloud, or the CKAN catalog). Grief-stricken Isis sees what has become of her love: she patiently reassembles the pieces, reconstructing Osiris in fact so well that he sires her a child, hawk-headed Horus, who proceeds to reclaim his father's honor. (See, Isis means Intelligent Structured Information Storage.)
I had many interesting conversations with Chris Bizer about his research in data integration, working with the 150M HTML tables in the common crawl. The idea is to resolve references and combine data from the tables. Interestingly enough, the data model in these situations is basically triples, while these are generally not stored as RDF but in Lucene. This makes sense due to the string-matching nature of the task. There appears to be opportunity in bringing together the state of the art in database, meaning the very highly optimized column-store and vectored execution in Virtuoso with the search-style workload found in instance matching and other data integration tasks. The promise goes in the direction of very fast ETL and subsequent discovery of structural commonalities and enrichment possibilities. This is also not infinitely far from the schema discovery that one may do in order to adaptively optimize storage based on the data.
Volha Bryl gave a very good overview of the Mannheim work in the data integration domain. For example, learning data fusion rules from examples of successful conflict resolution seems very promising. Learning text extraction rules from examples is also interesting. The problem of data integration is that the tasks are very heterogeneous, and therefore data integration suites have very large numbers of distinct tools. This is labor intensive, but there is progress in automation. An error-free, or near enough, data product remains a case-by-case affair with human curation, but automatic methods seem, based on Volha's and Chris' presentations, to be in the ballpark for statistics.
Giovanni Tummarello of Insight/SindiceTech, always the life of the party, presented his Solr-based relational faceted browser. The idea is to show and drill down by facets over a set of related tables; in the demo, this was investments, investment targets, and investors. You can look at the data from any of the points and restrict the search based on attributes of any. Well, this is what a database does, right? That is so, but the Sindice tool is on top of Solr and actually materializes joins into a document. This blows up the data but has all the things colocated so it can even run from disk. We also talked about the Knowledge Graph package Sindice offers on the Google cloud, this time a Virtuoso application.
We hope that negotiations between SindiceTech and Insight (formerly DERI) around open sourcing the SPARQL editor and other items come to a successful conclusion. The SPARQL editor especially would be of general interest to the RDF community. It is noteworthy that there is no SPARQL query builder in common use out there (even OpenLink's own open source iSPARQL has been largely (but not entirely!) overlooked and misunderstood, though it's been available as part of the OpenLink Ajax Toolkit for several years). OK, a query builder is useful when there is schema. But if the schema is an SQL one, as will be the case if RDF is adaptively stored, then any SQL query builder can be applied to the regular portion of the data. 40 years of calendar time and millennia of person years have gone into making SQL front ends and these will become applicable overnight; Virtuoso does speak SQL, as you may know.
I had the breakout session about the database work in LOD2. What will be done is clear enough, the execution side is very good, and our coverage of the infinite space of query optimization continues to grow. One more revolution for storage may come about, as suggested above. There is not very much to discuss, just to execute. So I used the time to explain how you run
SELECT SUM ( l_extendedprice )
FROM lineitem,
 part
WHERE l_partkey = p_partkey
 AND p_name LIKE '%green%'
Simple query, right? Sure, but application guys or sem-heads generally have no clue about how these in fact need to be done. I have the probably foolish belief that a little understanding of database, especially in the RDF space which does get hit by every query optimization problem, would be helpful. At least one would know what goes wrong. So I explained to Giovanni, who is in fact a good geek, that this is a hash join, and with only a little prompting he suggested that you should also put a Bloom filter in front of the hash. Good. So in the bar after dinner I was told I ought to teach. Maybe. But the students would have to be very fast and motivated. Anyway, the take-home message is that the DBMS must figure it out. In the SQL space this is easier, and of course, if most of RDF reduces to this, then RDF too will be more predictable in this department.
I talked with Martin Kaltenböck of the Semantic Web Company about his brilliant networking accomplishments around organizing the European Data Forum and other activities. Martin is a great ambassador and lobbyist for linked data across Europe. Great work, also in generating visibility for LOD2.
The EU in general, thanks in great part to Stefano Bertolo’s long term push in this direction, is putting increasing emphasis on measuring progress in the research it funds. This is one of the messages from the LOD2 review also. Database is the domain of performance race par excellence; the matters on that side are well attended to by LDBC and, of course, the unimpeachably authoritative TPC, among others. In other domains, measurement is harder, as it involves a human-curated ground truth for any extraction, linking, or other integration. There is good work in both Mannheim and Leipzig in these areas, and I may at some point take a closer look, but for now it is appropriate to stick to core database.
To check out —
$ git clone https://github.com/v7fasttrack/virtuoso-opensource.git v7fasttrack
The v7fasttrack tree compiles just like the main Virtuoso tree. Its content is substantially the same today, except for the file tables feature. There is a diff with the main tree which now consists mostly of white space, since the Fast Track tree is automatically indented with the Linux indent utility each time it is updated, and the Virtuoso tree is not.
Ongoing maintenance and previews of new features will be added to this tree as and when they become available.
Let's now look at the "file tables" feature.
You can use any CSV file like a table, as described in the documentation. The TPC-H data generator (complete source ZIP; ZIP of just the dbgen source) is a convenient place to start to try things out.
To generate the qualification database, run —
dbgen -s 1
This makes 8 CSV files called *.tbl. You can use these scripts to load them into Virtuoso —
To verify the load, do —
SELECT COUNT (*)
FROM lineitem_f
;
SELECT COUNT (*)
FROM lineitem
;
To try different combinations of tables and CSV files, you can, for example, do —
SELECT COUNT (*)
FROM lineitem,
part_f
WHERE l_partkey = p_partkey
AND p_name LIKE '%green%'
;
This counts shipments of green parts, using the file as the part table. You can then replace the part_f with part to join against the database. The database will be a little faster, but the file is also pretty fast, since the smaller table (part) is on the build side of a hash join and the scan of lineitem is the same in either case.

You can now replace lineitem with lineitem_f and you will see a larger difference. This is still reasonably fast since the lineitem file is scanned in parallel.
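For example, the same count with both sides coming from the files reads —

SELECT COUNT (*)
FROM lineitem_f,
 part_f
WHERE l_partkey = p_partkey
AND p_name LIKE '%green%'
;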
You can try the different TPC-H queries against tables and files. To get the perfect plans you will need the analytics branch which will be made available shortly via this same GitHub channel.
You can also try RDFizing the files using the scripts in the Enterprise Linked Data article from earlier this year. The qualification database should go in about 15 minutes on a commodity server and make some 120M triples. In the article, the data came from another server, but it can just as well come from files. These two scripts from that article have been adapted for loading from files —
To try this, execute the following commands in iSQL —
LOAD sql_rdf.sql;
RDF_VIEW_SYNC_TO_PHYSICAL
( 'http://example.com/tpcd'
, 1
,
, "urn:example.com:tpcd"
, 2
, 0
)
;
To verify the result —
sparql
SELECT ?p
COUNT (*)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC 2
;
Since then the situation has improved somewhat, with the code being on GitHub and kept more or less up to date with maintenance.
There is no reason why users should not get bug fixes when they are made, nor why users should not get functionality previews.
Introducing v7fasttrack, a clone of the git repository, where maintenance goes real time, and where early releases of functionality are available for experimentation.
It is open kitchen, folks. You will see us slice the fish and mix the curry right in the dining room. You will see epic feats of pizza flipping -- something to regale your grandchildren with.
You will still have to place your orders with your waiter; we still won't have time to be on most lists and such but at least you will get the dish while it's hot.
In the immediate future, the "file-as-table" feature will be introduced via this channel. This is the capability to make CSV files look like tables, which was mentioned in the TPC-H bulk load article. This is key to bulk loading 100GB in 15 minutes on a single commodity server.
The query caching feature mentioned in the previous blog post is the next candidate for availability in this channel. If you have repeating long queries that do lookups with under a million triples or so, compile times will dominate and this will give significant acceleration.
The largest item to come is the availability of the analytics branch that has been discussed in the TPC-H series. It is basically complete, with implementations of most tricks discussed in the TPC TC paper, TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark.
These are structured as branches, with safe maintenance to one side, and new features which potentially break things in their own branch. For example, with a much smarter optimizer in the analytics branch, it is conceivable some applications will get worse plans at first even if the benchmarks are good. Anyway, experimentation is safe.
These will be introduced in future blog posts, and the relevant documentation will be in the archive. The DocBook XML form will come later.
These features will migrate to the regular git in time. But you will not have to hold your breath waiting for this.
The next action is the publishing of the file table feature on the fast track. There will be a post on this next week.
So reusing query plans between invocations is a natural optimization. This works especially well with lookup workloads, or when the data is small. Analytics tends to be dominated by run time, but lookups that touch at most a million or so triples will often be bound by compilation time, especially if these have tens of triple patterns.
Let's consider the following from Open PHACTS:
sparql
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX cheminf: <http://semanticscience.org/resource/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX qudt: <http://qudt.org/1.1/schema/qudt#>
SELECT DISTINCT ?item
WHERE
{ VALUES ?chembl_target_uri
{ <http://rdf.ebi.ac.uk/resource/chembl/target/CHEMBL5451> }
GRAPH <http://www.ebi.ac.uk/chembl>
{
?assay_uri chembl:hasTarget ?chembl_target_uri .
?assay_uri chembl:hasActivity ?item .
?item chembl:hasMolecule ?compound_chembl .
?chembl_target_uri a ?target_type .
OPTIONAL { ?chembl_target_uri dcterms:title ?target_name_chembl }
OPTIONAL { ?chembl_target_uri chembl:organismName ?target_organism }
OPTIONAL { ?chembl_target_uri chembl:hasTargetComponent ?protein .
GRAPH <http://www.conceptwiki.org>
{
?cw_target skos:exactMatch ?protein
; skos:prefLabel ?protein_name
}
}
OPTIONAL { ?assay_uri chembl:organismName ?assay_organism }
OPTIONAL { ?assay_uri dcterms:description ?assay_description }
OPTIONAL { ?assay_uri chembl:assayTestType ?assay_type }
OPTIONAL { ?item chembl:publishedType ?published_type }
OPTIONAL { ?item chembl:publishedRelation ?published_relation }
OPTIONAL { ?item chembl:publishedValue ?published_value }
OPTIONAL { ?item chembl:publishedUnits ?published_unit }
OPTIONAL { ?item chembl:standardType ?activity_type }
OPTIONAL { ?item chembl:standardRelation ?activity_relation }
OPTIONAL { ?item chembl:standardValue ?standard_value .
BIND ( xsd:decimal( ?standard_value ) AS ?activity_value )
}
OPTIONAL { ?item chembl:standardUnits ?activity_unit }
OPTIONAL { ?item chembl:hasQUDT ?qudt_uri }
OPTIONAL { ?item chembl:pChembl ?pChembl }
OPTIONAL { ?item chembl:activityComment ?act_comment }
OPTIONAL { ?item chembl:hasDocument ?doc_uri .
OPTIONAL { ?doc_uri owl:sameAs ?doi }
OPTIONAL { ?doc_uri bibo:pmid ?pmid }
}
}
GRAPH <http://ops.rsc.org>
{
?compound_ocrs skos:exactMatch ?compound_chembl .
?compound_ocrs cheminf:CHEMINF_000396 ?inchi
; cheminf:CHEMINF_000399 ?inchi_key
; cheminf:CHEMINF_000018 ?smiles .
OPTIONAL { [] obo:IAO_0000136 ?compound_ocrs
; a cheminf:CHEMINF_000484
; qudt:numericValue ?molweight .
}
OPTIONAL { [] obo:IAO_0000136 ?compound_ocrs
; a cheminf:CHEMINF_000367
; qudt:numericValue ?num_ro5_violations .
}
}
?compound_cw skos:exactMatch ?compound_ocrs
; skos:prefLabel ?compound_name
}
ORDER BY ?item
LIMIT 10
OFFSET 0
;
This is quite typical. More complex ones have been seen, with many unions on top.
We run this with profile on warm cache, no plan reuse in effect. The database is Open PHACTS OPS from this January.
profile ('sparql prefix ....');
...
15 msec 4% cpu, 15688 rnd 9547 seq 91.8669% same seg 4.62107% same pg
Compilation: 313 msec 0 reads 0% read 0 messages 0% clw
The compilation time is over 20x longer than the execution. We see from the top line that the execution did 15K random lookups and retrieved 9K rows sequentially.
We enable plan reuse and rerun:
14 msec 58% cpu, 15688 rnd 9547 seq 91.8669% same seg 4.62107% same pg
Compilation: 0 msec 0 reads 0% read 0 messages 0% clw
The compile time is now gone. This is an especially large win. With a set of 31 queries from Open PHACTS, each repeated over many different parameter bindings, the gain from query caching is a 1.5x speedup. More details may be published once VU Amsterdam, which does the data management for Open PHACTS, publishes the benchmark data and queries. The present figures are an order of magnitude better than those from last fall, which will also appear in that publication.
With query plan caching, the same plan will be reused as long as the literals in the new query have approximately the same selectivity as the ones present when the plan was first made. Thus, if a different plan is in fact needed, one will be made. The same query text can have many alternative plans, each for a different selectivity of the search conditions.
In this way, plan reuse may work better than prepared statements. Anyway, prepared statements do not exist in the SPARQL query language. In SQL they do, but then the optimizer does not know the values the parameters will have.
The overhead of plan reuse, as opposed to parameterized prepared statements, is relatively low. The cache remembers the sampling that was done when the plan was first made. The same samples are taken with the new literals plugged in. If the cardinalities are within a settable percentage (e.g., 20% of the original), the plan is assumed to be applicable. On the other hand, with prepared parameterized statements, there is no sampling at all, but then the plan might be worse due to less information being available to the optimizer.
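To make "approximately the same selectivity" concrete, here is a minimal sketch; the query below only uses terms already seen in the Open PHACTS example above, and the alternative target IRI mentioned after it is purely hypothetical.
SQL> sparql
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
SELECT COUNT (*)
WHERE { ?assay chembl:hasTarget <http://rdf.ebi.ac.uk/resource/chembl/target/CHEMBL5451> }
;
If the same text is later submitted with a different target IRI plugged in, the cached plan is kept only if re-running the remembered samples with the new IRI gives cardinalities within the tolerance; otherwise a fresh plan is compiled and cached alongside the old one.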
Publishing is another type of workload where compile times easily form a large percentage of the total. The queries are shorter than in the biology case, since the modeling tends to be simpler and there are fewer distinct sources being queried. The frequency of the queries is higher, though, and each might touch only some tens of triples.
The query caching feature will be included in forthcoming Virtuoso updates, and will require neither operator intervention nor changes to configurations or applications. The feature will be controlled by a few settings in the configuration file, but the defaults will work for almost all cases.
So, we take our standard TPC-H 100 GB dataset and turn it into RDF. We adjust the schema a little, so that each customer, its orders, and the lineitems of those orders go into a per-customer graph. The part, partsupp, and other tables will all go into a public graph. The per-customer graph can be used as a security label; for example, if there is customer self-service access to the warehouse, or if the access is compartmentalized by areas of responsibility (e.g., customer countries or market segments).
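As a sketch of how such a graph can then scope a self-service query (the graph IRI below is hypothetical; the real IRIs come out of the mapping described next), one might ask for a customer's largest orders like this:
SQL> sparql
PREFIX rdfh: <http://lod2.eu/schemas/rdfh#>
SELECT ?order ?total
FROM <http://example.com/rdfh/customer/12345>
WHERE { ?order rdfh:o_totalprice ?total }
ORDER BY DESC (?total)
LIMIT 10
;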
We use two Virtuoso processes. The first contains the TPC-H 100G dataset, the same that was discussed in the TPC-H bulk load article. The second process attaches the tables from the first via SQL federation, and constructs an RDF translation into its RDF store. The mapping is made with an RDF view, also known as a Linked Data View. The initial RDF view can be generated from the relational schema, then edited for the selection of properties. If there are modeling or unit changes in the mapping, these are easiest done with SQL views, in which case the RDF mapping is made on top of the views, not the actual tables. The SQL views reside on the same server that has the RDF views, so no write access to the source database is needed.
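For example, if order totals had to be exposed in a different unit, a sketch of the kind of SQL view involved might look like the following; the view and column names are hypothetical, and the table it reads is the attached remote orders table, whatever its local name is.
-- hypothetical unit-change view; the RDF view is then defined over this
-- instead of the attached base table
CREATE VIEW ORDERS_FOR_RDF AS
  SELECT O_ORDERKEY,
         O_CUSTKEY,
         O_ORDERDATE,
         O_ORDERSTATUS,
         O_TOTALPRICE * 100 AS O_TOTALPRICE_CENTS
  FROM ORDERS;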
The server configuration is found in virtuoso.ini. This is for 4 disks and 192 GB RAM, so if you try this, make sure you have at least this much, or use an accordingly scaled-down dataset.
The data is defined by loading the scripts:
SQL> LOAD att2.sql ;
SQL> LOAD sql_rdf_rdfh11.sql ;
The first attaches the tables from the source server; the second defines the mapping from tables to triples. The final script (below) starts the actual ETL.
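For orientation, the attach step consists of statements roughly of this shape; the DSN, credentials, and names below are placeholders, and the exact statements are in att2.sql.
-- attach one source table over the VDB link; repeated for each TPC-H table
ATTACH TABLE CUSTOMER FROM 'tpch100g' USER 'dba' PASSWORD 'dba';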
We set the default vector size to 200,000. Experience shows this is good for this sort of operation, and may save some 20% of time.
SQL> __dbf_set ('dc_batch_sz', 200000) ;
We run the transformation. The commands are in the rdb2rdf_rdfh11_1.sql script, discussed below. The ld_meter_run call starts a thread to record the load rate every 30 seconds.
SQL> LOAD /mvi/te/suite/tpc-d/rdb2rdf_rdfh11_1.sql &
SQL> ld_meter_run (30) &
Done. -- 40686670 msec.
-- The load is non-transactional bulk load, so needs an explicit checkpoint to make the result durable.
SQL> checkpoint ;
Done. -- 1858934 msec.
-- we check the result
SQL> sparql
SELECT COUNT (*)
WHERE { ?s ?p ?o }
;
11869611740
SQL> sparql
SELECT ?p COUNT (*)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC 2
LIMIT 200
;
Predicate URI | number of triples |
---|---|
http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 881,038,747 |
http://lod2.eu/schemas/rdfh#l_has_order | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_has_part | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_number | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_discount | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_linestatus | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_shipdate | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_quantity | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_extendedprice | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_commitdate | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_has_supplier | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_tax | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_returnflag | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_receiptdate | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_shipinstruct | 600,037,902 |
http://lod2.eu/schemas/rdfh#l_shipmode | 600,037,902 |
http://lod2.eu/schemas/rdfh#o_clerk | 150,000,000 |
http://lod2.eu/schemas/rdfh#o_comment | 150,000,000 |
http://lod2.eu/schemas/rdfh#c_customer_of | 150,000,000 |
http://lod2.eu/schemas/rdfh#o_orderstatus | 150,000,000 |
http://lod2.eu/schemas/rdfh#o_totalprice | 150,000,000 |
http://lod2.eu/schemas/rdfh#o_orderpriority | 150,000,000 |
http://lod2.eu/schemas/rdfh#o_orderkey | 150,000,000 |
http://lod2.eu/schemas/rdfh#o_orderdate | 150,000,000 |
http://lod2.eu/schemas/rdfh#o_shippriority | 150,000,000 |
http://lod2.eu/schemas/rdfh#ps_has_supplier | 80,000,000 |
http://lod2.eu/schemas/rdfh#ps_availqty | 80,000,000 |
http://lod2.eu/schemas/rdfh#ps_supplycost | 80,000,000 |
http://lod2.eu/schemas/rdfh#ps_has_part | 80,000,000 |
http://lod2.eu/schemas/rdfh#p_type | 20,000,000 |
http://lod2.eu/schemas/rdfh#p_size | 20,000,000 |
http://lod2.eu/schemas/rdfh#p_container | 20,000,000 |
http://lod2.eu/schemas/rdfh#p_mfgr | 20,000,000 |
http://lod2.eu/schemas/rdfh#p_partkey | 20,000,000 |
http://lod2.eu/schemas/rdfh#p_name | 20,000,000 |
http://lod2.eu/schemas/rdfh#p_brand | 20,000,000 |
http://lod2.eu/schemas/rdfh#p_comment | 20,000,000 |
http://xmlns.com/foaf/0.1/phone | 16,000,000 |
http://lod2.eu/schemas/rdfh#c_comment | 15,000,000 |
http://lod2.eu/schemas/rdfh#c_acctbal | 15,000,000 |
http://lod2.eu/schemas/rdfh#c_mktsegment | 15,000,000 |
http://lod2.eu/schemas/rdfh#c_custkey | 15,000,000 |
http://lod2.eu/schemas/rdfh#c_name | 15,000,000 |
http://lod2.eu/schemas/rdfh#c_has_nation | 15,000,000 |
http://lod2.eu/schemas/rdfh#c_address | 15,000,000 |
http://lod2.eu/schemas/rdfh#c_phone | 15,000,000 |
http://lod2.eu/schemas/rdfh#n_nation_of | 15,000,000 |
http://lod2.eu/schemas/rdfh#s_has_nation | 1,000,000 |
http://lod2.eu/schemas/rdfh#s_acctbal | 1,000,000 |
http://lod2.eu/schemas/rdfh#s_name | 1,000,000 |
http://lod2.eu/schemas/rdfh#s_address | 1,000,000 |
http://lod2.eu/schemas/rdfh#s_comment | 1,000,000 |
http://lod2.eu/schemas/rdfh#s_suppkey | 1,000,000 |
http://lod2.eu/schemas/rdfh#s_phone | 1,000,000 |
... |
We calculate the ETL speed:
SQL> SELECT 11869611740 / ( ( 1858934 + 40686670 ) / 1000.0 ) ;
278985.620700084549276
11.8 hours, at just under 280 Kt/s, end-to-end. Worse has been heard of. This is the speed of a single small server, the usual test system with dual Xeon E5-2630 and 192 GB RAM. A single server at double the price might get double the throughput. Beyond this, scale-out is clearly the better deal. An elastic cluster will get throughput linear in the number of machines for this type of workload.
This shows that deploying mid-size enterprise data as RDF is a job that goes easily overnight with a commodity box, reading directly from the source system; no file-system-based staging areas are needed.
The dataset is 600M order lines; 150M orders; 15M customers; 20M parts, each with 4 suppliers; 1M total suppliers. You can contrast this to what you have in-house to get a rough estimate of what your own DW would come to.
Later, we will use this dataset to illustrate how to scope queries to security categories with graph-level security. Of course, this dataset also provides a point of SQL-to-SPARQL comparison for the ongoing TPC-H series. There will be more installments before long.
The BSBM generator and driver are an ever-fresh source of new terrors. The nature of the gig is that you have a window to do your experiment in, and that involves first generating the test data. It is somewhere around 3 TB of gzipped files. It took a whole week to make the files. During that time of course you want to anticipate what's going to break with the queries. So while the generator was going, we loaded 50 renamed replicas of the 10 Gt dataset. At partial capacity, we may add, because 4 boxes had half memory taken by the BSBM generator. We hate that program. Of course nobody gives a damn about it so it has been maintained in the worst way possible; for example, the way its cluster version generates slices of data is by having every instance actually generate the full data set, but only write 1 out of so many items to the output. So no amount of capacity will make it faster. As for BSBM itself, if you generate 10 Gt once and occasionally use this as test data, it does not inconvenience you so much. Then, of course, the test driver was patched to generate queries against renamed replicas of a dataset. But then the new driver would not read the dataset summary files made by the previous driver, because of Java class versions. 8 hours to regenerate 10 Gt. A real train wreck. This is by far not the end of it but we are out of space. So on with it; may that program be buried.
In the end, the 2000 gz files with the 500 Gt in them were complete. Then it turns out each file has tens of millions of namespace prefixes at the beginning. So, starting to load a file grows the process by some 9 GB just for the prefixes. So, out of 256 GB of RAM per box, there are about 72 GB taken by the prefixes, if you load 8 files in parallel on each. Well, one could do a sed script to unzip, expand the prefixes, and rezip, and the file would not be any bigger; but it would be a day to run.
So, anyway, with 12 boxes, 24 processes, and (in principle) 384 threads, the load rate is between 3 and 4 million triples per second ("Mt/s"). With 2 boxes, it is 630 Kt/s, so you would say this is scalable. Near enough to linear; the 2 boxes have 12 cores each at 2.3GHz, the Scilens boxes have 16 at 2.0GHz; close enough.
For the 3-4 Mt/s rate, there is an average of 200 threads running. This is not the full platform, as the second thread of each core is idle for the most part. Adding the second thread usually adds some 30% throughput. A high of 5 Mt/s could be had by going to full CPU, but doubling the files being loaded would run out of memory because of the namespace prefixes. See, it is sheer luck that the BSBM thing, inept as it is, is still marginally usable, despite the prefixes and the horrible generator. A bit worse still, and it would have been a non-starter. It comes from the times when RDF just meant inept database, so scalability clearly was not among its design objectives.
With 96 files being loaded across the cluster, we got the run stats below for a couple of 4 minute windows. In each, the data size at time of the sample is between 50 Gt and 100 Gt. The long line is the cluster status summary; the tables below are load rates in the windows between timestamps, so, growth in triple count as triples per second (tps) since the previous sample.
Cluster 24 nodes, 240 s. 18866 m/s 692017 KB/s 21842% cpu 7% read 95% clw threads 356r 0w 114i buffers 99250961 97503789 d 2275 w 0 pfs
load rate (tps) |
timestamp |
---|---|
3,853,915.028323892 | 2014-01-04 08:38:36 +0000 |
4,245,681.678456353 | 2014-01-04 08:38:33 +0000 |
3,680,757.080973009 | 2014-01-04 08:38:06 +0000 |
4,138,599.125958298 | 2014-01-04 08:38:03 +0000 |
4,887,272.575808064 | 2014-01-04 08:37:36 +0000 |
4,093,772.082515462 | 2014-01-04 08:37:33 +0000 |
4,399,343.552149284 | 2014-01-04 08:37:06 +0000 |
4,184,758.045998296 | 2014-01-04 08:37:03 +0000 |
3,884,665.444851716 | 2014-01-04 08:36:36 +0000 |
4,197,270.027036035 | 2014-01-04 08:36:33 +0000 |
Some hours later --
Cluster 24 nodes, 240 s. 14601 m/s 506784 KB/s 19721% cpu 61% read 1310% clw threads 374r 0w 126i buffers 189886490 107378792 d 1983 w 18 pfs
load rate (tps) |
timestamp |
---|---|
3,273,757.708076397 | 2014-01-04 11:49:53 +0000 |
3,274,119.596013466 | 2014-01-04 11:49:53 +0000 |
3,318,539.715342822 | 2014-01-04 11:49:23 +0000 |
3,318,701.609946335 | 2014-01-04 11:49:23 +0000 |
3,127,730.142328589 | 2014-01-04 11:48:53 +0000 |
3,127,731.608946369 | 2014-01-04 11:48:53 +0000 |
3,273,572.647578414 | 2014-01-04 11:48:23 +0000 |
3,273,622.779240692 | 2014-01-04 11:48:23 +0000 |
2,872,466.21779274 | 2014-01-04 11:47:53 +0000 |
2,872,495.383487217 | 2014-01-04 11:47:53 +0000 |
Pretty good. I don't know of others coming even close.
Next we will look at query plans and scalability in query processing.
In this usage pattern, the graph is akin to a relational row, plus maybe the rows of one-to-many dependent tables; for example, an order and its order lines would make a single graph. In a publishing setting, what goes into a graph is whatever is approved for publication as a unit. So, a graph is conceptually somewhere between a row and a document.
The access control situations further fall into two principal types: the hosted-application case, where each principal sees mostly its own graphs, and the publishing case, where most content is public and a subset is restricted. OpenPHACTS, for example, would fall into the latter category. There the matter is not so much about charging for premium content but about keeping proprietary data separate from public.
The graph is thus a combined provenance and security label. We sometimes hear about applications that would want quints (a quad with an extra field) for separating these concerns, but so far no actionable need has materialized. One could do quints if necessary. In the past, we also talked to Systap, the vendor of the Bigdata® RDF store, about doing reification right, which would resolve many of these things in one go.
So anyway, it is time to do graph-level security right. The technical approaches depend on how many grantable objects one has, and how many distinct security principals ("grantees") there are. If the separately grantable objects are numerous (e.g., a billion distinct graphs), and if the typical principal has access to most, it makes sense to enumerate for each principal the items that are denied instead of enumerating the granted ones.
There is a related case of scoping queries to a variable set of graphs. This corresponds to cases like "give me information from sources with an over-average reliability rating." There is also the security-related special case of scoping to non-classified material. This has the special feature that one part of the query must know what the classified material is, so as to exclude that. However, since from the user's viewpoint the classified material does not exist, the query must run at two different access control levels and must prevent leakage between the two. This is routinely done with SQL views and SQL policy functions so there is nothing new here, but this is just not something the RDF people have thought much of.
In the hosted-application scenario there is a slowly-changing, relatively-small set of graphs that are in scope for a user session. Queries will only consider data where the graph is in this set. This set is typically different for each principal. The distinct end users are not extremely numerous; maybe in the thousands.
In a publishing setting, there is a small set of restricted content and a large number of principals; however, many principals will have exactly the same exclude list (e.g.: no special access; paid access to content class A or B or both; etc.). The number of access compartments is small relative to the number of principals, and many independent principals will share the same set of premium content. As a result, the number of distinct exclusion lists will not be extremely large, even if the lists may be mid-size.
How does one build this into the database? This is done with a selective hash join for granted lists, and a hash anti-join (a NOT EXISTS operator) for the exclusion lists. From the TPC-H series, we recall that there is an invisible hash join operation that can be merged into a table scan/index lookup. This is the ideal building block for this task, except that it needs to be extended to also have a negative form. The negative form also occurs in TPC-H -- for example, in Q16, where one considers only stock from suppliers with no customer complaints.
There are a few distinct cases of application behavior where the cost of enforcing graph-level access will be different. For example, in OpenPHACTS, a query is often a union/join between graph patterns where the graph is fixed. In this way, each triple pattern has exactly one named graph where it can be matched, but many named graphs are mentioned in the query. In this case, the access check has no cost, or the cost is a vanishingly small constant. In the case of the hosted application, where the named graph corresponds to a unit of update ("row" or "document," as discussed above), the query typically does not specify a named graph, so each triple pattern can match anywhere within the visible graphs. There the check is enforced at each triple pattern, but this is always against the same list (i.e., hash table), and the hash table will most often fit in CPU cache. Most of the accesses will find a match in the hash table.
In the publishing case, where relatively small premium content is to be excluded, the named graph will also generally not be specified; but it can happen that some parts of the query are required to match within the same named graph -- i.e., there is a pattern like
graph ?g
{ ?x xx:about <stuff>
. ?x xx:date ?dt
. filter ( ?dt > ... )
}
Here, the named graph needs to be checked only once for the two patterns. We also note that since most of the content is not restricted, the check against the restriction will nearly always fail -- i.e., the hash lookup will miss. Missing is cheaper than hitting, as mentioned in the TPC-H series. One can use a Bloom filter for low cost miss detection. So it follows that an exclusion list that seldom hits can be maybe 10x bigger than a near-always-hitting inclusion list with the same cost of checking.
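As an application-level sketch of the two list shapes (the graph IRIs are hypothetical; the engine-internal checks described above do the equivalent test inside the scan), an inclusion list and an exclusion list can be written in plain SPARQL like this:
SQL> sparql
# inclusion: only graphs granted to the session
SELECT ?s ?label
WHERE
  {
    VALUES ?g { <http://example.com/acct/g1> <http://example.com/acct/g2> }
    GRAPH ?g { ?s <http://www.w3.org/2004/02/skos/core#prefLabel> ?label }
  }
LIMIT 10
;
SQL> sparql
# exclusion: everything except the premium graphs
SELECT ?s ?label
WHERE
  {
    GRAPH ?g { ?s <http://www.w3.org/2004/02/skos/core#prefLabel> ?label }
    FILTER ( ?g NOT IN ( <http://example.com/premium/classA>,
                         <http://example.com/premium/classB> ) )
  }
LIMIT 10
;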
So, if inclusion lists are in the 100,000-entry range and exclusion lists in the 1,000,000-item range, we are set. What if this is not so? In that case, we will rely on two-level tricks and encoding of application information into supposedly meaning-free identifiers. This is not sem-head material but is well within the domain of database science. For example, if the graph is a marker of provenance or update locality, then all graphs of the same owner or security classification may have an identifier in a certain range or ranges. Then we may only check for range, i.e., omit the low few bits from the hash lookup, knowing that any graph id in the range will have the same security behavior. This does introduce some more burden on the application, and will add some special cases when moving assets between security principals, but then life is like that. If there is a big-ticket application that depends on this, then such things can be provided as custom development.
I will here refer you to David Karger's immortal insight at the Semantic Big Data panel at ESWC 2013, where he said that "big data is very much about performance, and performance is usually obtained by sacrificing the general for the specific." So it is.
So far, this has been a summary in which the many people with whom this matter has come up may recognize their specific case. Next, I will show some experiments. We use a copy of the URIBurner dataset, as this is a collection with large numbers of tiny graphs that can be assigned different security attributes.
To be continued.
I will here give results for practice runs on my desktop, with 1/50th the data and 1/8th the capacity. This is 10 billion triples on a system of two machines, each with 2x Xeon E5-2630, 192 GB RAM, and QDR InfiniBand.
We start with Explore with 16 clients. The clients are evenly divided over 4 server processes, with 2 processes per machine. We do a run of 100 --
% ./bibm/bsbmdriver -seed 1287654 -dg http://bsbm.org -t 300000 \
-idir /1d4/bsbm_10000/td_data -uqp query -uc bsbm/explore \
-mt 16 -runs 500 http://madras:8604/sparql \
http://madras:8605/sparql http://masala-i:8606/sparql \
http://masala-i:8607/sparql
The QMPH (query mixes per hour) is 12683.046. The run is 1500 query mixes; it takes 425 s. The warmup is about 3000 query mixes with a different seed. The details are in 10gc16e.xml. The sample configurations are as in virtuoso.global.ini, cluster.global.ini, and virtuoso.ini.
We note that there is a total of 20M 8 KB buffer pages, and after running 1000 or so different query mixes, about 16M get used. So the working set is about 16M * 8 KB = 128 GB, for 10 Gt ("Gigatriples"). The quads themselves take less space than that, but the benchmark also accesses some literals. At 50x the size and at best 2.5 TB worth of buffers, there may be a problem.
The total database files are around 800K pages * 8 KB/page * 48 slices, or about 307 GB. This times 50 is 15.3 TB. I do not think the system has that much SSD space, and it has about 3 TB per node in 3-way striped RAID 0 disk. There will be some disk access during the explore run. So we will report one number with steady state from disk, and another for a rerun of a set of queries where data is known to come from memory.
We note that there is speculative read, taking whole extents in when not all pages get used. Whether one reads 8 KB or 2 MB (the extent) makes little difference, so may as well do whole extents. Subtracting the speculatively-read pages that are not in fact accessed, we get 1.5M working set per box, which would indicate that we will make it into a RAM-based steady-state on the 500 Gt Scilens experiment. We shall see.
Loading may present some problems, since last time we had two boxes with significantly worse disk-write throughput than the rest. The Virtuoso I/O system is now different, with more emphasis on writing contiguous sequential ranges of pages, irrespective of the time the page became dirty. But there is nothing that a bad disk will not screw up.
We go to BI. First single user (power) run. This is preceded by one single user BI run with a different seed, for warmup. The power run has 4 consecutive query mixes; the throughput run has the same 4 query mixes concurrently.
Power query mix run time: 229 s (arithmetic mean)
Throughput query mix run time: 269 s (arithmetic mean)
The test driver output follows; the full result summaries are in 10gc-4pwer.xml and 10gc-4tp.xml.
% ./bibm/bsbmdriver -drill -t 300000 -dg http://bsbm.org -idir \
/1d4/bsbm_10000/td_data -uqp query -uc bsbm/bi -mt 1 -runs 4 \
http://madras:8604/sparql
% java -Xmx256M com.openlinksw.bibm.bsbm.TestDriver -qrd ./bibm \
-dg http://bsbm/ -drill -t 300000 -dg http://bsbm.org -idir \
/1d4/bsbm_10000/td_data -uqp query -uc bsbm/bi -mt 1 -runs 4 \
http://madras:8604/sparql
Thread 1: query mix: 0 255.074 s, total: 255.195 s
Thread 1: query mix: 1 170.622 s, total: 170.667 s
Thread 1: query mix: 2 295.642 s, total: 295.691 s
Thread 1: query mix: 3 188.885 s, total: 188.935 s
Benchmark run completed in 910.493 s
Query Number | Execute Count | Timeshare | aqet | aqetg | aps | minqet | maxqet | Average Results | Min Results | Max Results | Timeout Count |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 4 | 4.079 | 9.281250 | 7.848979 | 0.108 | 4.141000 | 18.836000 | 10.000 | 10 | 10 | 0 |
2 | 4 | 2.358 | 5.365000 | 4.907928 | 0.186 | 2.539000 | 8.182000 | 10.000 | 10 | 10 | 0 |
3 | 4 | 25.639 | 58.343500 | 24.902196 | 0.017 | 2.640000 | 111.680000 | 10.000 | 10 | 10 | 0 |
4 | 20 | 28.908 | 13.156450 | 1.725366 | 0.076 | 0.130000 | 73.728000 | 92.650 | 55 | 100 | 0 |
5 | 20 | 12.229 | 5.565350 | 2.550107 | 0.180 | 0.202000 | 16.251000 | 30.250 | 14 | 58 | 0 |
6 | 4 | 0.277 | 0.631250 | 0.574319 | 1.584 | 0.269000 | 0.950000 | 49.250 | 14 | 72 | 0 |
7 | 24 | 3.875 | 1.469625 | 0.268809 | 0.680 | 0.056000 | 7.961000 | 54.875 | 0 | 413 | 0 |
8 | 20 | 22.635 | 10.301600 | 5.450917 | 0.097 | 0.626000 | 36.908000 | 10.000 | 10 | 10 | 0 |
% ./bibm/bsbmdriver -drill -t 300000 -dg http://bsbm.org -idir \
/1d4/bsbm_10000/td_data -uqp query -uc bsbm/bi -mt 4 -runs 4 \
http://madras:8604/sparql http://madras:8605/sparql \
http://masala-i:8606/sparql http://masala-i:8607/sparql
% java -Xmx256M com.openlinksw.bibm.bsbm.TestDriver -qrd ./bibm \
-dg http://bsbm/ -drill -t 300000 -dg http://bsbm.org -idir \
/1d4/bsbm_10000/td_data -uqp query -uc bsbm/bi -mt 4 -runs 4 \
http://madras:8604/sparql http://madras:8605/sparql \
http://masala-i:8606/sparql http://masala-i:8607/sparql
Thread 2: query mix: 1 474.435 s, total: 474.498 s
Thread 1: query mix: 0 669.552 s, total: 669.663 s
Thread 3: query mix: 3 914.943 s, total: 915.009 s
Thread 4: query mix: 2 1077.138 s, total: 1077.283 s
Benchmark run completed in 1077.285 s
%
Query Number | Execute Count | Timeshare | aqet | aqetg | aps | minqet | maxqet | Average Results | Min Results | Max Results | Timeout Count |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 4 | 2.478 | 19.424500 | 12.522236 | 0.150 | 4.207000 | 52.115000 | 10.000 | 10 | 10 | 0 |
2 | 4 | 1.188 | 9.312250 | 7.525659 | 0.313 | 2.116000 | 14.486000 | 10.000 | 10 | 10 | 0 |
3 | 4 | 17.822 | 139.727250 | 79.815584 | 0.021 | 19.953000 | 268.652000 | 10.000 | 10 | 10 | 0 |
4 | 20 | 47.737 | 74.853750 | 4.014962 | 0.039 | 0.132000 | 728.240000 | 92.650 | 55 | 100 | 0 |
5 | 20 | 10.393 | 16.296000 | 6.905212 | 0.179 | 0.406000 | 76.083000 | 30.250 | 14 | 58 | 0 |
6 | 4 | 1.557 | 12.209000 | 2.304804 | 0.238 | 0.364000 | 45.836000 | 49.250 | 14 | 72 | 0 |
7 | 24 | 3.795 | 4.958458 | 0.876038 | 0.587 | 0.049000 | 33.648000 | 54.708 | 0 | 413 | 0 |
8 | 20 | 15.031 | 23.568900 | 9.495984 | 0.124 | 0.614000 | 157.444000 | 10.000 | 10 | 10 | 0 |
We notice quite a bit of variability between the different query mixes. This comes from parameter choices, and runs with different seeds are therefore not comparable, unless they are very long.
What does this promise for the 500 Gt runs? The complexities are n·log(n), with the log factor pretty constant. There will be some loss of speed from less locality of reference. I expect run times that are 8x or so higher: there is 50x more data and about 8x more CPU. The dataset does not scale that linearly throughout, as the product hierarchies may have different depth.
The working set will be OK; on each of the 4 processes, there are 3.1M buffers used, of which 2M are just read ahead, not really hit. When you read, read the whole extent of 256 pages while at it; it costs the same and may prefetch. So actually, the 4 x 1.1M = 4.4M buffers really used come to about 34 GB; times 50 is 1.7 TB. Will fit.
In the interest of advancing standards of disclosure, I am also providing the test driver output for the runs, and an excerpt of the server query log for an interactive query mix and a BI query mix. The query texts and plans are there, with per operator time and cardinality.
Generic relational performance is the necessary predecessor of any graph performance, so we will talk SQL at first; the Linked Geodata RDFisms will come later.
In this article, we will look in detail at ETL from PostGIS to Virtuoso via SQL federation. In the TPC-H series, we looked at bulk loading from files which are 1:1 shaped like the tables, but life is seldom this simple.
Here we will see how to change normalization in schemas, from a denormalized key-value pair-structure in PostGIS, to a normalized "triple table" in Virtuoso. We will also look at data type conversion, overall data transfer speed, and automatic parallelization.
ETL, even with medium data sizes, like OSM at a little under 600 GB in PostgreSQL files, is a performance game, like everything in databases. Data must move fast, expressing the transformation logic must be compact, and parallelism must be automatic. Next to nobody can write parallel code, and the few that can are needed somewhere else.
I suppose, without insider knowledge, I would dump the data into CSV; do some sed scripts or the like for the transformation, maybe in Hadoop if the data were really large; and then I would use the target database's bulk load utility. This makes the steps so simple that they can be delegated with some possibility of success. This is what data integration tends to be like. As we saw with the TPC-H bulk load, CSV loading is foolproof, easy, and fast.
Further, I would not make a JDBC program to first read one database and write into another because this would have to be explicitly multithreaded, would have loops, would require use of array parameters in order not to get killed by client server latency, would be liable to run into oddities of JDBC implementations, and so forth. Plus, this could be a few hundred lines long, and the developer would come back with questions like, "Why is it slow?" Well, it is slow because of lock contention, because transactions are not turned off, or something of the sort. No. Shell scripts and bulk load anytime.
Now we will explore a third possibility: vectored stored procedures. It is true that nobody uses stored procedures. They are sooo nineties -- where's the client side Javascript? I will introduce a design pattern that runs table-to-table copy and normalization changes, with perfect parallelism and scale-out, in SQL procedures. This will work from the file system as well, since a CSV file can be accessed as a table. For number of code lines, time-to-solution, as well as run-time performance, this is unbeatable.
The LOD2 project developed a benchmark for geo retrieval in SPARQL. We have adapted the benchmark to work in SQL against the PostgreSQL OSM schema and a Virtuoso SQL equivalent.
The intent is to run the LOD2 geobench against the planet-wide OSM dataset in PostgreSQL and Virtuoso. With Virtuoso we will also compare scale-out and single server versions.
The PostgreSQL OSM implementation exists in both normalized and denormalized variants. The denormalized variant uses an H-Store column type, which is a built-in non-first-normal-form set of key-value pairs that can occur as a column value. In Virtuoso, the equivalent would be to use an array in a column value, but this is not very efficient. Rather, we will go the normalized route, getting outstanding JOIN performance and space efficiency from the column store. Since this is a freestyle race, we take the liberty of borrowing the IRI data type from the RDF side of Virtuoso. This offers a fast mapping between names and integer identifiers. This is especially handy for tags. PostgreSQL likely has some similar encoding as part of the H-Store implementation.
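As a one-line illustration of that mapping (the tag name is just an example), the function used in the procedures further down can be called directly:
-- returns the integer IRI id persistently associated with the string;
-- repeated calls with the same string return the same id
SELECT __i2id ('highway');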
The geometry types are transferred as strings, and then re-parsed into the Virtuoso equivalents. The EWKT syntax is compatible between the systems. The potentially long geometries are stored in a LONG ANY column, and the always short ones (e.g., bounding boxes and points) in an ANY column. In both implementations, there is an R-tree index on the points but not on the linestrings.
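To make the target concrete, here is a hedged sketch of the two Virtuoso tables implied by the procedures below; the column names follow the procedure parameters, and the exact types and keys in the real schema may differ.
-- ways: one row per OSM way, with geometries kept as parsed objects
CREATE TABLE ways (
  id            BIGINT PRIMARY KEY,
  version       INTEGER,
  user_id       INTEGER,
  tstamp        DATETIME,
  changeset_id  BIGINT,
  linestring    LONG ANY,    -- potentially long geometry
  bbox          ANY          -- always short (bounding box)
);
-- ways_tags: the normalized (tag, way, value) "triple table"
CREATE TABLE ways_tags (
  tag  IRI_ID,
  way  BIGINT,
  val  ANY,
  PRIMARY KEY (tag, way)
);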
We will later look at space consumption and access locality in more detail.
To ETL the PostgreSQL-based dataset, we attach the OSM tables as remote tables using Virtuoso's SQL federation (VDB) feature. This is not in the Open Source Edition (VOS), but you can get the same effect by dumping the tables into files and defining the files as tables with the file-table feature. The tables which have no need of special transformation go with just an INSERT ... SELECT, like this:
log_enable (2);
INSERT INTO users
SELECT *
FROM users1 &
The tables which have special datatypes (like geometries or H-Stores) need a little application logic, like this:
CREATE PROCEDURE copy_ways ()
{
log_enable (2);
RETURN
( SELECT COUNT (ins_ways ( id,
version,
user_id,
tstamp,
changeset_id,
tags,
linestring_wkt,
bbox_wkt
) )
FROM ways1)
;
}
The first line disables logging and makes inserts non-transactional. The rest does the copy. The scan of the remote table is automatically split by ranges of its primary key, so there is no need for explicit parallelism. The ins_ways function is called on each thread, on a whole vector of values for each column. In this way operations are batched together, gaining by locality and eliminating interpretation overhead.
The ins_ways procedure follows:
CREATE PROCEDURE ins_ways
( IN id BIGINT,
IN version INT,
IN user_id INT,
IN tstamp DATETIME,
IN changeset_id BIGINT,
IN tags ANY ARRAY,
IN linestring VARCHAR,
IN bbox VARCHAR
)
{
-- The vectored declaration means that each statement is run on the full input before going to the next.
-- Thus, by default, the insert gets 10K consecutive rows to insert. The conversion functions like st_ewkt_read are also run in a tight loop over a large number of values.
VECTORED;
INSERT INTO ways
VALUES ( id,
version,
user_id,
tstamp,
changeset_id,
st_ewkt_read
( charset_recode
( linestring, '_WIDE_', 'UTF-8' )
),
st_ewkt_read
( charset_recode
( bbox, '_WIDE_', 'UTF-8' )
)
)
;
-- The tags argument is a vector of strings where each string is a serialization of the H-Store content. split_and_decode splits each string into an array at the delimiter.
tags :=
split_and_decode
( TRIM
( REPLACE
( REPLACE
( REPLACE
( REPLACE
( tags,
'"=>"',
'!!!'
),
'&',
'%26'
),
'", "',
'&'
),
'=',
'%3D'
),
'"'
)
);
NOT VECTORED
{
DECLARE a1,
b1 VARCHAR
;
DECLARE ws,
vs,
ts ANY ARRAY
;
DECLARE n_sets,
n_tags,
set_no,
wid,
inx,
pos,
fill INT
;
-- We insert triples of the form (tag, way_id, tag_value). For each of these, we reserve an array of 100K elements. We put the values into the array, and insert when full or when all rows of input are done. An insert of 100K values in one go is much faster than inserting 100K values singly, especially on a cluster.
ws := make_array (100000, 'ANY');
ts := make_array (100000, 'ANY');
vs := make_array (100000, 'ANY');
fill := 0;
DECLARE tag_arr,
str ANY ARRAY;
n_sets := vec_length (tags);
-- For each row of input to the vectored function:
FOR ( set_no := 0 ;
set_no < n_sets ;
set_no := set_no + 1
)
{
wid := vec_ref (id, set_no);
tag_arr := vec_ref (tags, set_no);
n_tags := LENGTH (tag_arr);
-- for each tag in the H-Store string:
FOR ( inx := 0;
inx < n_tags;
inx := inx + 2)
{
-- split the tag into a key and a value at the !!! delimiter
str := tag_arr[inx];
pos := strstr(str, '!!!');
a1 := substring(str, 1, pos);
b1 := subseq(str, pos + 3);
-- add to the array of key-value pairs to insert
way_tag_add (ws, ts, vs, fill, wid, a1, b1);
}
}
way_tag_ins (ws, ts, vs);
}
}
Now we define the functions for adding a (way, key, value) triple into the batch, and for inserting the batch.
CREATE PROCEDURE way_tag_ins
( INOUT ws ANY ARRAY,
INOUT ts ANY ARRAY,
INOUT vs ANY ARRAY
)
{
-- given an array of way ids, tag names, and tag values, insert all rows where the tag is not 0. If the tag is empty, call it unknown instead.
-- The __i2id function replaces the tag name with an IRI ID that is persistently mapped to the name. The insert and the tag name-to-id mapping are done as a single operation. This is a single network round trip for each in a cluster setting.
FOR VECTORED
( IN wid INT := ws,
IN tag ANY := ts,
IN val VARCHAR := vs
)
{
IF (tag <> 0)
{
IF ('' = tag)
tag := 'unknown';
INSERT INTO ways_tags
VALUES ( __i2id (tag), wid, val );
}
}
}
CREATE PROCEDURE way_tag_add
( INOUT ws ANY ARRAY,
INOUT ts ANY ARRAY,
INOUT vs ANY ARRAY,
INOUT fill INT,
IN wid INT,
INOUT tg VARCHAR,
INOUT val VARCHAR
)
{
-- Add at the end of the arrays; if full, insert the content and replace with fresh arrays.
-- The INOUT keyword means call by reference, which is important; you do not want to copy larger arrays, and you want to return new ones to the caller.
ws[fill] := wid;
ts[fill] := tg;
vs[fill] := val;
fill := fill + 1;
IF (100000 = fill)
{
way_tag_ins (ws, ts, vs);
fill := 0;
ws := make_array (100000, 'ANY');
ts := make_array (100000, 'ANY');
vs := make_array (100000, 'ANY');
}
}
The same logic can be applied to any simple data transformation task. Vectoring and automatic parallelism make sure that there is full platform utilization without explicitly working with threads. The NOT VECTORED {} section allows the procedure to aggregate over all the values in a vector. The FOR VECTORED construct in the insert function switches back into running on a vector composed in the scalar part, so as to get the insert throughput and cluster-friendly message pattern.
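For completeness, a usage sketch: once the procedures are loaded, one call runs the whole parallel copy.
-- the scan of the attached ways1 table is split over threads automatically,
-- so this single call keeps the machine busy; it returns the number of
-- source rows processed
SELECT copy_ways ();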
Because every non-1NF hack in every application is different, it is not possible to make this fully declarative. But the code is very repetitive, and a skeleton could easily be generated from the schema.
In the next installment, we will analyze the performance of copying the full Open Street Map dataset from PostgreSQL to Virtuoso. To be continued...
The meeting was very well attended, including most of the new advisory board: Xavier Lopez from Oracle, Luis Ceze from the University of Washington, and Abraham Bernstein of the University of Zurich were present. Jans Aasman of Franz, Inc., and Karl Huppler, former chairman of the TPC, were not present but are signed up as advisory board members.
We had great talks by the new board members and invited graph and RDF DB users.
Nuno Carvalho of Fujitsu Labs presented on the Fujitsu RDF use cases and benchmarking requirements, based around streaming analytics on time-series data. The technology platform is diverse, with anything from RDF stores to HBase. The challenge is integration. I pointed out that with the Virtuoso column store, you could now efficiently host time-series data alongside RDF. Sure, a relational format is more efficient with time-series data, but it can be co-located with RDF, and queries can join between the two. This is especially so after our stellar bulk-load speed measured with the TPC-H dataset.
Luis Ceze of the University of Washington presented Grappa, a C++ graph programming framework that in his words would be like the Cray XMT (later YarcData), but in software. The idea is to have a graph algorithm divided into small executable steps, millions in number, and to have very efficient scheduling and switching between these, building latency tolerance into every step of the application. Commodity interconnects like InfiniBand deliver bad throughput with small messages, but with endless message-combination opportunities from millions of mini work units, the overall throughput stays good. We know the same from all the Virtuoso scale-out work. Luis is presently working on GraphBench, a research project at the University of Washington funded by Oracle for graph algorithm benchmarking. The major interest for LDBC is in having a library of common graph analytics as a starting point. Having these, the data generation can further evolve so as to create challenges for the algorithms. One issue that came up is the question of validating graph algorithm results: unlike in SQL queries, there is not necessarily a single correct answer. If the algorithm to use and the count of iterations to run are not fully specified, response times will vary widely. Random walks will anyway create variation between consecutive runs.
Abraham Bernstein presented about the work on his Signal/Collect graph programming framework and its applications in fraud detection. He also talked about the EU FP7 project ViSTA-TV which does massive stream processing around the real time behavior of internet TV users. Again, Abraham gave very direct suggestions for what to include in the LDBC graph analytics workload.
Andreas Both of Unister presented on RDF ontology-driven applications in an e-commerce context. Unister is Germany’s leading e-commerce portal operator, with a large number of properties ranging from travel to most B2C areas. The RDF use cases are many, in principle extending down to final content delivery, but high online demand often calls for specialized solutions like bit-field intersections for combining conditions. Sufficiently advanced database technology may offer this too, but this is not a given. Selecting travel destinations based on attributes like sports opportunities, culture, etc., can be made into efficient query plans, but this also requires perfect query plans for short queries. I expect to learn more about this when visiting on site. There is clear input for LDBC in these workloads.
There were three talks on semantic applications in cultural heritage. Robina Clayphan of Europeana talked about this pan-European digital museum and library, and the Europeana Data Model (EDM). C.E. Ore of the University of Oslo talked about the CIDOC CRM (Conceptual Reference Model) ontology (ISO standard 21127:2006) and its role in representing cultural, historic, and archaeological information. Atanas Kiryakov of Ontotext gave a talk on a possible benchmark around CIDOC CRM reasoning. In the present LDBC work, RDF inference plays a minor role, but reasoning would be emphasized with this CRM workload, in which the inference needed revolves around abbreviating unions between many traversal paths of different lengths between modeled objects. The data is not very large but the ontology has a lot of detail. This still is not the elusive use case which would really require all the OWL complexities. We will first see how the semantic publishing benchmark work led by Ontotext in LDBC plays out. There is anyhow work enough there.
The most concrete result was that the graph analytics part of the LDBC agenda starts to take shape. The LDBC organization is getting formed, and its processes and policies are getting defined. I visited Thomas Neumann’s group in Munich just prior to the TUC meeting to work on this. Nowadays Peter Boncz, who was recently awarded the Humboldt Prize, goes to Munich on a weekly basis, so Munich is the favored destination for much LDBC-related work.
The first workload of the Social Network Benchmark is taking shape, and there is good progress also in the Semantic Publishing Benchmark. I will in a future post give more commentary on these workloads, now that the initial drafts from the respective task forces are out.
This installment looks at OR and IN. Also, proper order between JOINs and expressions is tested in Q21.
IN Predicates
One of the choke points mentioned in the TPC-H Analyzed paper is the IN predicate with a list of constants. This occurs in Q12 and Q22. Q12 is the simpler of the two:
SELECT l_shipmode ,
SUM ( CASE
WHEN o_orderpriority = '1-URGENT'
OR o_orderpriority = '2-HIGH'
THEN 1
ELSE 0
END
) AS high_line_count,
SUM ( CASE
WHEN o_orderpriority <> '1-URGENT'
AND o_orderpriority <> '2-HIGH'
THEN 1
ELSE 0
END
) AS low_line_count
FROM orders,
lineitem
WHERE o_orderkey = l_orderkey
AND l_shipmode IN ('MAIL', 'SHIP')
AND l_commitdate < l_receiptdate
AND l_shipdate < l_commitdate
AND l_receiptdate >= CAST ('1994-01-01' AS DATE)
AND l_receiptdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
GROUP BY l_shipmode
ORDER BY l_shipmode
The execution profile:
{
time 2.2e-05% fanout 1 input 1 rows
time 0.00023% fanout 1 input 1 rows
Precode:
0: chash_in_init := Call chash_in_init ( 182 , $29 "chash_in_tree", 0 , 0 , <c MAIL>, <c SHIP>)
5: BReturn 0
{ fork
time 1.2e-05% fanout 1 input 1 rows
{ fork
time 86% fanout 2.60039e+07 input 1 rows
LINEITEM 2.4e+06 rows(.L_COMMITDATE, .L_RECEIPTDATE, .L_SHIPDATE, .L_ORDERKEY, .L_SHIPMODE)
L_RECEIPTDATE >= <c 1994-01-01> < <c 1995-01-01>
hash partition+bloom by 0 ()
time 4.4% fanout 0.119803 input 2.60039e+07 rows
END Node
After test:
0: if (.L_COMMITDATE < .L_RECEIPTDATE) then 4 else 9 unkn 9
4: if (.L_SHIPDATE < .L_COMMITDATE) then 8 else 9 unkn 9
8: BReturn 1
9: BReturn 0
time 7.9% fanout 1 input 3.11534e+06 rows
ORDERS unq 1 rows (.O_ORDERPRIORITY)
inlined O_ORDERKEY = k_.L_ORDERKEY
After code:
0: if (.O_ORDERPRIORITY = <c 1-URGENT>) then 13 else 4 unkn 13
4: if (.O_ORDERPRIORITY = <c 2-HIGH>) then 13 else 8 unkn 13
8: callretSearchedCASE := := artm 1
12: Jump 17 (level=0)
13: callretSearchedCASE := := artm 0
17: if (.O_ORDERPRIORITY = <c 1-URGENT>) then 25 else 21 unkn 21
21: if (.O_ORDERPRIORITY = <c 2-HIGH>) then 25 else 30 unkn 30
25: callretSearchedCASE := := artm 1
29: Jump 34 (level=0)
30: callretSearchedCASE := := artm 0
34: BReturn 0
time 1.3% fanout 0 input 3.11534e+06 rows
Sort (.L_SHIPMODE) -> (callretSearchedCASE, callretSearchedCASE)
}
-- rest left out
1077 msec 2014% cpu, 3.11419e+06 rnd 5.99896e+08 seq 97.9872% same seg 1.74106% same pg
The top item in the profile is the predicate on l_receiptdate. TPC-H Analyzed correctly points out that a lineitem table in date order is best here, because zone maps will work on l_receiptdate, since this is correlated with l_shipdate, which is the best date-ordering column, as it is the most used. The date compare is done first in the scan, as it selects 1/7 and is fast, whereas the IN selects 2/7 and has more instructions on the execution path.
SELECT cntrycode,
COUNT(*) AS numcust,
SUM(c_acctbal) AS totacctbal
FROM
( SELECT SUBSTRING(c_phone, 1, 2) AS cntrycode,
c_acctbal
FROM customer
WHERE SUBSTRING(c_phone, 1, 2)
IN ('13', '31', '23', '29', '30', '18', '17')
AND c_acctbal > ( SELECT AVG(c_acctbal)
FROM customer
WHERE c_acctbal > 0.00
AND SUBSTRING(c_phone, 1, 2)
IN ('13', '31', '23', '29', '30', '18', '17')
)
AND NOT EXISTS ( SELECT *
FROM orders
WHERE o_custkey = c_custkey
)
) AS custsale
GROUP BY cntrycode
ORDER BY cntrycode
Q22 has a condition on a substring of a string column. This merits a separate trick, namely merging the substring extraction into the scan, so the invisible hash-join predicate reads the column, cuts the substring in place, calculates a hash number, Bloom-filters this against the IN set, then finally outputs the row numbers which match. This operation is run-time re-orderable with other conditions, like the test on c_acctbal. TPC-DS has a similar pattern in some queries.
The profile follows.
Q22 is one of the rare queries that clearly benefit from having an index on a foreign key column. The NOT EXISTS with orders could be done by hash, but then the hash would have to have every DISTINCT o_custkey, and the hash build could be filtered by a JOIN on customer with the conditions on c_acctbal and c_phone repeated. Being inside an existence, the number of orders would not have to be retained, so the hash table would not end up larger than the probe side. The payoff in Q22 makes it worthwhile to maintain an index on o_custkey in the refresh functions.
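A sketch of that index -- the plan below refers to it as O_CK; the exact DDL in the schema scripts may differ:
CREATE INDEX O_CK ON ORDERS (O_CUSTKEY);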
{
time 1.4e-05% fanout 1 input 1 rows
time 1.2% fanout 1 input 1 rows
Precode:
0: $27 "chash_in_init" := Call chash_in_init ( 182 , $29 "chash_in_tree", 0 , 0 , <c 13>, <c 31>, <c 23>, <c 29>, <c 30>, <c 18>, <c 17>)
5: {
time 7.9e-06% fanout 1 input 1 rows
time 6.7e-05% fanout 1 input 1 rows
{ fork
-- Read customer once, filter with the IN predicate, and add up c_acctbal and count for the average.
time 26% fanout 0 input 1 rows
CUSTOMER 1.4e+07 rows(t4.C_ACCTBAL)
C_ACCTBAL > 0
hash partition+bloom by 0 ()
After code:
0: sum count 1
5: sum sumt4.C_ACCTBAL
10: BReturn 0
}
After code:
0: temp := artm sum / count
4: aggregate := := artm temp
8: BReturn 0
time 2.6e-05% fanout 0 input 1 rows
Subquery Select(aggregate)
}
13: BReturn 0
{ fork
time 1.9e-05% fanout 1 input 1 rows
{ fork
-- second scan of customer with the test on c_acctbal > average and the same IN predicate.
time 26% fanout 1.90967e+06 input 1 rows
CUSTOMER 2.7e+05 rows(t2.C_CUSTKEY, t2.C_PHONE, t2.C_ACCTBAL)
C_ACCTBAL > k_scalar
hash partition+bloom by 0 ()
time 7% fanout 0.333434 input 1.90967e+06 rows
END Node
After test:
0: if ({
time 0.26% fanout 1 input 1.90967e+06 rows
-- See if the customer has orders. The index lookup is very fast since the keys come in order from the scan of customer.
time 5% fanout 9.9955 input 1.90967e+06 rows
O_CK 9.9 rows()
inlined O_CUSTKEY = k_t2.C_CUSTKEY
time 1.6% fanout 0 input 1.90881e+07 rows
Subquery Select( <none> )
}
) then 5 else 4 unkn 5
4: BReturn 1
5: BReturn 0
time 3.1% fanout 1 input 636749 rows
Precode:
0: cntrycode := Call substring (t2.C_PHONE, 1 , 2 )
5: BReturn 0
Stage 2
time 0.5% fanout 0 input 636749 rows
Sort (q_cntrycode) -> (t2.C_ACCTBAL, inc)
}
time 0.011% fanout 7 input 1 rows
group by read node
(cntrycode, totacctbal, numcust)in each partition slice
time 0.0013% fanout 0 input 7 rows
Sort (cntrycode) -> (numcust, totacctbal)
}
time 0.00016% fanout 7 input 1 rows
Key from temp (cntrycode, numcust, totacctbal)
time 1.4e-05% fanout 0 input 7 rows
Select (cntrycode, numcust, totacctbal)
}
306 msec 2107% cpu, 1.90678e+06 rnd 4.77961e+07 seq 99.5612% same seg 0.419137% same pg
This identifies suppliers from a given country that have kept orders waiting, i.e., they supply a delayed lineitem and nobody else in the order supplies a delayed lineitem.
SELECT TOP 100 s_name,
COUNT(*) AS numwait
FROM supplier,
lineitem l1,
orders,
nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS ( SELECT *
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS ( SELECT *
FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = 'SAUDI ARABIA'
GROUP BY s_name
ORDER BY numwait desc,
s_name
;
The crucial thing the optimizer must realize is that conditions that depend on a table should not always be placed right after the table, because there could be a selective join that costs less. In this case, the plan comes out as a scan of orders selecting 1/2; then an index lookup on lineitem, which is very efficient because the keys come in order and are tightly local; then the selective hash join with suppliers from the country is merged into the index lookup. After this, 1/50 of lineitems are left. Only for these does the system actually fetch the dates. After this comes the series of tests, ordered similarly to how joins are ordered. First the cheap date comparison, then the subqueries. The lineitems can be fetched by their primary key, which again comes in order. Doing this by hash would duplicate most of the query on the build side and would be a lot of trouble.
Another join order would be lineitem first, selecting 1/25 with the merged hash-join with supplier, then orders by index, selecting 1/2, then the existences. Experiment shows there is no great difference.
{
time 4.5e-06% fanout 1 input 1 rows
time 0.0036% fanout 1 input 1 rows
{ hash filler
Subquery 27
{
time 3.8e-05% fanout 1 input 1 rows
NATION 1 rows(t4.N_NATIONKEY)
N_NAME = <c SAUDI ARABIA>
time 0.01% fanout 39953 input 1 rows
SUPPLIER 4.2e+04 rows(t1.S_SUPPKEY, t1.S_NAME)
S_NATIONKEY = t4.N_NATIONKEY
After code:
0: t1.S_SUPPKEY := := artm t1.S_SUPPKEY
4: t1.S_NAME := := artm t1.S_NAME
8: BReturn 0
time 0.0019% fanout 0 input 39953 rows
Sort hf 48 (t1.S_SUPPKEY) -> (t1.S_NAME)
}
}
time 2.8e-06% fanout 1 input 1 rows
{ fork
time 2.6e-06% fanout 1 input 1 rows
{ fork
time 6.2% fanout 7.30725e+07 input 1 rows
ORDERS 7.3e+07 rows(.O_ORDERKEY)
O_ORDERSTATUS = <c F>
time 33% fanout 0.142002 input 7.30725e+07 rows
LINEITEM 0.49 rows(l1.L_RECEIPTDATE, l1.L_COMMITDATE, l1.L_ORDERKEY, l1.L_SUPPKEY)
inlined L_ORDERKEY = .O_ORDERKEY
hash partition+bloom by 58 (tmp)hash join merged always card 0.04 -> (.S_NAME)
time 9.7% fanout 0.0341517 input 1.03764e+07 rows
END Node
After test:
0: if (l1.L_RECEIPTDATE > l1.L_COMMITDATE) then 4 else 13 unkn 13
4: if ({
time 0.18% fanout 0.630264 input 1.03764e+07 rows
time 6.4% fanout 4.99133 input 6.53989e+06 rows
LINEITEM 1.1 rows(l3.L_SUPPKEY, l3.L_RECEIPTDATE, l3.L_COMMITDATE)
inlined L_ORDERKEY = k_l1.L_ORDERKEY
time 1.4% fanout 0.504963 input 3.26427e+07 rows
END Node
After test:
0: if (l3.L_RECEIPTDATE > l3.L_COMMITDATE) then 4 else 9 unkn 9
4: if (l3.L_SUPPKEY = l1.L_SUPPKEY) then 9 else 8 unkn 9
8: BReturn 1
9: BReturn 0
time 0.17% fanout 0 input 1.64834e+07 rows
Subquery Select( <none> )
}
) then 13 else 8 unkn 13
8: if ({
time 0.052% fanout 0.0570441 input 1.03764e+07 rows
time 1.1% fanout 2.13264 input 591914 rows
LINEITEM 3.7 rows(l2.L_SUPPKEY)
inlined L_ORDERKEY = k_l1.L_ORDERKEY
time 0.047% fanout 0.531095 input 1.26234e+06 rows
END Node
After test:
0: if (l2.L_SUPPKEY = l1.L_SUPPKEY) then 5 else 4 unkn 5
4: BReturn 1
5: BReturn 0
time 0.013% fanout 0 input 670421 rows
Subquery Select( <none> )
}
) then 12 else 13 unkn 13
12: BReturn 1
13: BReturn 0
time 0.0086% fanout 1 input 354373 rows
Hash source 48 merged into ts 0.04 rows(k_l1.L_SUPPKEY) -> (.S_NAME)
time 1.5% fanout 1 input 354373 rows
Stage 2
time 0.18% fanout 0 input 354373 rows
Sort (q_.S_NAME) -> (inc)
}
-- rest left out
2301 msec 2092% cpu, 8.01116e+07 rnd 3.95017e+08 seq 99.4284% same seg 0.475267% same pg
Compilation: 2 msec 0 reads 0% read 0 messages 0% clw
SELECT SUM(l_extendedprice* (1 - l_discount)) AS revenue
FROM lineitem,
part
WHERE ( p_partkey = l_partkey
AND p_brand = 'Brand#12'
AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
AND l_quantity >= 1
AND l_quantity <= 1 + 10
AND p_size BETWEEN 1 AND 5
AND l_shipmode IN ('AIR', 'AIR REG')
AND l_shipinstruct = 'DELIVER IN PERSON'
)
OR
(
p_partkey = l_partkey
AND p_brand = 'Brand#23'
AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
AND l_quantity >= 10
AND l_quantity <= 10 + 10
AND p_size BETWEEN 1 AND 10
AND l_shipmode IN ('AIR', 'AIR REG')
AND l_shipinstruct = 'DELIVER IN PERSON'
)
OR
(
p_partkey = l_partkey
AND p_brand = 'Brand#34'
AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
AND l_quantity >= 20
AND l_quantity <= 20 + 10
AND p_size BETWEEN 1 AND 15
AND l_shipmode in ('AIR', 'AIR REG')
AND l_shipinstruct = 'DELIVER IN PERSON'
)
The essential trick is to recognize that each of the terms of the OR has the join condition between lineitem and part, and the l_shipmode and l_shipinstruct conditions, in common. After extracting these, the OR is split into two more ORs, one with conditions on part and the other with conditions on lineitem only. A hash is made of the matching parts, where parts that correspond to none of the 3 ORed ANDs are left out. Then there is a scan of lineitem with the hash lookup merged in. The merged hash lookup does in this case produce result columns, which are further tested later in the query.
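A sketch of the derived part-side condition: this is what the hash build side (the hash filler node in the profile below) keeps, while the full brand/quantity pairing of the original OR is still re-checked on the joined rows.
-- parts admitted into the hash; a part matching none of the three branches
-- can never contribute to the result and is left out of the build side
SELECT p_partkey, p_brand, p_size, p_container
FROM part
WHERE ( p_brand = 'Brand#12' AND p_size BETWEEN 1 AND 5
        AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') )
   OR ( p_brand = 'Brand#23' AND p_size BETWEEN 1 AND 10
        AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') )
   OR ( p_brand = 'Brand#34' AND p_size BETWEEN 1 AND 15
        AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') );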
{
time 1.7e-05% fanout 1 input 1 rows
time 0.018% fanout 1 input 1 rows
Precode:
0: chash_in_init := Call chash_in_init ( 182 , $29 "chash_in_tree", 0 , 0 , <c AIR>, <c AIR REG>)
5: temp := artm 1 + 10
9: temp := artm 10 + 10
13: temp := artm 20 + 10
17: BReturn 0
{ hash filler
time 4.5% fanout 2e+07 input 1 rows
PART 7.2e+04 rows(.P_BRAND, .P_CONTAINER, .P_SIZE, .P_PARTKEY)
P_SIZE >= 1
time 16% fanout 0.00240925 input 2e+07 rows
END Node
After test:
0: if (.P_BRAND = <c Brand#12>) then 4 else 17 unkn 17
4: if (.P_SIZE <= 5 ) then 8 else 17 unkn 17
8: one_of_these := Call one_of_these (.P_CONTAINER, <c SM CASE>, <c SM BOX>, <c SM PACK>, <c SM PKG>)
13: if ( 0 < one_of_these) then 51 else 17 unkn 17
17: if (.P_BRAND = <c Brand#23>) then 21 else 34 unkn 34
21: if (.P_SIZE <= 10 ) then 25 else 34 unkn 34
25: one_of_these := Call one_of_these (.P_CONTAINER, <c MED BAG>, <c MED BOX>, <c MED PKG>, <c MED PACK>)
30: if ( 0 < one_of_these) then 51 else 34 unkn 34
34: if (.P_BRAND = <c Brand#34>) then 38 else 52 unkn 52
38: if (.P_SIZE <= 15 ) then 42 else 52 unkn 52
42: one_of_these := Call one_of_these (.P_CONTAINER, <c LG CASE>, <c LG BOX>, <c LG PACK>, <c LG PKG>)
47: if ( 0 < one_of_these) then 51 else 52 unkn 52
51: BReturn 1
52: BReturn 0
time 0.058% fanout 0 input 48185 rows
Sort hf 52 (.P_PARTKEY) -> (.P_SIZE, .P_CONTAINER, .P_BRAND)
}
time 2.4e-05% fanout 1 input 1 rows
{ fork
time 79% fanout 46004 input 1 rows
LINEITEM 1.1e+07 rows(.L_QUANTITY, .L_PARTKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_SHIPMODE)
L_SHIPINSTRUCT = <c DELIVER IN PERSON>
hash partition+bloom by 0 ()
hash partition+bloom by 59 (tmp)hash join merged always card 0.00032 -> (.P_SIZE, .P_CONTAINER, .P_BRAND)
time 0.011% fanout 0.599796 input 46004 rows
END Node
After test:
0: if (.L_QUANTITY <= temp) then 4 else 8 unkn 8
4: if (.L_QUANTITY >= 1 ) then 24 else 8 unkn 8
8: if (.L_QUANTITY <= temp) then 12 else 16 unkn 16
12: if ( 10 <= .L_QUANTITY) then 24 else 16 unkn 16
16: if (temp >= .L_QUANTITY) then 20 else 25 unkn 25
20: if (.L_QUANTITY >= 20 ) then 24 else 25 unkn 25
24: BReturn 1
25: BReturn 0
time 0.002% fanout 1 input 27593 rows
Precode:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
Hash source 52 merged into ts 0.00032 rows(k_.L_PARTKEY) -> (.P_SIZE, .P_CONTAINER, .P_BRAND)
time 0.053% fanout 0 input 27593 rows
END Node
After test:
0: if (.P_BRAND = <c Brand#12>) then 4 else 25 unkn 25
4: if (.L_QUANTITY >= 1 ) then 8 else 25 unkn 25
8: if (.L_QUANTITY <= temp) then 12 else 25 unkn 25
12: if (.P_SIZE <= 5 ) then 16 else 25 unkn 25
16: one_of_these := Call one_of_these (.P_CONTAINER, <c SM CASE>, <c SM BOX>, <c SM PACK>, <c SM PKG>)
21: if ( 0 < one_of_these) then 75 else 25 unkn 25
25: if (.P_BRAND = <c Brand#23>) then 29 else 50 unkn 50
29: if ( 10 <= .L_QUANTITY) then 33 else 50 unkn 50
33: if (.L_QUANTITY <= temp) then 37 else 50 unkn 50
37: if (.P_SIZE <= 10 ) then 41 else 50 unkn 50
41: one_of_these := Call one_of_these (.P_CONTAINER, <c MED BAG>, <c MED BOX>, <c MED PKG>, <c MED PACK>)
46: if ( 0 < one_of_these) then 75 else 50 unkn 50
50: if (.P_BRAND = <c Brand#34>) then 54 else 76 unkn 76
54: if (.L_QUANTITY >= 20 ) then 58 else 76 unkn 76
58: if (temp >= .L_QUANTITY) then 62 else 76 unkn 76
62: if (.P_SIZE <= 15 ) then 66 else 76 unkn 76
66: one_of_these := Call one_of_these (.P_CONTAINER, <c LG CASE>, <c LG BOX>, <c LG PACK>, <c LG PKG>)
71: if ( 0 < one_of_these) then 75 else 76 unkn 76
75: BReturn 1
76: BReturn 0
After code:
0: sum revenuetemp
5: BReturn 0
}
time 8.7e-06% fanout 0 input 1 rows
Select (revenue)
}
1315 msec 1889% cpu, 2 rnd 1.62319e+08 seq 0% same seg 0% same pg
We find a similar pattern in Q7, where an implementation is expected to extract conditions from an OR and to restrict the hash build sides with these. For:
SELECT supp_nation,
cust_nation,
l_year,
SUM(volume) AS revenue
FROM
( SELECT
n1.n_name AS supp_nation,
n2.n_name AS cust_nation,
extract(year FROM l_shipdate) AS l_year,
l_extendedprice * (1 - l_discount) AS volume
FROM supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
WHERE s_suppkey = l_suppkey
AND o_orderkey = l_orderkey
AND c_custkey = o_custkey
AND s_nationkey = n1.n_nationkey
AND c_nationkey = n2.n_nationkey
AND ( ( n1.n_name = 'FRANCE'
AND n2.n_name = 'GERMANY'
)
OR ( n1.n_name = 'GERMANY'
AND n2.n_name = 'FRANCE'
)
)
AND l_shipdate BETWEEN CAST ('1995-01-01' AS DATE) AND CAST ('1996-12-31' AS DATE)
) AS shipping
GROUP BY supp_nation,
cust_nation,
l_year
ORDER BY supp_nation,
cust_nation,
l_year
The plan builds a hash of customers from either France or Germany, then a hash of suppliers from either France or Germany. Then it scans lineitem for 2/7 of the years and selects 2/25 of the rows based on the supplier. The name of the supplier country is also returned from the merged hash lookup. Then the corresponding order is fetched by primary key, which is fast since the lineitem scan produces keys in order. A similar hash condition applies to the customer. Finally, there is code to check that the countries are different between supplier and customer. We leave out the plan in the interest of space. A single execution is between 1.7 and 1.9s; for 5 concurrent executions, the slowest takes 7.5s.
To be continued...
Q13 counts the orders of each customer and then shows, for each distinct count of orders, how many customers have that number of orders. 1/3 of the customers have no orders; hence this is an outer join between customers and orders, as follows:
SELECT c_count,
COUNT(*) AS custdist
FROM ( SELECT c_custkey,
COUNT(o_orderkey) AS c_count
FROM ( SELECT *
FROM customer
LEFT OUTER JOIN orders
ON
c_custkey = o_custkey
AND
o_comment NOT LIKE '%special%requests%'
) c_customer
GROUP BY c_custkey
) c_orders
GROUP BY c_count
ORDER BY custdist DESC,
c_count DESC
;
The only parameter of the query is the pattern in the NOT LIKE condition. The NOT LIKE is very unselective, so almost all orders will be considered.
The Virtuoso run time for Q13 is 6.7s, which we can consider a good result. Running 5 of these at the same time has the fastest execution finishing in 23.7s and the slowest in 35.3s. Doing 5x the work takes 5.2x the time. This is not bad, considering that the query has a high transient memory consumption. A second execution of 5 concurrent Q13s has the fastest finishing in 22.s and the slowest in 29.8s. The difference comes from already having the needed memory blocks cached, so there are no calls to the OS for mapping more memory.
To measure the peak memory consumption, which is a factor with this query, there is the mp_max_large_in_use counter. To reset:
__dbf_set ('mp_max_large_in_use', 0);
To read:
SELECT sys_stat ('mp_max_large_in_use');
For the 5 concurrent executions of Q13, the counter goes to 10GB. This is easily accommodated at the 100 GB scale; but at ten times the scale, this will be a significant quantity, even in a scale-out setting. The memory allocation time is recorded in the counter mp_mmap_clocks, read with sys_stat. This is a count of cycles spent waiting for mmap or munmap, and allows tracking whether the process is being slowed down by transient memory allocation.
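As a usage sketch, the same reset-and-read pattern applies; we assume here that mp_mmap_clocks can be reset with __dbf_set in the same way as the counter above:
-- reset before the run (assumed to work like mp_max_large_in_use)
__dbf_set ('mp_mmap_clocks', 0);
-- ... run the workload ...
-- read the cycles spent waiting for mmap/munmap
SELECT sys_stat ('mp_mmap_clocks');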
Let us consider how this works. The plan is as follows:
{
{ hash filler
CUSTOMER 1.5e+07 rows(t3.C_CUSTKEY)
-- Make a hash table of the 15M customers. The Stage 2 operator below means that the customers are partitioned into a number of distinct partitions based on c_custkey, which is the key of the hash table. This means that a number of disjoint hash tables are built, as many as there are concurrent threads. This corresponds to the ThreadsPerQuery ini file setting, or the enable_qp setting read and set with __dbf_set and sys_stat.
Stage 2
Sort hf 34 (q_t3.C_CUSTKEY)
}
{ fork
{ fork
{ fork
END Node
outer {
-- Here we start a RIGHT OUTER JOIN block. The below operator scans the orders table and picks out the orders which do not contain the mentioned LIKE pattern.
ORDERS 1.5e+08 rows(t4.O_CUSTKEY, t4.O_ORDERKEY)
O_COMMENT LIKE LIKE
hash partition+bloom by 80 ()
-- Below is a partitioning operator, also known as an exchange operator, which divides the stream of o_custkeys from the previous scan into different partitions, each served by a different thread.
Stage 2
-- Below is a lookup in the customer hash table. The lookup takes place in the partition determined by the o_custkey being looked up.
Hash source 34 not partitionable 1 rows(q_t4.O_CUSTKEY) -> ()
right oj, key out ssls: (t3.C_CUSTKEY)
After code:
0: t3.C_CUSTKEY := := artm t4.O_CUSTKEY
4: BReturn 0
-- The below is a RIGHT OUTER JOIN end operator; see below for further description.
end of outer}
set_ctr
out: (t4.O_ORDERKEY, t4.O_CUSTKEY)
shadow: (t4.O_ORDERKEY, t4.O_CUSTKEY)
Precode:
0: isnotnull := Call isnotnull (t4.O_ORDERKEY)
5: BReturn 0
-- The below sort is the innermost GROUP BY. The ISNOTNULL above makes a 0 or a 1, depending on whether an o_custkey was found for the c_custkey of the customer.
Sort (t3.C_CUSTKEY) -> (isnotnull)
}
-- The below operators start after the above have executed to completion on every partition. We read the first aggregation, containing for each customer the COUNT of orders.
group by read node
(t3.C_CUSTKEY, aggregate)in each partition slice
After code:
0: c_custkey := := artm t3.C_CUSTKEY
4: c_count := := artm aggregate
8: BReturn 0
Subquery Select(c_custkey, c_count)
-- Below is the second GROUP BY; for each COUNT, we count how many customers have this many orders.
Sort (c_count) -> (inc)
}
group by read node
(c_count, custdist)
-- Below is the final ORDER BY.
Sort (custdist, c_count)
}
Key from temp (c_count, custdist)
Select (c_count, custdist)
}
The CPU profile starts as follows:
971537 31.8329 setp_chash_run
494300 16.1960 hash_source_chash_input_1i_n
262218 8.5917 clrg_partition_dc
162773 5.3333 strstr_sse42
68049 2.2297 memcpy_16
65883 2.1587 cha_insert_1i_n
57515 1.8845 hs_send_output
56093 1.8379 cmp_like_const
53752 1.7612 gb_aggregate
51274 1.6800 cha_rehash_ents
...
The GROUP BY is on top, with 31%. This is the first GROUP BY, which has one group per customer, for a total of 15M groups. Below the GROUP BY is the hash lookup of the hash join from orders to customer. The third item is partitioning of a data column (dc, or vectored query variable); the partitioning refers to the operator labeled Stage 2 above, and from one column of values it makes several. In the 4th place, we have the NOT LIKE predicate on o_comment; this is a substring search implemented using SSE 4.2 instructions. Finally, in the last place, there is a function for resizing a hash table; in the present case, the hash table of the innermost GROUP BY.
At this point, we have to explain the RIGHT OUTER JOIN: Generally, when making a hash join, the larger table is on the probe side and the smaller on the build side. This means that the rows on the build side get put in a hash table, and then for each row on the probe side there is a lookup to see if there is a match in the hash table.
However, here the bigger table is on the right side of the LEFT OUTER JOIN. Normally, one would have to make the hash table from the orders table and then probe it with customer, so that one would find no match for the customers with no orders and several matches for the customers with many orders. However, this would be much slower. So there is a trick for reversing the process: You still build the hash from the smaller set in the JOIN, but now, for each key that does get probed, you set a bit in a bit mask, in addition to sending the match as output. After all outputs have been generated, you look in the hash table for the entries where the bit is not set. These correspond to the customers with no orders. For these, you send the c_custkey with a NULL o_orderkey to the next operator in the pipeline, which is the GROUP BY on c_custkey with the count of non-NULL o_orderkeys.
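In SQL terms, the rows contributed by the untouched hash entries are simply the customers without qualifying orders, counted as zero. The following is only a restatement of what the bit-mask pass produces, not of how it is executed:
-- the null-extended rows added after the probe phase correspond to
SELECT c_custkey, 0 AS c_count
FROM customer
WHERE NOT EXISTS ( SELECT 1
                   FROM orders
                   WHERE o_custkey = c_custkey
                     AND o_comment NOT LIKE '%special%requests%' );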
One might at first think that such a backwards way of doing an outer join is good for nothing but this benchmark and should be considered a benchmark special. This is not so, though, as there are accepted implementations that do this very thing.
Furthermore, getting a competitive score in any other way is impossible, as we shall see below.
We further note that the grouping key in the innermost GROUP BY is the same as the hash key in the last hash join, i.e., o_custkey. This means that the GROUP BY and the hash join could be combined into a single operator called a GROUPJOIN. If this were done, the hash would be built from customer with extra space left for the counters. This would in fact remove the hash join from the profile, as well as the rehash of the GROUP BY hash table, for a gain of about 20%. The outer join behavior is not a problem here, since untouched buckets, e.g., customers without orders, would be initialized with a COUNT of 0. For an inner join behavior, one would simply leave out the zero counts when reading the GROUP BY. At the end of the series, we will see what the DBT3 score will be. We remember that there is a 1.5s savings to be had here for the throughput score if the score is not high enough otherwise. The effect on the power score will be less, because that only cares about relative speedup, not absolute time.
Next, we disable the RIGHT OUTER JOIN optimization and force the JOIN to build a hash on orders and to probe it with customer. The execution time is 25s. Most of the time goes into building the hash table of orders, and the memory consumption also goes up, to around 8G. Then we try the JOIN by index, with a scan of customer and, for each customer, an index lookup of orders based on an index on o_custkey. Here we note that there is a condition on a dependent column of orders, namely o_comment, which requires joining from the o_ck index to the main row. There is a gain, however, because the GROUP BY becomes ordered; i.e., there is no need to keep groups around for customers that have already been seen, since we know they will not come again, the outer scan being in order of c_custkey. For this reason, the memory consumption of the GROUP BY goes away. However, the index-based plan is extremely sensitive to vector size: The execution takes 29.4s if the vector size is allowed to grow to 1MB, but 413s if it stays at the default of 10KB. The difference is in the 1MB vector hitting 1 row in 150 (1 million lookups for a 150 million row table), whereas the 10KB vector hits 1 in 15000. Thus, the benefits from vectoring the lookups are largely lost, since there are hardly ever hits in the same segment, in this case within 2000 rows. But this is not the main problem: The condition on the main row is a LIKE on a long column. Thus, the whole column for the segment in question must be accessed for reading, meaning 2000 or so o_comments, of which one will be checked. If instead of a condition on o_comment we have one on o_totalprice > 0, we get 93s with the 10KB vector size and 15s with the dynamic vector size up to 1MB.
If we now remove the condition on dependent columns of orders, the index plan becomes faster, since the whole condition is resolved within the o_custkey index -- 2.5s with the 10KB vector size, 2.6s with the dynamic vector size up to 1MB. The point here is that the access from customer to orders on the o_custkey index is ordered, like a merge join.
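For reference, the vector size behavior used in this comparison is controlled from the ini file; a minimal sketch, assuming the usual [Parameters] names (VectorSize for the default, AdjustVectorSize to let index lookups grow the vector at run time, MaxVectorSize as the upper bound corresponding to the 1MB case above):
[Parameters]
VectorSize       = 10000
AdjustVectorSize = 1
MaxVectorSize    = 1000000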
Q13 is a combination of many choke points from the TPC-H Analyzed paper. The most important is special JOIN types, i.e., RIGHT OUTER JOIN and GROUPJOIN. Then there is string operation performance for the substring matching with LIKE. This needs to be implemented with the SSE 4.2 string instructions; otherwise there is a hit of about 0.5s on query speed.
The TPC-H Analyzed paper was written against the background of the analytical DB tradition, where the dominant JOIN type is hash, except when there is a merge between two sets that are ordered or at least clustered on the same key. Clustered here means physical order, without the need to be strictly in key order.
Here I have added some index-based variants to show that hash join indeed wins, and to point out the sensitivity of random access to vector size. As column stores go, Virtuoso is especially good at random access. This must be so, since it was optimized to do RDF well, which entails a lot of lookups. Also note how a big string column goes with great ease in a sequential scan, but kills performance in a non-local random access pattern.
The query is below. The date is a parameter (a value near the end of the l_shipdate range is used), so nearly the whole of lineitem gets read.
SELECT l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= dateadd('DAY', -90, CAST ('1998-12-01' AS DATE))
GROUP BY l_returnflag,
l_linestatus
ORDER BY l_returnflag,
l_linestatus
;
We note that a COUNT of a non-nullable column is the same as COUNT (*), and that AVG of a non-null column is SUM (column) / COUNT (*). So COUNT (*) occurs 4 times, and SUM (l_extendedprice) and SUM (l_quantity) each occur twice. The grouping columns have few distinct values.
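Concretely, the per-group accumulators the engine needs reduce to the following; the AVGs are computed at the end by division, as the profile further down shows. This is a restatement of the identities above, not extra syntax in the query:
-- per-group accumulators actually maintained
SUM (l_quantity), SUM (l_extendedprice), SUM (l_discount),
SUM (l_extendedprice * (1 - l_discount)),
SUM (l_extendedprice * (1 - l_discount) * (1 + l_tax)),
COUNT (*)
-- finalization, once per group
avg_qty   = SUM (l_quantity)      / COUNT (*)
avg_price = SUM (l_extendedprice) / COUNT (*)
avg_disc  = SUM (l_discount)      / COUNT (*)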
TPC-H Analyzed suggests using an array-based GROUP BY, because there can only be 64K combinations of 2 single-character values; the grouping keys are declared CHAR (1) and non-nullable. The Virtuoso implementation does not do this, though.
This query has been treated in many papers because it cannot be implemented in very many ways, it is easy to understand, and it still illustrates some basic metrics.
One execution on warm cache is between 3.9s and 4.7s. One execution with the data coming from OS disk cache is 11.8s. One execution with the data coming from 2 SSDs is 22s. Five concurrent executions from warm cache are 17.3s for the fastest and 20.5s for the slowest. A single threaded execution from warm cache is 58.4s.
We see that scaling is linear; i.e., 5 times the work takes a little under 5x longer. The parallelism is reasonable, with a 14.6x speedup from 24 threads on 12 cores. Splitting the work into 48 software threads, time-sliced on 24 hardware threads, does not affect execution time. The work thus appears to be evenly spread over the threads.
It may be interesting to see how much data is transferred. To see the space consumption per column --
SELECT TOP 20 *
FROM sys_index_space_stats
ORDER BY iss_pages DESC ;
-- followed by --
SELECT coi_column,
SUM (coi_pages) / 128
FROM sys_col_info
GROUP BY coi_column
ORDER BY 2 DESC ;
-- gives us the following --
L_COMMENT 17893
PS_COMMENT 10371
O_COMMENT 7824
L_EXTENDEDPRICE 4771
O_CLERK 2744
L_PARTKEY 2432
L_SUPPKEY 2432
L_COMMITDATE 1784
L_SHIPDATE 1551
L_RECEIPTDATE 1537
O_TOTALPRICE 1181
C_COMMENT 1150
L_QUANTITY 960
O_ORDERKEY 736
P_NAME 729
PS_SUPPLYCOST 647
O_CUSTKEY 624
C_ADDRESS 427
L_DISCOUNT 424
L_TAX 419
L_SHIPINSTRUCT 412
L_SHIPMODE 410
L_LINENUMBER 394
L_RETURNFLAG 394
L_LINESTATUS 394
O_ORDERDATE 389
P_COMMENT 341
PS_SUPPKEY 323
P_TYPE 293
C_PHONE 274
L_ORDERKEY 268
PS_AVAILQTY 201
P_RETAILPRICE 161
C_ACCTBAL 123
O_ORDERPRIORITY 95
O_ORDERSTATUS 94
S_COMMENT 66
...
The total in allocated pages is 65.6 GB, of which 34.1 GB are accessed by the workload. The comment strings could be stream-compressed, bringing some speedup in load time due to less I/O. Also, l_extendedprice, a frequently accessed column, could be represented with 4 bytes instead of 8. The working set could thus be cut down to about 28 GB, which may offer some benefit at larger scales. At any rate, for system sizing, the space utilization report is very useful.
The query execution profile is as below, with comments inline. The profile here is obvious, but we show this as a guide to reading future profiles which will be more interesting.
{
time 2.6e-06% fanout 1 input 1 rows
time 1.7e-06% fanout 1 input 1 rows
{ fork
time 2.1e-06% fanout 1 input 1 rows
{ fork
The time xx% line above each operator is the actual percentage of execution time taken by it, followed by the count of rows of output per row of input, followed by the actual rows of input. The below produced 591M rows of output for one row of input --
time 34% fanout 5.91599e+08 input 1 rows
LINEITEM 5.9e+08 rows(.L_RETURNFLAG, .L_LINESTATUS, .L_DISCOUNT, .L_EXTENDEDPRICE, .L_QUANTITY, .L_TAX)
L_SHIPDATE <= <c 1998-09-02>
Below is the arithmetic of the query, followed by a sort (GROUP BY) operator.
After code:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: temp := artm 1 + .L_TAX
12: temp := artm temp * temp
16: BReturn 0
Most of the time is spent below, in the GROUP BY. We notice that each needed aggregation is done once, so the common subexpressions are correctly detected.
time 66% fanout 0 input 5.91599e+08 rows
Sort (.L_RETURNFLAG, .L_LINESTATUS) -> (inc, .L_DISCOUNT, .L_EXTENDEDPRICE, .L_QUANTITY, temp, temp)
}
time 4e-05% fanout 4 input 1 rows
group by read node
(.L_RETURNFLAG, .L_LINESTATUS, count_order, aggregate, sum_base_price, sum_qty, sum_charge, sum_disc_price)
time 6.2e-05% fanout 0 input 4 rows
The SUMs are divided by the COUNTs, and the rows are sorted.
Precode:
0: avg_qty := artm sum_qty / count_order
4: avg_price := artm sum_base_price / count_order
8: avg_disc := artm aggregate / count_order
12: BReturn 0
Sort (.L_RETURNFLAG, .L_LINESTATUS) -> (sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
}
time 1.2e-05% fanout 4 input 1 rows
Key from temp (.L_RETURNFLAG, .L_LINESTATUS, sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
The data is returned to the client.
time 4.4e-06% fanout 0 input 4 rows
Select (.L_RETURNFLAG, .L_LINESTATUS, sum_qty, sum_base_price, sum_disc_price, sum_charge, avg_qty, avg_price, avg_disc, count_order)
}
Elapsed time and CPU%.
3947 msec 2365% cpu, 6 rnd 5.99841e+08 seq 0% same seg 0% same pg
Compilation: 0 msec 0 reads 0% read 0 messages 0% clw
This output is produced by the following sequence on the iSQL command line --
SQL> SET blobs ON;
SQL> PROFILE ('SELECT .... FROM .....');
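As a usage example with a trivial query (any statement text can be substituted for the elided SELECT):
SQL> SET blobs ON;
SQL> PROFILE ('SELECT COUNT (*) FROM lineitem');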
We will next consider the CPU profile:
704173 33.0087 setp_chash_run
275039 12.8927 gb_aggregate
252751 11.8479 ce_dict_any_sets_decode
178304 8.3581 cha_cmp_2a
170819 8.0073 ce_dict_int64_sets_decode
127827 5.9920 chash_array_0
120994 5.6717 chash_array
60902 2.8548 ce_intd_any_range_lte
47865 2.2437 artm_mpy_double
38634 1.8110 ce_vec_int64_sets_decode
26411 1.2380 artm_sub_double
24794 1.1622 artm_add_double
13600 0.6375 cs_decode
For hardcore aficionados, the code may be found in the Virtuoso develop/7.x branch on github.com. The version is not exactly the same, but close enough for the parts above. artm_* is arithmetic on typed vectors. As pointed out before, the arithmetic is done with DOUBLEs, although users would prefer fixed point. There is, I believe, a MS SQL Server result that uses DOUBLEs, so using DOUBLEs would not disqualify a 100 GB TPC-H result.
The moral of the story is that an array-based aggregation, without the chash_array* and cha_cmp* functions and with only 1/3 of the setp_chash_run time, would save upwards of a second of real time. The setp_* and cha_* functions are aggregation; the ce_* functions are column decompression and filtering. The arithmetic is not high in the sample, but it could be sped up by 2-4x with SIMD, especially since AVX on Sandy Bridge and later does 4 DOUBLEs in a single instruction.
We note that the ce_filter_* functions would drop off if the table were stored in date order, as then the top-level index would show that all the values in the column matched, making it unnecessary to even read the l_shipdate column, except for the last part of the table. However, this is a marginal slice of the time even now.
We have demonstrated good load balance and passed the required common-subexpressions exam. The array-based GROUP BY trick is unused, but would save over 1s of real time, hence will be good value for only 100-200 lines of code.
Next we look at JOINs by hash and index. Q3 is a relatively straightforward example, so we will go over the basics of JOIN type (i.e., whether by index or hash) and JOIN order. This will also show some scheduling effects.
The definition is:
SELECT TOP 10 l_orderkey,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
o_orderdate,
o_shippriority
FROM customer,
orders,
lineitem
WHERE c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < CAST ('1995-03-15' AS DATE)
AND l_shipdate > CAST ('1995-03-15' AS DATE)
GROUP BY l_orderkey,
o_orderdate,
o_shippriority
ORDER BY revenue desc,
o_orderdate
The profile, comments inline, is:
{
time 6.7e-06% fanout 1 input 1 rows
Make a hash table with c_custkey for all customers with c_mktsegment 'BUILDING'. For a hash join build side, the time above the hash filler line is the time for making the hash table from the buffered rows. The time above the Sort ... hf ... line is the time for buffering the rows that go into the hash table. The other times in the hash filler block are for the operators that produce the data, and are not related to making the hash table.
time 0.72% fanout 1 input 1 rows
{ hash filler
We see that the actual cardinality of customer is close to what was predicted. The actual number is on the line with the time; the predicted number is on the line with the index name, CUSTOMER.
time 0.63% fanout 3.00019e+06 input 1 rows
CUSTOMER 3e+06 rows(.C_CUSTKEY)
C_MKTSEGMENT =
time 0.089% fanout 0 input 3.00019e+06 rows
Sort hf 34 (.C_CUSTKEY)
}
time 9.2e-06% fanout 1 input 1 rows
{ fork
time 5.2e-06% fanout 1 input 1 rows
{ fork
The below is a merge of a scan of orders and a hash join to customer. The orders table is scanned, first reading o_orderdate and o_custkey, on which there are selections. The o_orderdate condition is a range check that is true of about 1/2 of the rows. The other condition is an invisible hash join against the customer hash table built above; this selects 1/5 of the rows on average. So we see that for a total of 150M orders, the fanout is 14.5M, about 1/10, as predicted. The 6.1e7 rows on the line with ORDERS represents the estimate based on the o_orderdate condition alone. The card 0.2 on the hash filter line is the prediction for the hash join selectivity.
We note that since no order has more than one customer, the JOIN is always cardinality-restricting, hence can be merged into a scan. Being merged into a scan, it becomes run-time re-orderable with the condition on o_orderdate. The conditions are evaluated and arranged at run time in the order of rows eliminated per unit of time.
The expression "hash partition + bloom" means that the hash join could be partitioned if the hash table did not fit in memory; i.e., there could be several passes over the data. This is not the case here, nor is it generally desirable. The bloom means that the hash is pre-filtered with a Bloom filter, which we will see in the CPU profile.
time 30% fanout 1.45679e+07 input 1 rows
ORDERS 6.1e+07 rows(.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY)
O_ORDERDATE < <c 1995-03-15>
hash partition+bloom by 41 (tmp)hash join merged always card 0.2 -> ()
The below is the hash join operator that in fact was merged into the table scan above.
time 0.0016% fanout 1 input 1.45679e+07 rows
Hash source 34 merged into ts 0.2 rows(.O_CUSTKEY) -> ()
Below is the index-based access to lineitem. This is a de facto merge join, since the o_orderkeys are generated in order by the scan. One in 10 l_orderkeys is selected. Each of these has an average of 4 lineitems; of these 4, the cost model predicts that 2.7 will be selected based on the additional condition on l_shipdate. The actual number of rows matched is in fact much lower, since the date selection is heavily anti-correlated with the date selection on orders. In other words, an order tends to be shipped soon after its orderdate.
time 16% fanout 0.20508 input 1.45679e+07 rows
LINEITEM 2.6 rows(.L_ORDERKEY, .L_EXTENDEDPRICE, .L_DISCOUNT)
inlined L_ORDERKEY = .O_ORDERKEY L_SHIPDATE >
After code:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
The query has a GROUP BY that includes the high-cardinality column l_orderkey, with 150M distinct values. The GROUP BY is therefore partitioned. This means that the previous part of the query is run on multiple threads, so that each thread gets an approximately equal number of rows of orders. For the GROUP BY, the threads pass each other chunks of data so that each grouping key can only end up in one partition. This means that at the end of the GROUP BY, there are multiple hash tables with grouping results that are guaranteed non-overlapping; hence there is no need to add up (re-aggregate) the per-thread results. The Stage operator passes data between the threads; this is also known as an exchange operator.
time 1% fanout 1 input 2.98758e+06 rows
Stage 2
time 1.4% fanout 0 input 2.98758e+06 rows
Sort (q_.L_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY) -> (temp)
}
time 0.4% fanout 1.13104e+06 input 1 rows
group by read node
(.L_ORDERKEY, .O_ORDERDATE, .O_SHIPPRIORITY, revenue)in each partition slice
time 0.36% fanout 0 input 1.13104e+06 rows
Sort (revenue, .O_ORDERDATE) -> (.L_ORDERKEY, .O_SHIPPRIORITY)
}
time 3.1e-05% fanout 10 input 1 rows
top order by read (.L_ORDERKEY, revenue, .O_ORDERDATE, .O_SHIPPRIORITY)
time 6.2e-06% fanout 0 input 10 rows
Select (.L_ORDERKEY, revenue, .O_ORDERDATE, .O_SHIPPRIORITY)
}
1189 msec 2042% cpu, 1.45513e+07 rnd 2.08588e+08 seq 98.9053% same seg 0.952364% same pg
Compilation: 1 msec 0 reads 0% read 0 messages 0% clw
This query also illustrates the meaning of the random and sequential access meters in the profile: For 1/10 of the orders, there is a random lookup from lineitem, hence 14.5M random lookups. The sequential scan is 150M rows of orders, plus an average of 3 extra rows for each of the 14M random accesses of lineitem. The locality metric, 98.9% same segment, means that the JOIN has a merge-join pattern, since 99% of lookups fall in the same segment as the previous one. A segment is a column store structure that, in the case of lineitem, corresponds to about 4500 consecutive rows.
This is one of the queries where storing the data in date order would be advantageous. A zone map on the date would eliminate the second half of the orders without even looking at the columns. A zone map is a summary data structure that keeps, for example, a minimum and a maximum value of an attribute for a range of consecutive rows. Also, for all but the lineitems at the end of the range of orders, a zone map would disqualify the items without looking at the column. VectorWise, for example, profits from this. However, the CPU profile below shows that the time spent in date compares is not very long even now.
On further analysis, we see that the query is run in the order of o_orderkey, so that each o_orderkey is seen once. Hence the partitioned GROUP BY can be changed into an ordered GROUP BY, as all the grouping columns are functionally dependent on o_orderkey. An ordered GROUP BY is more efficient than a partitioned or re-aggregated one, since it does not have to remember grouping keys: Once a new key comes in, the previous key will not be seen again, and the aggregation for it can be sent onwards in the pipeline.
However, this last transformation has little effect here, as the count of rows passing to the aggregation is small. Use of ordered aggregation has much higher impact in other queries and will be visited there. There is also a chance for late projection, as the o_shippriority is in fact only needed for the top 10 rows returned. The impact is small in this case, though. This too will be visited later.
We now consider the CPU profile:
93087 20.9350 cha_inline_1i_n
63224 14.2189 cha_bloom_unroll
30162 6.7834 cha_insert_1i_n
29178 6.5621 ce_search_rld
28519 6.4139 ce_intd_range_ltgt
17319 3.8950 cs_decode
15848 3.5642 ce_intd_sets_ltgt
11731 2.6383 ce_skip_bits_2
8212 1.8469 ce_vec_int_sets_decode
7946 1.7870 itc_single_row_opt
7886 1.7735 ce_intd_any_sets_decode
7072 1.5905 itc_fetch_col_vec
6474 1.4560 setp_chash_run
5304 1.1929 itc_ce_value_offset
5263 1.1836 itc_col_seg
The top 3 items are for the orders x customer hash join -- the top 2 for the probe, and the 3rd for the build. The 4th item is the index lookup on lineitem. The one below that is the date condition on orders; below this is the condition on the date of lineitem.
The functions working on a compressed column are usually called ce_<compression type>_<sets or range>_<filter or decode>. ce means compression entry; the compression types are rl (run length), rld (run length with delta), bits (densely ascending values as a bitmap), intd (16-bit deltas on a base), and dict (dictionary). The sets vs. range part determines whether the operation works on a range of contiguous values in the entry (range), or takes a vector of row numbers as context (sets); the first predicate works on a range, and the next one works on the sets (row numbers) selected by the previous one. Filter means selection, and decode means extracting a value for processing by a downstream operator.
We run 5 of these concurrently: the fastest returns in 2.8s, the slowest in 5.4s. The executions are staggered. Each query divides into up to 24 independent fragments, which are then multiplexed on 48 worker threads, with each fragment guaranteed at least one thread. The slices of the first query are prioritized, so that when a worker thread has a choice of next unit of work, it will prefer one from an older queue; each query in this setting has one queue of independently executable fragments. Thus the first query to come in gets the most threads and finishes sooner. The rationale for this is that a query may have large transient memory consumption, e.g., for GROUP BYs or hash join build sides. The sooner such a query finishes, the less likely it is that there will be many concurrent queries with high peak-memory demand. This does not block short queries, since in any case a runnable query will have at least one thread, which will get scheduled by the OS from time to time.
The balance is that unused tricks (ordered aggregation, late projection) would gain little. Date order would gain about 0.4s from 1.3s, but would lose in other queries.
We have treated Q1 and Q3 at some length in order to introduce the reading of query profiles and the meaning of some meters. For the handful of people who are deep into this sport, the information is rather obvious, but it will still give an idea of the specific feature mix of Virtuoso. Column stores are similar up to a point, but not all make exactly the same choices.
If you are a developer, what, if anything, should you remember of this? Never mind the finesses of column store science -- if you understand join order and join type, then there is the possibility of understanding why some queries are fast and some slow. Most support questions are about this. If you know what the DBMS does or should do, you are in control. This is why the metrics and concepts here are also of some interest outside the very small group that actually builds DBMSs.
With the coming of age of the Virtuoso column store, which makes it a strong contender for SQL warehousing, the SQL federation aspect is also revitalized.
In the previous article, we saw that Virtuoso can load files at well over gigabit-ethernet wire speed. The same of course applies to SQL federation. We can copy the 100 GB TPC-H dataset between two Virtuoso instances in only slightly more time than it takes to load the data from files. In a network situation, the network is likely to be the slowest link when extracting data from other SQL stores into Virtuoso. So, to be "semantically elastic," federating has become warehousing. The articles to follow will show excellent query speed for analytics. The combination of this with connectivity to any existing SQL infrastructure makes Virtuoso an easy-to-deploy accelerator cache for almost any data integration situation. This in fact also simplifies query execution, because the more data one can have locally, the more query optimization choices there are, and performance becomes much more predictable than in situations where queries execute across many heterogeneous systems. The win is compounded by reducing the load on the line-of-business databases. The missing link then becomes heterogeneous log shipping. One can usually not modify a line-of-business system; for example, adding triggers for tracking changes is generally not done. Being able to read the transaction logs of all the most common DBMSs would offer a solution.
The barrier to having one's own extract of data for analysis has become much lower. Even the ETL step can be easily streamlined by the SQL federation. For very time-sensitive applications, one can always keep a local copy of a history in a union with the most recent data accessed from the line-of-business system. At the end of the TPC-H series, we will show examples of a near real-time analytics system that keeps up to date with an Oracle database.
For RDF users, this means we have the capacity to extract RDF at bulk load speed from any relational source, whether local or remote. For the test system discussed in the TPC-H series, RDF load shows a sustained throughput of around 320K triples per second. This means that an RDF materialization of the 100 GB TPC-H dataset, about 12.5 billion triples, is done in under 11 hours. This is a vast improvement over the present, and we will show the details in a forthcoming article.
The server configuration is the virtuoso.ini discussed in the previous post. The schema is created by loading the file schema.sql, attached. All the tables are stored column-wise. The file contains declarations for hash partitioning in a cluster, but these have no effect in the single-server case. The file tables are declared in ldschema.sql and bound to files in ldfile.sql. The refresh functions are in rf.sql.
The source data is created with the dbgen utility. One file is generated per table.
Twelve refresh datasets are created in order to do the prescribed two runs; each consists of one power test, and one five-stream throughput test. Five streams is the minimum for the 100 GB scale.
The bulk load script, ld.sql, specifies the CSV files from which the data is loaded as file tables. The load command is simply --
log_enable (2);
INSERT INTO lineitem
SELECT *
FROM lineitem_f
;
The log_enable (2) turns off transaction logging and enables non-transactional inserts. The lineitem table is a column-wise stored database table; lineitem_f is a table view on the lineitem.tbl CSV file. The load script launches one statement like the above for each table, all in parallel, and then waits for their completion. It then makes an explicit CHECKPOINT to make the data durable. No foreign keys are declared; hence the load does not have to occur in any particular order. Each file is loaded in 24 parallel chunks; the file table facility splits the scan automatically into as many chunks as are specified by ThreadsPerQuery in the ini file.
The last of the load statements, the one for lineitem, completes in 849s of real time. At this point, the data is loaded, and the database is ready for query. There are 3.4M dirty buffers yet to be flushed before the database state is durable. Thus, we must include the checkpoint time in the load result, which adds another 169s. The total load time is hence 16m58s.
By the TPC-H rules, the timed portion of the load must include any gathering of database statistics. We do not do any; rather, the queries derive any needed statistics by sampling at run time.
The bulk load has a sustained read rate around 120 MB/s from the source files. The average rate of writing is 60 MB/s. The writing continues long after the read has finished, so we have a truly I/O-bound situation. This can be improved by adding more SSDs. The CPU profile shows a possible gain of around 10%. Thus, with a better I/O system and some more optimization, a load time of about 11m should be possible with this CPU/memory configuration.
TPC-H specifies two data refresh operations: one inserting 1/1000th of the orders/lineitem combination, and another deleting the same. The rules leave the implementation largely open; they only specify that an order and its lineitems must be inserted or deleted within the same transaction.
Most implementations bulk-load a staging table, and then do an INSERT ... SELECT statement for the INSERT, or a DELETE ... WHERE ... IN (SELECT ...) for the DELETE.
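As a sketch of that conventional approach -- with hypothetical staging tables orders_staging, lineitem_staging, and delete_staging; this is not the Virtuoso implementation, which follows below:
-- RF1: insert the new orders and lineitems from staging tables
INSERT INTO orders   SELECT * FROM orders_staging;
INSERT INTO lineitem SELECT * FROM lineitem_staging;
-- RF2: delete by the orderkeys listed in the delete set
DELETE FROM lineitem WHERE l_orderkey IN (SELECT d_orderkey FROM delete_staging);
DELETE FROM orders   WHERE o_orderkey IN (SELECT d_orderkey FROM delete_staging);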
In Virtuoso, the refreshes are implemented as SQL procedures that read file tables, as follows:
CREATE PROCEDURE rf1
( IN dir VARCHAR ,
IN nth INT ,
IN no_pk INT := 0 ,
IN rb INT := 0 ,
IN qp INT := NULL
)
{
INSERT INTO orders
SELECT *
FROM orders_f
TABLE OPTION
( FROM sprintf ('%s/orders.tbl.u%d', dir, nth)
)
;
INSERT INTO lineitem
SELECT *
FROM lineitem_f
TABLE OPTION
( FROM sprintf ('%s/lineitem.tbl.u%d', dir, nth)
)
;
COMMIT WORK
;
}
CREATE PROCEDURE del_batch
( IN d_orderkey INT
)
{
VECTORED;
DELETE
FROM lineitem
WHERE l_orderkey = d_orderkey
;
DELETE
FROM orders
WHERE o_orderkey = d_orderkey
;
}
CREATE PROCEDURE rf2
( IN dir VARCHAR ,
IN nth INT
)
{
DECLARE cnt INT
;
cnt :=
( SELECT COUNT (del_batch (d_orderkey) )
FROM delete_f
TABLE OPTION
( FROM sprintf ( '%s/delete.%d', dir, nth ) )
)
;
COMMIT WORK
;
RETURN cnt
;
}
Things can hardly be simpler. The delete uses a vectored stored procedure to delete a batch of rows in one statement. The parallelization is done automatically, dividing the file table being read into equal size chunks.
The performance is as follows. These refreshes are run on the database right after the bulk load.
-- Line 33: rf1 ('/1s1/tpch100src', 1)
Done. -- 3909 msec.
-- Line 34: rf2 ('/1s1/tpch100src', 1)
Done. -- 863 msec.
-- Line 36: rf1 ('/1s1/tpch100src', 2)
Done. -- 2269 msec.
-- Line 37: rf2 ('/1s1/tpch100src', 2)
Done. -- 871 msec.
-- Line 40: rf1 ('/1s1/tpch100src', 3)
Done. -- 2315 msec.
-- Line 41: rf2 ('/1s1/tpch100src', 3)
Done. -- 906 msec.
-- Line 43: rf1 ('/1s1/tpch100src', 4)
Done. -- 2337 msec.
-- Line 44: rf2 ('/1s1/tpch100src', 4)
Done. -- 913 msec.
-- Line 46: rf1 ('/1s1/tpch100src', 5)
Done. -- 2429 msec.
-- Line 47: rf2 ('/1s1/tpch100src', 5)
Done. -- 1970 msec.
-- Line 49: rf1 ('/1s1/tpch100src', 6)
Done. -- 2467 msec.
-- Line 50: rf2 ('/1s1/tpch100src', 6)
Done. -- 888 msec.
The performance is very competitive. Some improvement remains possible but load and refresh are already strong.
In the next installment we will look at some queries and explain how to interpret query plans and profiles.
The relevant sections of the virtuoso.ini
file are below, with commentary inline. The actual ini file has many more settings but these do not influence the benchmark.
The test file system layout has two SSD file systems, mounted on /1s1
and /1s2
. The database is striped across the two file systems.
[Database]
DatabaseFile = virtuoso.db
TransactionFile = /1s2/dbs/virtuoso.trx
Striping = 1
This sets the log to be on the second SSD, and the database to be striped; the files are declared in the [Striping]
section further below.
[TempDatabase]
DatabaseFile = virtuoso.tdb
TransactionFile = virtuoso.ttr
[Parameters]
ServerPort = 1209
ServerThreads = 100
CheckpointInterval = 0
NumberOfBuffers = 8000000
MaxDirtyBuffers = 1000000
The thread count is set to 100. This is not significant, since the test will only have a few concurrent connections, but this should be at least as high as the number of concurrent user connections expected.
The 100 GB TPC-H working set is about 38 GB for the queries. The full database is about 80 GB. Eight million buffers at 8 KB each means that up to 64 GB of database pages will be resident in memory. This should be set higher than the expected working set if possible, but the database process size should also not exceed 80% of physical memory.
The max dirty buffers limit is set to a small fraction of the total buffers for faster bulk load. The bulk load is limited by writing to secondary storage, so we want the writing to start early, and continue through the bulk load. Otherwise the checkpoint at the end of the bulk load would be oversized, because of high numbers of un-flushed buffers.
The checkpoint interval is set to 0, meaning no automatic checkpoints. There will be one at the end of the bulk load, as required by the rules, but the rules do not require checkpoints for the refresh functions.
ColumnStore = 1
This sets all tables to be created column-wise. No special DDL directives are needed for column store operation.
MaxCheckpointRemap = 2500000
DefaultIsolation = 2
The default isolation is set to READ COMMITTED
. Running large queries with locking on reads would have a very high overhead.
DirsAllowed = /
TransactionAfterImageLimit = 1500000000
This is set to an arbitrarily high number. The measure is the count of bytes to be written to log at commit (1.5 GB, here). If the amount of data to be logged exceeds this, the transaction aborts. The RF1 transaction at 100 GB scale will log about 100 MB.
FDsPerFile = 4
MaxMemPoolSize = 40000000
This is the maximum number of bytes of transient memory to be used for query optimization (40 MB, here). The number is adequate for TPC-H, since the queries only have a few joins each. For RDF workloads, the number should be higher, since there are more joins.
AdjustVectorSize = 0
The workload will run at the default vector size. Index operations can be accelerated by switching to a larger vector size, trading memory for locality. But since this workload is mostly by hash join, there is no benefit in changing this.
ThreadsPerQuery = 24
Each query is divided into up to 24 parallel fragments. 24 is the number of threads on the test system.
AsyncQueueMaxThreads = 48
Queries are run by a pool of 48 worker threads. Each session has one thread of its own. If a query parallelizes, the first fragment runs on the session's thread and the remaining fragments run on threads from this pool. Thus the core threads are oversubscribed by a factor of slightly over 2 in the throughput run: 6 sessions plus 48 pool threads makes up to 54 runnable threads at any point in the throughput test.
MaxQueryMem = 30G
This is a cap on query execution memory. If memory would exceed this, optimizations that would increase space consumption are not used. The memory may still transiently exceed this limit.
HashJoinSpace = 30G
This is the maximum memory to be used for hash tables during hash joins. If a hash join causes this amount to be exceeded, it will be run in multiple passes, so as to have a cap on the hash table size. Not all hash joins may be partitioned, and the test must not do multi-pass hash joins, hence a high number here. We will see actual space consumption figures when looking at the queries. This parameter may be increased for analytics performance, especially in multiuser situations.
[Client]
SQL_QUERY_TIMEOUT = 0
SQL_TXN_TIMEOUT = 0
SQL_ROWSET_SIZE = 10
SQL_PREFETCH_BYTES = 120000
120 KB of results is to be sent to clients in a single window. This is enough for the relatively short result sets in this benchmark.
[Striping]
Segment1 = 1024, /1s1/dbs/tpch100cp-1.db = q1, /1s2/dbs/tpch100cp-2.db = q2
The database is set to stripe in two files, each on a different SSD. Each file has its own background I/O thread; this is the meaning of the = q1
and = q2
declaration. All files on each separately-seekable device should share the same q
.
[Flags]
enable_mt_txn = 1
enable_mt_transact = 1
The first setting enables multithreaded DML statement execution. The second setting enables multithreading of COMMIT and ROLLBACK operations. This is important for refresh function performance; a column store COMMIT of a DELETE especially benefits from multithreading, since it may involve re-compression.
hash_join_enable = 2
This enables hash joins for SQL and SPARQL (even though SPARQL is not used in this experiment).
dbf_explain_level = 0
Specifies less verbose query plan formatting for logging of query execution.
dbf_log_fsync = 1
Specifies that fsync is to be called after each write to the transaction log. The ACID qualification procedure specifies that the system is to be powered down in mid-run; hence this setting is required by the test.
Tables do not always have to be stored in the order of their primary key. A clustered index is a structure where the index tree leaves contain the whole row of data. TPC-H explicitly forbids materializing multiple copies of a table in different sort orders for speeding up different queries. That said, an implementation may pick one primary order for each table which does not have to be the primary key order.
The present champion in core-for-core speed, Actian VectorWise, organizes the main tables, lineitem and orders, in date order. The Microsoft SQL Server implementations use clustered indices on lineitem and orders, where the table is stored in non-primary-key order; in this case, the major ordering columns are l_shipdate and o_orderdate, respectively.
These index schemes create locality on the date dimension.
In the present discussion, we take a different tack: We keep lineitem and orders in primary key order, and sacrifice locality on date in favor of a fast merge join between these two tables, and of faster load and data maintenance.
TPC-H rules allow indices on foreign keys and dates. In this implementation, we only define one, on o_custkey; lineitem is only indexed on its primary key. These are the two largest tables, and the only ones that change during the benchmark. Whether indices are defined on the other tables makes little or no difference, since joins between these tend to perform better by hash than by index in any case.
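In DDL terms this amounts to a single extra index declaration; a sketch, reusing the o_ck name mentioned in the Q13 discussion:
-- orders and lineitem keep their primary-key order; the only secondary index:
CREATE INDEX o_ck ON orders (o_custkey);
-- lineitem is indexed only on its primary key (l_orderkey, l_linenumber)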
All the runs are done at 100 GB scale. The test machine is a dual E5-2630 (2x6 cores, 2x12 threads, 2.3 GHz) with 192 GB of 1066 MHz RAM. Two Crucial 512 GB SSDs are used both for database and staging of the files. The disks are independent file systems with no RAID. (Note that RAID would be required for an official result.) The operating system is CentOS 6.2.
For cluster results, two machines with the above spec are used with a QDR InfiniBand interconnect.
The Virtuoso used is a current internal development version, to be available at the end of this series or by special request. This is neither the publicly available open nor closed source version.
By now, TPC-H is an old game and it is safe to say that pretty much any player in the analytics database domain has had a go at it, even though some have never published a result. So, the bar for new entrants is very high.
Especially, VectorWise and EXASolution have taken performance in this workload close to the limits of the achievable. A challenger has to do everything right in order to win. One wrong move will lose the whole race.
This presentation has many objectives:
To illustrate how Virtuoso is an excellent SQL analytics engine
To provide an in-depth discussion on the science of query optimization and execution
To outline avenues of future development, specifically as concerns analytics with schema-less data
In the TPC TC workshop at VLDB 2013 there was a paper TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark, by Peter Boncz, Thomas Neumann, and myself concerning what the database world has learned from this very tough exercise. Peter Boncz is the original architect of Actian VectorWise, the current champion in TPC-H performance per core. Thomas Neumann is the author of HyPer, most likely the best entry in DBMS research for simultaneously supporting analytics and OLTP. Peter and Thomas are among the most renowned in database science. I am the Program Manager of the Virtuoso column store, overseeing core engineering tasks such as SQL query optimization, execution, storage, and scale out.
In this series I will go over the Virtuoso implementation of TPC-H and will elaborate further on the points discussed in the paper. The subject is broader than any single paper can cover in detail, although there are plenty of papers only addressing one or two of the 22 queries.
Virtuoso is mostly known for RDF. Here we will cover the whole benchmark in SQL first, with both single-server and cluster implementations, and discussion of where these differ. A state-of-the-art SQL implementation is the necessary basis for discussing how the same can be accomplished in RDF. Comparing good RDF to bad SQL is not interesting.
The earlier articles on the Star Schema Benchmark (SSB) (PDF) -- Annuit Coeptis, or, Star Schema and The Cost of Freedom and E Pluribus Unum, or, Star Schema Meets Cluster -- demonstrated how the most basic analytical database operations perform in Virtuoso. All the techniques used there are also directly applicable to TPC-H, but the latter adds a good 20 more tricks one needs to see through.
Future installments will discuss TPC-H query by query. We conclude with a full run of OSDL-DBT-3. DBT-3™ is an unofficial TPC-H without auditing but with the same workload.
One could claim that a 30GB scale is trivial. Here, we multiply the scale by 10 and move from one to two servers, going from shared-memory multicore to distributed-memory scale-out with partitioned data, i.e. each machine holds a distinct fragment of the database.
Again, we run the same workload in SQL and in SPARQL. The RDB schema is the same as before, so for SQL tables only primary keys are declared, there are no indices or special declarations about data placement, and the table partitioning is on the first part of the primary key. The RDF data for the SPARQL runs is quads with the default index scheme, partitioned on subject or object, whichever is first in key order. The test system is two machines, each with dual Xeon E5-2630 and 192GB RAM. The queries are from warm cache.
There are 1.79 billion line-order rows, giving a total of 32.5 billion RDF triples.
Query | Virtuoso SQL (seconds) | Virtuoso SPARQL (seconds) | SPARQL-to-SQL ratio |
---|---|---|---|
Q1 | 2.285 | 7.767 | 3.4x |
Q2 | 1.530 | 3.535 | 2.3x |
Q3 | 1.237 | 1.457 | 1.2x |
Q4 | 3.459 | 6.978 | 2.0x |
Q5 | 3.065 | 8.710 | 2.8x |
Q6 | 2.901 | 8.454 | 2.9x |
Q7 | 5.733 | 15.939 | 2.8x |
Q8 | 2.267 | 6.759 | 3.0x |
Q9 | 1.773 | 4.217 | 2.4x |
Q10 | 1.440 | 4.342 | 3.0x |
Q11 | 5.031 | 12.608 | 2.5x |
Q12 | 4.464 | 15.497 | 3.5x |
Q13 | 2.807 | 4.467 | 1.6x |
Total | 37.992 | 100.730 | 2.7x |
The SQL run takes 38s, and the SPARQL run takes 101s. Comparing to the single server, we see a little better than linear scaling, i.e., double the gear is a little over 2x faster. In the present case, 10x the data takes a little under 5x the time.
This is due to slightly better load balancing. The single server splits the workload into chunks based on a run-time approximation; the cluster stores the partitions separately and uses this to determine the parallelism. In the latter case, the chunks are of more equal size.
The SPARQL penalty is here 2.7x, essentially the same as in the single server case.
It is no secret that a star schema is "embarrassingly parallel." In other words, when there is one big table (fact table) that references many smaller tables (dimension tables), and query conditions are expressed on properties of the dimension tables, the correct query plan nearly always consists of putting the interesting foreign key values into hash tables, and then scanning the fact table from beginning to end and picking the rows where the values in the foreign key columns are found in the hash tables. This is called a selective hash join or invisible hash join if the hash join operation is merged in the table scan itself. Daniel Abadi’s well known thesis explains this matter. In the case of a cluster, supposing all tables are partitioned, identical hash tables are made on all participating servers, and after this each server gets to scan its fraction of the fact table independently of any other. This nearly always works because the dimension tables are typically orders of magnitude smaller than the fact table.
The complexity of the queries is close to linear to the data size. The factor that makes this deviate from linear is the fact that as hash tables get larger, they will miss the CPU cache more frequently; hence they get slower to probe. We here assume that a bigger fact table means bigger dimension tables. This is often the case, i.e., the more sales records there are, the larger the number of distinct customers or distinct items in the catalogue is likely to be. This is not always so, though, as the number of days in the history does not scale in the same way.
Any decent analytics oriented RDBMS with scale out will give near-linear performance with a star schema, at least up to the point where the hash tables can no longer be replicated on all servers, or the hash join must do multiple passes over the fact table.
Doing the same with schema-less data is harder, even though the principle is exactly the same. The difficulty lies again in detecting that a scan filtered by selective hash joins will give the most locality in access pattern, visiting each cache-line-worth of data once at most and doing so almost always in sequential order. This triggers memory prefetching on any modern CPU and significantly reduces memory latency.
As the scale grows there are some details of query plan that become significant. For example, in Q7 of SSB, there is a report on sales between Asian customers and Asian suppliers, country by country. In SQL:
SELECT d_year, c_nation, s_nation, SUM (...), ...
FROM lineorder, customer, supplier, dwdate
WHERE lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND c_region = 'ASIA'
AND s_region = 'ASIA' ...
In this case, one builds a hash table from c_custkey to c_nation, and another from s_suppkey to s_nation, including only the customers and suppliers where the region = 'ASIA'.
In SPARQL, the situation is the same, except that we say:
SELECT *
WHERE
{
  ?li rdfh:lo_custkey ?lo_custkey ;
      rdfh:lo_suppkey ?lo_suppkey .
  ?lo_custkey rdfh:c_nation ?c_nation ;
              rdfh:c_region "ASIA" .
  ?lo_suppkey rdfh:s_nation ?s_nation ;
              rdfh:s_region "ASIA" .
}
This is an equivalent expression, or near enough. We note that now it makes sense to build the hash table from the join of two patterns, i.e., the rdfh:s_nation and rdfh:s_region patterns, which together are the same as the supplier table in the SQL variant. But now, a supplier might have more than one rdfh:s_nation triple; hence the hash table is no longer known to be a priori unique -- i.e., its key is the URI of a supplier, but it is only known that suppliers usually have one nation; it is not known that they never have two.
However, if we build the hash table from only one pattern, i.e., { ?s_suppkey rdfh:s_region "ASIA" }, and the graph is specified, then we know that the subject will be unique, as stating that X has region "ASIA" twice has no effect beyond stating it once. However, if we do this, then there needs to be another hash table built for mapping the supplier to its s_nation. There is no way to know that this is unique. Making two hash tables instead of one has been seen to slow down the query by a factor of two at the 300GB scale, whereas the effect is hardly noticeable at 30GB.
Thus recognizing these special cases with SPARQL is crucial if one is to come anywhere near the performance the SQL world attains just by following the schema. What used to be basic becomes trickier. It is true that the same or similar tricks are also needed in pure SQL workloads, but then not within the SSBM queries. What is a star in SQL is quite often a snowflake in SPARQL. A star schema has a single table per dimension, e.g., customer, whereas a snowflake has more structure, e.g., customer, the customer’s country, and the country’s region, all in different tables.
The message to the SPARQL public is that now, for the first time, if there is natural parallelism and locality in the data, you know that the database will exploit this correctly and derive the same benefits from this as any SQL-only system would. Of course, there have been and are parallel RDF databases, including previous versions of Virtuoso, which do distributed index-based operations with various levels of concurrency and distributed coordination, but experience shows that these query strategies are not very good for queries that touch a large fraction (over 3%) of the database.
For basic data warehouse workloads, whether in SQL or SPARQL, Virtuoso offers linear scaling where clusters pay off from the start. While a handful of relational column stores have offered such capabilities for SQL for some time, now the same is also available for schema-less data: Entirely declarative querying; no explicit data partitioning; no schema restrictions; no map-reduce programs or the like.
The cluster evolution path is clear: as with single server, we strive for more speed on a broader range of operations. The star and snowflake functionality discussed here is the core piece for any analytics, so if this is not right, the rest is also compromised. The other side of cluster is operational, i.e., flexibility of deployment -- for example, flexibly resizing cloud-based databases.
This is the qualitative jump. Incremental performance gains will follow for both SQL and SPARQL.
Stars and snowflakes are common but the world does not end with these. As we will see in subsequent articles, there is a whole world of graphs, e.g., social media analytics, as well as more complex relational schemas.
In order to meaningfully answer this question, one has to have a top-notch analytics engine. Comparing "schema" and "no schema" on anything except the state of the art in analytics databases is not interesting. So we made Virtuoso 7 and its Column Store Module, and implemented the features common in dedicated analytics databases, plus suitable SPARQL adaptations of the same.
The present beachhead is the Star Schema Benchmark (SSB), which represents the core of most data warehouse workloads (i.e., big scans, selective hash joins, and aggregation). These same patterns are also found at the core of TPC-H and the new TPC-DS.
At present, the cost of having no schema is a 2.7x increase in run time, as evidenced by SSB. Virtuoso SQL does the run in 8.4s, MonetDB SQL in 17s, and Virtuoso SPARQL in 22.5s. Virtuoso outperforms column store pioneer MonetDB by a fair margin. MonetDB is probably the fastest open source column store, although faster ones exist in closed source.
SPARQL in Virtuoso comes close behind SQL in MonetDB, only a factor of 1.3. MySQL with InnoDB, which is not an analytical database, does the run in 2391s.
Initially, we were aiming at a slowdown factor of 2 when comparing SPARQL to SQL on Virtuoso. Of course, this is only worth something if the SQL performance is generally on a level with relational column stores. These goals have now been substantially attained.
This is a major beachhead for RDF as a branch of graph database technology, and for Virtuoso as a product. For the first time one can run a real database workload based on SPARQL at a speed that is comparable with SQL, and without any compromises on the schema flexibility that is the principal reason to use an RDF-based graph model in the first place. No SPARQL-to-SQL mapping, property tables, or such.
The technical accomplishment is divided into two parts. For query execution, there is a compressed column store with vectored execution and a good implementation of query parallelization, hash join, and aggregation. With a schema, there is a multicolumn table, and with RDF, there is a quads table. Both are compressed based on the characteristics of actual data. The SQL execution consists of a scan of the fact table, taking a few columns of this and applying one or more hash joins to column values, where the hash joins are usually selective. This is the same in SPARQL, except that instead of getting another column of the table there is a self-join. The self-join has a fairly dense access pattern and is in the same order as the previous one, thus it is not very expensive. This constitutes the difference.
In SQL, the plan is obvious, especially since there are no indices on the fact table; getting the right plan with SPARQL, however, is quite difficult. There are up to 12 triple patterns in a query, leading to 12! (twelve factorial, i.e., 12*11*...*2*1 = 479,001,600) possible join orders, multiplied by index- and hash-based variants of each join, where the hash-based variants further multiply the space by considering different combinations of patterns on the build side. Experiments confirm that the same plan is best for both SQL and SPARQL, but getting this as the outcome of query optimization is far from self-evident. This requires a very precise cost model that correctly takes into account the ordering of intermediate results and the density of hits in index lookups, as well as the variability in hash join performance when the hash table size varies, i.e., CPU cache effects. All this modeling is of course also valid for SQL, but is not really required there because a much coarser model will also deliver the right plan, as there is much less choice. Further, making the right plan must be fast. At present, the longest query optimization time is 240ms for an execution time of 3.7s (Q12 at 30G in SPARQL). This is pretty good.
Using hash join in SPARQL has in any case been problematic, because knowing when to use one requires high confidence in the cost model. If one ends up building a large hash table that is used only a few times, there is a steep penalty. Index-based plans, especially since RDF data tends to be indexed in both S-to-O and O-to-S directions, do not have bad worst cases, but a good hash-based plan is easily 5x faster than an index-based one.
So, for SPARQL the results are game changing. Finally, an up-to-date database. For SQL, Virtuoso 7 performs like an analytics column store is supposed to, but then that is what it is. For the SQL space, it is interesting that this is also open source, so Virtuoso may well be the best-performing open source SQL analytics engine out there. We will later see how the comparison with other SQL column stores goes.
In summary, the central core of the Virtuoso 7 agenda has been accomplished. Incremental progress will continue around addressing more complex benchmarks like TPC-H and TPC-DS, both in SQL and in SPARQL translation, plus full-scale Open Streetmap in both SQL and SPARQL, and of course the benchmarks being developed in LDBC.
The scale of the dataset is 30GB, with 187M line-order rows. This comes to 3.2Gt (3.2 billion triples). The runs are from warm cache on all systems. The test system for all runs is a dual Xeon E5-2630 with 192GB RAM.
The scripts for duplicating the experiment on the Virtuoso 7 Open Source cut will be published later on this blog, when the open source cut incorporates the query optimization improvements discussed herein.
In the table below, the times are elapsed real times in milliseconds, with one query at a time, all from warm cache (i.e., the second run of the query set is reported, and we make sure that all databases are running from memory). MySQL is configured with InnoDB and a 40GB buffer pool, which should be enough for the 30GB dataset. The SQL versions do not declare any explicit indices, but do declare primary keys. The SPARQL version uses the default Virtuoso index scheme, but the query plans end up scanning each predicate in order of S, except for the build phases of hash joins, where an index from O to S is occasionally used.
Query | Virtuoso SQL (ms) | Virtuoso SPARQL (ms) | MonetDB SQL (ms) | MySQL SQL (ms) |
---|---|---|---|---|
Q1 | 413 | 1101 | 1659 | 82477 |
Q2 | 282 | 416 | 500 | 74436 |
Q3 | 253 | 295 | 494 | 75411 |
Q4 | 828 | 2484 | 958 | 226604 |
Q5 | 837 | 1915 | 648 | 222782 |
Q6 | 419 | 1813 | 541 | 219656 |
Q7 | 1062 | 2330 | 5658 | 237730 |
Q8 | 617 | 2182 | 521 | 194918 |
Q9 | 547 | 1290 | 381 | 186112 |
Q10 | 499 | 639 | 370 | 186123 |
Q11 | 1132 | 2142 | 2760 | 241045 |
Q12 | 863 | 3770 | 2127 | 241439 |
Q13 | 653 | 1612 | 1005 | 202817 |
One could argue that comparing against MySQL is unjustified, as MySQL is certainly not optimized for this workload, e.g., it does not do hash joins and does not parallelize queries. On the other hand, the previous work on SSBM for SPARQL, No Size Fits All by Benedikt Kaempgen and Andreas Harth, published at the 2013 ESWC, did make the comparison between MySQL and Virtuoso 6; thus we think it informative to include MySQL. To summarize No Size Fits All, SPARQL in Virtuoso 6 lost by a factor of 12 against MySQL, but in the present case, Virtuoso 7 SPARQL wins by a factor of 106 against MySQL.
The ESWC paper used a scale of 1G, while the present test uses a scale of 30G. One should remember that the MySQL times are single-threaded, and all other times are multi-threaded. The test system has 12 cores and 24 threads, so running at full platform utilization is at best 16x faster than single-threaded. Running SPARQL single-threaded instead of with 24 threads per query gives a total time of 175s, still over 10x better than MySQL. Compared to the multi-threaded time of 22.5s, the parallelism yields an average acceleration of 7.7x. Running SPARQL with full threading but no hash join gives a time of 80s, 3.5x worse than with hash join. We note that SSB is a very hash-join-intensive workload.
MonetDB is the more relevant comparison, as it does use the full CPU, with load peaks up to the theoretical 2400% (12 dual-threaded cores), and it is the platform on which a lot of the science of the hash join was refined. MonetDB does relatively best with queries that can start with a very selective join (e.g., Q9), where it outperforms Virtuoso. This is probably due to more even splitting of the work among threads and to not having to deal with data compression. Virtuoso wins the most on queries that select a large fraction of the fact table, where MonetDB is penalized due to its policy of full materialization of intermediate results. Joins without any selection (e.g., adding up lo_extendedprice and grouping by the d_year of lo_orderdate) show MonetDB at its worst. Such queries do not occur in SSB, though, so SSB is a relatively MonetDB-friendly benchmark.
The back of the problem is broken. Big queries without a schema can be done. One could of course see this coming by experimenting with explicit plans, but nobody out there will manually optimize a query plan. Thus the final step consisted of having a good-enough cost model and a smart-enough search order to get the right plan fast. This will be in the next update of Virtuoso Open Source. At that time, we will publish the full queries and configuration files.
This is the breakthrough for RDF analytics. Incremental progress will follow, with more tricks being incorporated, like the ones known to be needed by TPC-H.
Marko opened the panel by looking at the Google Trends search statistics for big data, semantics, business intelligence, data mining, and other such terms. Big data keeps climbing its hype-cycle hill, now above semantics and most of the other terms. But what do these in fact mean? In the leading books about big data, the word semantics does not occur.
I will first recap my 5 minute intro, and then summarize some questions and answers. This is from memory and is in no sense a full transcript.
Over the years we have maintained that what the RDF community most needs is a good database. Indeed, RDF is relational in essence and, while it requires some new datatypes and other adaptations, there is nothing in it that is fundamentally foreign to RDBMS technology.
This spring, we came through on the promise, delivering Virtuoso 7, packed full of all the state-of-the-art tricks in analytics-oriented databasing, column-wise compressed storage, vectored execution, great parallelism, and flexible scale-out.
At this same ESWC, Benedikt Kaempgen and Andreas Harth presented a paper (No Size Fits All -- Running the Star Schema Benchmark with SPARQL and RDF Aggregate Views) comparing Virtuoso and MySQL on the star schema benchmark at 1G scale. We redid their experiments with Virtuoso 7 at 30x and at 300x the scale.
At present, when running the star schema benchmark in SQL, we outperform column-store pioneer MonetDB by a factor of 2. When running the same star schema benchmark in SPARQL against triples as opposed to tables, we see a slowdown of 5x. When scaling from 30 to 300G and from one to two machines, we get linear increase in throughput, 5x longer for 10x more data.
Coming back to MySQL, the run with 1G takes about 60 seconds. Virtuoso SPARQL does the same on 30x the data in 45 seconds. Well, you could say that we should go pick on somebody our own size and not MySQL, which is not really relevant for this workload. Comparing with MonetDB and other analytics column stores is of course more relevant.
For cluster scaling, one could say that star schema benchmark is easy, and so it is, but even with harder ones, which do joins across partitions all the time, like the BSBM BI workload, we get scaling that is close to linear.
So, for analytics, you can use SPARQL in Virtuoso, and run circles around some common SQL databases.
The difference between SQL and SPARQL comes from having no schema. Instead of scanning aligned columns in a table, you do an index lookup for each column. This is not too slow if there is locality, as there is, but it still costs a lot more than a multicolumn, column-compressed table. With more execution tricks, we can maybe cut this to 3x.
The beachhead of workable RDF-based analytics on schema-less data has been attained. Medium-scale data, up to the single-digit terabytes, is OK on small clusters.
First, Big Data means more than querying. Before meaningful analytics can be done, the data must generally be prepared and massaged. This means fast bulk load and fast database-resident transformation. We have that via flexible, expressive, parallelizable stored procedures and run time hosting. One can do everything one does in MapReduce right inside the database.
Some analytics cannot be expressed in a query language. For example, graph algorithms like clustering generate large intermediate states and run in many passes. For this, bulk synchronous processing frameworks like Giraph are becoming popular. We can again do this right inside the DBMS, on RDF or SQL tables. There is great platform utilization and more flexibility than in strict BSP, while being able to do any BSP algorithm.
The history of technology is one of divergence followed by reintegration. New trends, like column stores, RDF databases, key-value stores, or MapReduce, start as one-off special-purpose products, and the technologies then find their way back into platforms addressing a broader functionality.
The whole semantic experiment might be seen as a break-away from the web, if also a little from database, for the specific purpose of exploring schemaless-ness, universal referenceability of data, self-describing data, and some inference.
With RDF, we see lasting value in globally consistent identifiers. The URI "superkey" is the ultimate silo-breaker. The future is in integrating more and more varied data and a schema-first approach is cost-prohibitive. If data is to be preserved over extended lengths of time, self-description is essential; the applications and people that produced the data might not be around. Same for publishing data for outside reuse.
In fact, many of these things are right now being pursued in mainstream IT. Everybody is reinventing the triple, whether by using non-first normal form key-value pairs in an RDB, tagging each row of a table with the name of the table, using XML documents, etc. The RDF model provides all these desirable features, but most applications that need these things do not run on RDF infrastructure.
Anyway, by revolutionizing RDF store performance, we make this technology a cost-effective alternative in places where it was not such before.
To get much further in performance, physical storage needs to adapt to the data. Thus, in the long term, we see RDF as a lingua franca of data interchange and publishing, supported by highly scalable and adaptive databases that exploit the structure implicit in the data to deliver performance equal to the best in SQL data warehousing. When we get the schema from the data, we have schema-last flexibility and schema-first performance. The genie is back in the bottle, and data models are unified.
David Karger: No, the shallow web (i.e., static web pages for purposes of search) is not big data. One can put it in a box and search. But for purposes of more complex processing, like analytics on the structure of the whole web, this is still big data.
Orri Erling: I am not sure about that, because when you have a stream -- whether this is network management and denial of service detection, or managing traffic in a city -- you know ahead of time what peak volume you are looking at, so you can size the system accordingly. And streams have a schema. So you can play all the database tricks. Vectored execution will work there just as it does for query processing, for example.
Orri Erling: Here we mean sliding windows and constant queries. The triple vs. row issue also seems the same. There will be some overhead from schema-lastness, but for streams, I would say each has a regular structure.
John Davies: For example, we gather gigabytes a minute of traffic data from sensors in the road network and all this data is very regular, with a fixed schema.
Manfred Hauswirth: Or is this always so? The internet of things has potentially huge diversity in schema, with everything producing a stream. The user of the stream has no control whatever on the schema.
Marko Grobelnik: Yes, we have had streams for a long time -- on Wall Street, for example, where these make a lot of money. But high frequency trading is a very specific application. There is a stream, some analytics, not very complicated, just fast. This is one specific solution, with fixed schema and very specific scope, no explicit semantics.
David Karger: Computer science has always been about big data; it is just the definition of big that changes. Big data is something one cannot conveniently process on a computer system. Not without unusual tricks, where something trivial, like shortest path, becomes difficult just because of volume. So it is that big data is very much about performance, and performance is usually obtained by sacrificing the general for the specific. The semantic world on the other hand is after something very general and about complex and expressive schema. When data gets big, the schema is vanishingly small in comparison with the data, and the schema work gets done by hand; the schema is not the problem there. Big data is not very internetty either, because the 40 TB produced by the telescope are centrally stored and you do not download them or otherwise transport them very much.
Manfred Hauswirth: The essential aspect is that data is machine interpretable, with sufficient machine readable context.
David Karger: Semantics has to do with complexity or heterogeneity in the schema. Big data has to do with large volume. Maybe semantic big data would be all the databases in the world with a million different schemas. But today we do not see such applications. If the volume is high, the schema is usually not very large.
Manfred Hauswirth: We are not so far from that; for example, a telco has over a hundred distinct network management systems, and each has a different schema.
Orri Erling: From the data angle, we have come to associate semantics with schema flexibility.
In conclusion, the event was rather peaceful, with a good deal of agreement between the panelists and audience and no heated controversy. I hoped to get some reaction when I said that semantics was schema flexibility, but apparently this has become a politically acceptable stance. In the golden days of AI this would not have been so. But then Marko Grobelnik did point out that the whole landscape has become data driven. Even in fields like natural language, one looks more at statistics than deep structure: For example, if a phrase is often found on Google, it is proper usage.
With Virtuoso 7, we dramatically improve the efficiency of all this. With databases in the billions of relations (also known as triples, or 3-tuples), we can fit about 3x as many relations in the same space (disk and RAM) as with Virtuoso 6. Single-threaded query speed is up to 3x better, plus there is intra-query parallelization even in single-server configurations. Graph data workloads are all about random lookups. With these, having data in RAM is all-important. With 3x space efficiency, you can run with 3x more data in the same space before starting to go to disk. In some benchmarks, this can yield a 20x gain.
Also, the Virtuoso scale-out support has been fundamentally reworked, with much more parallelism and better deployment flexibility.
So, for graph data, Virtuoso 7 is a major step in the coming of age of the technology. Data keeps growing and time is getting scarcer, so we need more flexibility and more performance at the same time.
So, let’s talk about how we accomplish this. Column stores have been the trend in relational data warehousing for over a decade. With column stores comes vectored execution, i.e., running any operation on a large number of values at one time. Instead of running one operation on one value, then the next operation on the result, and so forth, you run the first operation on thousands or hundreds-of-thousands of values, then the next one on the results of this, and so on.
Column-wise storage brings space efficiency, since values in one column of a table tend to be alike -- whether repeating, sorted, within a specific range, or picked from a particular set of possible values. With graph data, where there are no columns as such, the situation is exactly the same -- just substitute the word predicate for column. Space efficiency brings speed -- first by keeping more of the data in memory; secondly by having less data travel between CPU and memory. Vectoring makes sure that data that are closely located get accessed in close temporal proximity, hence improving cache utilization. When there is no locality, there are a lot of operations pending at the same time, as things always get done on a set of values instead of on a single value. This is the crux of the science of columns and vectoring.
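As a rough illustration of vectored execution (a sketch of the principle, not Virtuoso's actual operator interface), each operator consumes and produces a batch of values, so per-operator overhead is paid once per thousands of values rather than once per value, and the data in a batch is touched in close temporal proximity.

#include <cstdint>
#include <vector>

// One vector (batch) of values flowing between operators.
using Vec = std::vector<int64_t>;

// A trivial "filter" operator: applied to a whole batch at a time.
Vec filter_greater_than(const Vec& in, int64_t threshold) {
    Vec out;
    out.reserve(in.size());
    for (int64_t v : in)
        if (v > threshold)
            out.push_back(v);
    return out;
}

// A trivial "aggregate" operator: also consumes a whole batch.
int64_t sum(const Vec& in) {
    int64_t total = 0;
    for (int64_t v : in) total += v;
    return total;
}

// The pipeline runs batch by batch, e.g., 10,000 values per batch.
int64_t run_pipeline(const std::vector<Vec>& batches) {
    int64_t total = 0;
    for (const Vec& batch : batches)
        total += sum(filter_greater_than(batch, 100));
    return total;
}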
Of the prior work in column stores, Virtuoso may most resemble Vertica, well described in Daniel Abadi’s famous PhD thesis. Virtuoso itself is described in the IEEE Data Engineering Bulletin, March 2012. The first experiments in column store technology with Virtuoso were in 2009, published at the SemData workshop at VLDB 2010 in Singapore. We tried storing TPC-H as graph data and in relational tables, each with both rows and columns, and found that we could get 6 bytes per quad of space utilization with the RDF-ization of TPC-H, as opposed to 27 bytes with the row-wise compressed RDF storage model. The row-wise compression itself is 3x more compact than a row-wise representation with no compression.
Memory is the key to speed, and space efficiency is the key to memory. Performance comes from two factors: locality and parallelism. Both are addressed by column store technology. This made me a convert.
At this time, we also started the EU FP7 project LOD2, most specifically working with Peter Boncz of CWI, the king of the column store, famous for MonetDB and VectorWise. This cooperation goes on within LOD2 and has extended to LDBC, an FP7 project for designing benchmarks for graph and RDF databases. Peter has given us a world of valuable insight and experience in all aspects of avant-garde database technology, from adaptive techniques to query optimization and beyond. One recently published result is a Virtuoso cluster running analytics on 150 billion relations on CWI’s SciLens cluster.
The SQL relational table-oriented databases and property graph-oriented databases (Graph for short) are both rooted in relational database science. Graph management simply introduces extra challenges with regards to scalability. Hence, at OpenLink Software, having a good grounding in the best practices of relational columnar (or column-wise) database management technology is vital.
Virtuoso is more prominently known for high-performance RDF-based graph database technology, but the entirety of its SQL relational data management functionality (which is the foundation for graph store) is vectored, and even allows users to choose between row-wise and column-wise physical layouts, index by index.
It has been asked: is this a new NoSQL engine? Well, there isn’t really such a thing. There are of course database engines that do not have SQL support and it has become trendy to call them "NoSQL." So, in this space, Virtuoso is an engine that does support SQL, plus SPARQL, and is designed to do big joins and aggregation (i.e., analytics) and fast bulk load, as well as ACID transactions on small updates, all with column store space efficiency. It is not only for big scans, as people tend to think about column stores, since it can also be used in compact embedded form.
Virtuoso also delivers great parallelism and throughput in a scale-out setting, with no restrictions on transactions and no limits on joining. The base is in relational database science, but all the adaptations that RDF and graph workloads need are built-in, with core level support for run-time data-typing, URIs as native Reference types, user-defined custom data types, etc.
Now that the major milestone of releasing Virtuoso 7 (open source and commercial editions) has been reached, the next steps include enabling our current and future customers to attain increased agility from big (linked) open data exploits. Technically, it will also include continued participation in DBMS industry benchmarks, such as those from the TPC, and others under development via the Linked Data Benchmark Council (LDBC), plus other social-media-oriented challenges that arise in this exciting data access, integration, and management innovation continuum. Thus, continue to expect new optimization tricks to be introduced at frequent intervals through the open source development branch at GitHub, between major commercial releases.
In recent days, cyberspace has seen some discussion concerning the relationship between the EU FP7 project LDBC (Linked Data Benchmark Council) and sociotechnical considerations. It has been suggested that LDBC, to its own and the community’s detriment, ignores sociotechnical aspects.
LDBC, as research projects go, actually has an unusually large and, as of this early date, successful and thriving sociotechnical aspect, i.e., involvement of users and vendors alike. I will here discuss why, insofar as the technical output of the project goes, sociotechnical metrics are in fact out of scope. Then again, the degree to which the benefits potentially obtained from LDBC outcomes are in fact realized does depend strongly on community building, a social process.
One criticism of big data projects we sometimes encounter is the point that data without context is not useful. Further, one cannot just assume that one can throw several data sets together and get meaning from this, as there may be different semantics for similar-looking things; just think of 7 different definitions of blood pressure.
In its initial user community meeting, LDBC was, according to its charter, focusing mostly on cases where the data is already in existence and of sufficient quality for the application at hand.
Michael Brodie, Chief Scientist at Verizon, is a well known advocate of focusing on the meaning of data, not only on processing performance. There is a piece on this matter by him, Peter Boncz, Chris Bizer, and myself in the SIGMOD Record: "The Meaningful Use of Big Data: Four Perspectives – Four Challenges".
I had a conversation with Michael at a DERI meeting a couple of years ago about measuring the total cost of technology adoption, thus including socio-technical aspects such as acceptance by users, learning curves of various stakeholders, and whether in fact one could demonstrate an overall gain in productivity arising from semantic technologies. [in my words, paraphrased]
"Can one measure the effectiveness of different approaches to data integration?" asked I.
"Of course one can," answered Michael, "this only involves carrying out the same task with two different technologies, two different teams and then doing a double blind test with users. However, this never happens. Nobody does this because doing the task even once in a large organization is enormously costly and nobody will even seriously consider doubling the expense."
LDBC does in fact intend to address technical aspects of data integration, i.e., schema conversion, entity resolution, and the like. Addressing the sociotechnical aspects of this (whether one should integrate in the first place, whether the integration result adds value, whether it violates privacy or security concerns, whether users will understand the result, what the learning curves are, etc.) is simply too diverse and so totally domain dependent that a general purpose metric cannot be developed, at least not in the time and budget constraints of the project. Further, adding a large human element in the experimental setting (e.g., how skilled the developers are, how well the stakeholders can explain their needs, how often these needs change, etc.) will lead to experiments that are so expensive to carry out and whose results will have so many unquantifiable factors that these will constitute an insuperable barrier to adoption.
Experience demonstrates that even agreeing on the relative importance of quantifiable metrics of database performance is hard enough. Overreaching would compromise the project's ability to deliver its core value. Let us next talk about this.
It is only a natural part of the political landscape that the EC's research funding choices are criticized by some members of the public. Some criticism is about the emphasis on big data. Big data is a fact on the ground, and research and industry need to deal with it. Of course, there have been and will be critics of technology in general on moral or philosophical grounds. Instead of opening this topic, I will refer you to an article by Michael Brodie. In a world where big data is a given, lowering the entry threshold for big data applications, thus making them available not only to government agencies and the largest businesses, seems ethical to me, as per Brodie's checklist. LDBC will contribute to this by driving greater availability, better performance, and lower cost for these technologies.
Once we accept that big data is there and is important, we arrive at the issue of deriving actionable meaning from it. A prerequisite of deriving actionable meaning from big data is the ability to flexibly process this data. LDBC is about creating metrics for this. The prerequisites for flexibly working with data are fairly independent of the specific use case, while the criteria of meaning, let alone actionable analysis, are very domain specific. Therefore, in order to provide the greatest service to the broadest constituency, LDBC focuses on measuring that which is most generic, yet will underlie any decision support or other data processing deployment that involves RDF or graph data.
I would say that LDBC is an exceptionally effective use of taxpayer money. LDBC will produce metrics that will drive technological innovation for years to come. The total money spent towards pursuing goals set forth by LDBC is likely to vastly exceed the budget of LDBC. Only think of the person-centuries or even millennia that have gone into optimizing for TPC-C and TPC-H. The vast majority of the money spent for these pursuits is paid by industry, not by research funding. It is spent worldwide, not in Europe alone.
Thus, if LDBC is successful, a limited amount of EC research money will influence how much greater product development budgets are spent in the future. This multiplier effect applies of course to highly successful research outcomes in general but is especially clear with LDBC.
European research funding has played a significant role in creating the foundations of the RDF/Linked Data scene. LDBC is a continuation of this policy, however the focus has now shifted to reflect the greater maturity of the technology. LDBC is now about making the RDF and graph database sectors into mature industries whose products can predictably tackle the challenges out there.
The Linked Data Benchmark Council (LDBC) project is officially starting now.
This represents a serious effort towards making relevant and well-thought-out metrics for RDF and graph databases, and towards defining protocols for measurement and publishing of well-documented and reproducible results. This also entails the creation of a TPC analog for the graph and RDF domains.
The project brings together leading vendors, with OpenLink and Ontotext representing the RDF side and Neo Technology and Sparsity Technologies representing the graph database side. Peter Boncz of MonetDB and Vectorwise fame is the technical director, with participation from the Technical University of Munich with Thomas Neumann, known for RDF3X and HyPer. La Universitat Politècnica de Catalunya coordinates the project and brings strong academic expertise in graph databasing, also representing their Sparsity Technologies spinoff. FORTH (Foundation for Research and Technology - Hellas) of Crete contributes expertise in data integration and provenance. STI Innsbruck participates in community building and outreach.
The consortium has second-to-none understanding of benchmarking and has sufficient time allotted to the task for producing world class work, comparable to the TPC benchmarks. This has to date never been realized in the RDF or graph space.
History demonstrates that whenever something that is sufficiently important starts getting systematically measured, there is an improvement in the metric. The early days of the TPC saw a 40-fold increase in transaction processing speed. TPC-H continues to be, after 18 years, well used as a basis of quantifying advances in analytics databases.
A serious initiative for well-thought-out benchmarks for guiding the emerging RDF and graph database markets is nothing short of a necessary precondition for the emergence of a serious market with several vendors offering mutually comparable products.
Benchmarks are only as good as their credibility and adoption. For this reason, LDBC has been in touch with all graph and RDF vendors we could find, and has received a positive statement of intent from most, indicating that they would participate in an LDBC organization and contribute to shaping benchmarks.
There is further a Technical User Community, with its initial meeting this week, where present-day end users of RDF and graph databases will voice their wishes for benchmark development. Thus benchmarks will be grounded in use cases contributed by real users.
With these elements in place we have every reason to expect relevant benchmarks with broad adoption, with all the benefits this entails.
First, we wish to thank the many end user organizations that were present. This clearly validates the project's mission and demonstrates that there is acute awareness of the need for better metrics in the field. In the following, I will summarize the requirements that were brought forth.
Scale out - There was near unanimity among users that even if present workloads could be handled on single servers, a scale-out growth path was highly desirable. On the other hand, some applications were scale-out based from the get go. Even when not actually used, a scale-out capability is felt to be an insurance against future need.
Making limits explicit - How far can this technology go? Benchmarks need to demonstrate at what scales the products being considered work best, and where they will grind to a halt. Also, the impact of scale-out on performance needs to be made clear. The cost of solutions at different scales must be made explicit.
Many of these requirements will be met by simply following TPC practices. Now, vendors cannot be expected to publish numbers for cases where their products fail, but they do have incentives for publishing numbers on large data, and at least giving a price/performance point that exceeds most user needs.
Fault tolerance and operational characteristics - Present day benchmarks (e.g., the TPC ones) hardly address operational aspects that most enterprise deployments will encounter. This was already stated by Michael Stonebraker at the first TPC performance evaluation workshop some years back at VLDB in Lyon. Users want to know the price/performance impact of making fault-tolerant systems and wish to have metrics for things like backup and bulk load under online conditions. A need to operate across multiple geographies was present in more than one use case, thus requiring a degree of asynchronous replication such as log shipping.
Update-intensive workloads - Contrary to what one might think, RDF uses are not primarily load-once-plus-lookup. Freshness of data creates value, and databases, even if they are warehouses in character, need to be kept up to date much better than by periodic reload alone. Online updates may be small, as for example refreshing news feeds or web crawls, where the unit of update is small but updates are many, but they may also involve replacing reference data sets of hundreds of millions of triples. The latter requirement exceeds what is practical in a single transaction. ACID was generally desired, with some interest also in eventual consistency. We did not get use cases with much repeatable read (e.g., updating account balances), but rather atomic and durable replacement of sets of statements.
Inference - Class and property hierarchies were common, followed by use of transitivity. owl:sameAs was not in much use, being too dangerous, i.e., a single statement may potentially have huge effect and produce unpredictable sets of properties for instances, for which applications are not prepared. Beyond these, the wishes for inference, with use cases ranging from medicine to forensics, were outside of the OWL domain. These typically involved probability scores adding up the joint occurrence of complex criteria with some numeric computation (e.g., time intervals, geography, etc.).
As materialization of forward closure is the prevalent mode of implementing inference in RDF, users wished to have a measure of its cost in space and time, especially under online-update loads.
Text, XML, and Geospatial - There is no online application that does not have text search. In publishing, this is hardly ever provided by an RDF store, even if there is one in the mix. Even so, there is an understandable desire to consolidate systems, i.e., to not have an XML database for content and a separate RDF database for metadata. Also, many applications have a geospatial element. One wish was to combine XPATH/XQuery with SPARQL, and it was implied that query optimization should create good plans under these conditions.
There was extensive discussion especially on benchmarking full-text. Such a benchmark would need to address the quality of relevance ranking. Doing new work in this space is clearly out of scope for LDBC, but an IR benchmark could be reused as an add-on to provide a quality score. The performance score would come from the LDBC side of the benchmark. Now, many of the applications of text (e.g., news) might not even sort on text match score, but rather by time. Also if the text search is applied to metadata like labels or URI strings, the quality of a match is a non-issue, as there is no document context.
Data integration - Almost all applications had some element of data integration. Indeed, if one uses RDF in the first place, the motivation usually has to do with schema flexibility. Having a relational schema for everything is often seen to be too hard to maintain and to lead to too much development time before an initial version of an application or answer of a business question. Data integration is everywhere but stays elusive for benchmarking. Every time it is different and most vendors present do not offer products for this specific need. Many ideas were presented, including using SPARQL for entity resolution, and for checking consistency of an integration result.
A central issue of benchmark design is having an understandable metric. People cannot make sense of more than a few figures. The TPC practice of throughput at scale and price per unit of throughput at scale is a successful example. However, it may be difficult to agree on relative weights of components if a metric is an aggregate of too many things. Also, if a benchmark has too many optional parts, metrics easily become too complicated. On the other hand, requiring too many features (e.g. XML, full text, geospatial) restricts the number of possible participants.
To stimulate innovation, a benchmark needs to be difficult but restricted to a specific domain. TPC-H is a good example, favoring specialized systems built for analytics alone. To be a predictor of total cost and performance in a complex application, a benchmark must include much more functionality, and will favor general purpose systems that do many things but are not necessarily outstanding in any single aspect.
After 1-1/2 days with users, the project team met to discuss actual benchmark task forces to be started. The conclusion was that work would initially proceed around two use cases: publishing, and social networks. The present use of RDF by the BBC and the Press Association provides the background scenario for the publishing benchmark, and the work carried out around the Social Intelligence Benchmark (SIB) in LOD2 will provide a starting point for the social network benchmark. Additionally, user scenarios from the DEX graph database user base will help shape the SN workload.
A data integration task force needs more clarification, but work in this direction is in progress.
In practice, driving progress needs well-focused benchmarks with special trick questions intended to stress specific aspects of a database engine. Providing an overall perspective on cost and online operations needs a broad mix of features to be covered.
These needs will be reconciled by having many metrics inside a single use case, i.e., a social network data set can be used for transactional updates, for lookup queries, for graph analytics, and for TPC-H style business intelligence questions, especially if integrated with another, more relational dataset. Thus there will be a mix of metrics, from transactions to analytics, with single- and multi-user workloads. Whether these are packaged as separate benchmarks, or as optional sections of one, remains to be seen.
Questions on the exercise can be sent to the email specified in the previous post. I may schedule a phone call to answer questions based on the initial email contact.
We seek to have all applicants complete the exercise before October 1.
The exercise consists of implementing a part of the TPC-C workload in memory, in C or C++. TPC-C is the long-time industry standard benchmark for transaction processing performance. We use this as a starting point for an exercise for assessing developer skill level in writing heavily multithreaded, performance-critical code.
The application performs a series of transactions against an in-memory database, encountering lock contention and occasional deadlocks. The application needs to provide atomicity, consistency, and isolation for transactions. The task consists of writing the low-level data structures for storing the memory-resident database and for managing concurrency, including lock queueing, deadlock detection, and commit/rollback. The solutions are evaluated based on their actual measured multithreaded performance on commodity servers, e.g., 8- or 12-cores of Intel Xeon.
OpenLink provides the code for data generation and driving the test. This is part of the TPC-C kit in Virtuoso Open Source. The task is to replace the SQL API calls with equivalent in-process function calls against the in-memory database developed as part of the exercise.
We are aware that the best solution to the problem may be running transactions single-threaded against in-memory hash tables without any concurrency control. The application data may be partitioned so that a single transaction can be in most cases assigned to a partition, which it will get for itself for the few microseconds it takes to do its job. For this exercise, this solution is explicitly ruled out. The application must demonstrate shared access to data, with a transaction holding multiple concurrent locks and being liable to deadlock.
TPC-C can be written so as to avoid deadlocks by always locking in a certain order. This is also expressly prohibited; specifically, the stock rows of a new order transaction must be locked in the order they are specified in the invocation. In application terms this makes no sense, but for purposes of the exercise it serves as a natural source of deadlocks.
The application needs to offer an interactive or scripted interface (command line is OK) which provides the following operations:
Clear and initialize a database of n warehouses.
Run n threads, each doing m new order transactions. Each thread has a home warehouse and occasionally accesses other warehouses' data. This reports the real time elapsed and the number of retries arising from deadlocks.
Check the consistency between the stock, orders, and order_line data structures.
Report system status such as clocks spent waiting for specific mutexes. This is supplied as part of the OpenLink library used by the data generator.
The transactions are written as C functions. The data is represented as C structs, and tree indices or hash tables are used for value-based access to the structures by key. The application has no persistent storage. The structures reference each other by the key values, as in the database, so there are no direct pointers. The key values are to be translated into pointers with a hash table or other index-like structure.
The application must be thread-safe, and transactions must be able to roll back. Transactions will sometimes wait for each other in updating shared resources such as stock or district or warehouse balances. The application must be written so as to implement fine-grained locking, and each transaction must be able to hold multiple locks. The application must be able to detect deadlocks. For deadlock recovery, it is acceptable to abort the transaction that detects the deadlock.
C++ template libraries may be used, but one must pay attention to their efficiency.
The new order transaction is the only required transaction.
All numbers can be represented as integers. This holds for key columns as well as for monetary amounts.
All index structures (e.g., hash tables) in the application must be thread-safe, so that an insert is safe under concurrent access or concurrent inserts. This holds also for index structures for tables which do not get inserts in the test (e.g., item, customer, stock).
A sequence object must not be used for assigning new values to the O_ID column of ORDERS. These values must come from the D_NEXT_O_ID column of the DISTRICT table. If a new order transaction rolls back, its update of D_NEXT_O_ID is also rolled back. This causes O_ID values to always be consecutive within a district.
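A minimal sketch of what this implies for the new order transaction (the structure and names here are invented for illustration, not taken from the Virtuoso kit): the district row is locked by the transaction, D_NEXT_O_ID is read as the new order id and incremented, and the lock is only released at commit or rollback, with the rollback undoing the increment so ids stay consecutive.

#include <cstdint>
#include <mutex>

// Hypothetical in-memory district row; the mutex stands in for a row lock
// that is held to transaction end.
struct District {
    std::mutex lock;
    int64_t    d_next_o_id = 1;
};

// The new order transaction takes the district row lock, reads D_NEXT_O_ID as
// its O_ID, and keeps the lock until commit or rollback.
struct NewOrderTxn {
    District& dist;
    int64_t   o_id;
    bool      done = false;

    explicit NewOrderTxn(District& d) : dist(d), o_id(0) {
        dist.lock.lock();            // row lock on the district, held to txn end
        o_id = dist.d_next_o_id;     // this transaction's O_ID
        dist.d_next_o_id = o_id + 1;
    }
    void commit() {
        done = true;
        dist.lock.unlock();
    }
    void rollback() {
        dist.d_next_o_id = o_id;     // undo the increment so O_IDs stay consecutive
        done = true;
        dist.lock.unlock();
    }
    ~NewOrderTxn() { if (!done) rollback(); }
};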
The application must implement the TPC-C new order transaction in full. It must not avoid deadlocks by ordering the locking of stock rows; see the rules section.
The transaction must have the semantics specified in TPC-C, except for durability.
The test driver calling the transaction procedures is in tpccodbc.c. This can be reused so as to call the transaction procedure in-process instead of through the ODBC exec.
The user interface may be a command line menu with run options for different numbers of transactions with different thread counts and an option for integrity check.
The integrity check consists of verifying s_cnt_order against the orders, and checking that max (O_ID) and D_NEXT_O_ID match within each district.
Running the application should give various statistics such as CPU%, cumulative time spent waiting for locks, etc. The rdtsc instruction can be used for getting clock counts for timing.
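For example, a small wrapper around rdtsc (a sketch assuming x86 and the GCC/Clang __rdtsc intrinsic; the struct names are invented) can accumulate clock counts around a region such as a lock wait:

#include <cstdint>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang for x86

// Accumulates clock cycles spent inside a region, e.g., waiting for a lock.
struct CycleCounter {
    uint64_t total = 0;
    uint64_t count = 0;

    // Usage: { CycleCounter::ScopedTimer t(counter); /* ... wait ... */ }
    struct ScopedTimer {
        CycleCounter& c;
        uint64_t      start;
        explicit ScopedTimer(CycleCounter& counter) : c(counter), start(__rdtsc()) {}
        ~ScopedTimer() { c.total += __rdtsc() - start; ++c.count; }
    };
};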
This section summarizes some of the design patterns and coding tricks we expect to see in a solution to the exercise. These may seem self-evident to some, but experience indicates that this is not universally so.
The TPC-C transaction profile for new order specifies a semantics for the operation. The order of locking is left to the implementation as long as the semantics are in effect. The application will be tested with many clients on the same warehouse, running as fast as they can. So lock contention is expected. Therefore, the transaction should be written so as to acquire the locks with the greatest contention as late as possible. No locks need be acquired for the item table since none of the transactions will update it.
For implementing locks, using a mutex to serialize access to application resources is not enough. Many locks will be acquired by each transaction, in an unpredictable order. Unless explicit queueing for locks is implemented with deadlock detection, the application will not work.
If waiting for a mutex causes the operating system to stop a thread, even when there are cores free, the latency is multiple microseconds, even if the mutex is released by its owner on the next cycle after the waiting thread is suspended. This will destroy any benefit from parallelism unless one is very careful. Programmers do not seem to instinctively know this.
Therefore any structure to which access must be serialized (e.g. hash tables, locks, etc.) needs to be protected by a mutex but must be partitioned so that there are tens or hundreds of mutexes depending on which section of the structure one is accessing.
Submissions that protect a hash table or other index-like structure for a whole application table with a single mutex or rw lock will be discarded off the bat.
Even while using many mutexes, one must hold them for a minimum of time. When accessing a hash table, do the invariant parts first; acquire the mutex after that. For example, if you calculate the hash number after acquiring the mutex for the hash table, the submission will be rejected.
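A minimal sketch of what this means in practice (a hypothetical structure, not part of the provided kit): the hash table is striped over many mutexes, the stripe is selected from the key's hash before any mutex is taken, and the critical section covers only the touched stripe.

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// A hash table partitioned into many independently locked stripes.
template <size_t STRIPES = 64>
class StripedMap {
    struct Stripe {
        std::mutex                         lock;
        std::unordered_map<int64_t, void*> map;
    };
    Stripe stripes_[STRIPES];

    static size_t hash_of(int64_t key) { return std::hash<int64_t>{}(key); }

public:
    void insert(int64_t key, void* row) {
        size_t h = hash_of(key);                 // hash and stripe chosen before locking
        Stripe& s = stripes_[h % STRIPES];
        std::lock_guard<std::mutex> g(s.lock);   // only this stripe is serialized
        s.map.emplace(key, row);
    }

    void* find(int64_t key) {
        size_t h = hash_of(key);
        Stripe& s = stripes_[h % STRIPES];
        std::lock_guard<std::mutex> g(s.lock);
        auto it = s.map.find(key);
        return it == s.map.end() ? nullptr : it->second;
    }
};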
The TPC-C application has some local and some scattered access. Orders are local, and stock and item lines are scattered. When doing scattered memory accesses, the program should be written so that the CPU will, from a single thread, have multiple concurrent cache misses in flight at all times. So, when accessing 10 stock lines, calculate the hash numbers first; then access the memory, deferring any branches based on the accessed values. In this way, out-of-order execution will miss the CPU cache for many independent addresses in parallel. One can use the gcc __builtin_prefetch primitive, or simply write the program so as to have mutually data-independent memory accesses in close proximity.
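The pattern looks roughly like the following sketch (the stock layout and lookup are hypothetical and simplified): the first pass computes all the addresses and issues prefetches with no data-dependent branches, so several independent cache misses are outstanding at once; only the second pass touches and branches on the data.

#include <cstddef>
#include <cstdint>

struct StockRow { int64_t s_i_id; int32_t s_quantity; /* ... */ };

// Hypothetical: stock rows found through a simple array slot for illustration.
inline StockRow* stock_slot(StockRow* table, size_t table_size, int64_t key) {
    return &table[static_cast<size_t>(key) % table_size];
}

// n_items is assumed to be at most 16 (a TPC-C order line count is 5 to 15).
void update_stock_lines(StockRow* table, size_t table_size,
                        const int64_t* item_ids, int n_items) {
    StockRow* rows[16];
    // Pass 1: compute all addresses and prefetch them; no branch depends on the
    // fetched data yet, so many cache misses can be in flight in parallel.
    for (int i = 0; i < n_items; ++i) {
        rows[i] = stock_slot(table, table_size, item_ids[i]);
        __builtin_prefetch(rows[i]);
    }
    // Pass 2: only now touch the data and branch on it.
    for (int i = 0; i < n_items; ++i) {
        if (rows[i]->s_quantity >= 20)
            rows[i]->s_quantity -= 10;
        else
            rows[i]->s_quantity += 81;   // simplified TPC-C restock rule for quantity 10
    }
}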
For detecting deadlocks, a global transaction wait graph may have to be maintained. This will need to be maintained in a serialized manner. If many threads access it, the accesses must be serialized on a global mutex. This may be very bad if the deadlock detection takes a long time. Alternatively, the wait graph may be maintained on another thread. That thread will get notices of waits and of transaction completion from worker threads with some delay. Having spotted a cycle, it may kill one or another party. This will require some inter-thread communication. The submission may address this matter in any number of ways.
However, just acquiring a lock without wait must not involve getting a global mutex. Going to wait will have to do so, were it only for queueing a notice to a monitor thread. Using a socket-to-self might appear to circumvent this, but the communication stack will have mutexes inside so this is no better.
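One possible shape for this (a sketch only; submissions may organize it differently, and all names here are illustrative): workers enqueue wait-for edges for a monitor thread, which maintains the graph and looks for cycles, so acquiring an uncontended lock never touches the global structure and going to wait only takes the queue's own mutex briefly.

#include <condition_variable>
#include <map>
#include <mutex>
#include <set>
#include <vector>

// An edge in the wait-for graph: transaction `waiter` waits for `holder`.
struct WaitEdge { int waiter; int holder; bool removed; };

class DeadlockMonitor {
    std::mutex              qlock;   // taken only when a wait begins or ends
    std::condition_variable qcv;
    std::vector<WaitEdge>   queue;   // pending notices from worker threads
    std::map<int, std::set<int>> graph;   // touched by the monitor thread only

    bool reaches(int from, int target, std::set<int>& seen) {
        if (from == target) return true;
        if (!seen.insert(from).second) return false;
        for (int next : graph[from])
            if (reaches(next, target, seen)) return true;
        return false;
    }

public:
    // Called by a worker that is about to block (removed = false) or has stopped
    // waiting (removed = true); cheap, just queues a notice.
    void notify(int waiter, int holder, bool removed) {
        { std::lock_guard<std::mutex> g(qlock);
          queue.push_back({waiter, holder, removed}); }
        qcv.notify_one();
    }

    // Monitor thread body: drain notices, update the graph, report cycles.
    // kill_victim stands for whatever abort mechanism the submission uses;
    // the detecting waiter is chosen as the victim, which the rules allow.
    template <class KillFn> void run(KillFn kill_victim) {
        for (;;) {   // runs for the lifetime of the test
            std::vector<WaitEdge> batch;
            { std::unique_lock<std::mutex> g(qlock);
              qcv.wait(g, [&] { return !queue.empty(); });
              batch.swap(queue); }
            for (const WaitEdge& e : batch) {
                if (e.removed) { graph[e.waiter].erase(e.holder); continue; }
                graph[e.waiter].insert(e.holder);
                std::set<int> seen;
                if (reaches(e.holder, e.waiter, seen))   // cycle back to the waiter
                    kill_victim(e.waiter);
            }
        }
    }
};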
The exercise will be evaluated based on the run time performance, especially multicore scalability of the result.
Extra points are not given for implementing interfaces or for being object oriented. Interfaces, templates, and objects are not forbidden as such, but their cost must not exceed that of calling a function through an address fetched from a virtual table rather than calling it directly.
The locking implementation must be correct. It can be limited to exclusive locks and need not support isolation other than repeatable read. Running the application must demonstrate deadlocks and working recovery from these.
The TPC-C data generator and test driver are in the Virtuoso Open Source distribution, in the files binsrc/tests/tpcc*.c and files included from these. You can make the exercise in the same directory and just alter the files or the make script. The application is standalone and has no other relation to the Virtuoso code. The libsrc/Thread threading wrappers may be used. If not using these, make a wrapper similar to mutex_enter when MTX_METER is defined, so that it counts the waits and the clocks spent during waits. Also provide a report like that of mutex_stat() for mutex wait frequency and duration.
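If the Virtuoso wrappers are not used, something along these lines (a sketch; the class and method names are illustrative, not the Virtuoso API) gives the requested per-mutex wait counts and wait clocks:

#include <cstdint>
#include <cstdio>
#include <mutex>
#include <x86intrin.h>

// A mutex that counts how often callers had to wait and how many clocks they waited.
class MeteredMutex {
    std::mutex mtx;
    uint64_t   enters = 0;   // total acquisitions
    uint64_t   waits  = 0;   // contended acquisitions
    uint64_t   clocks = 0;   // rdtsc clocks spent waiting

public:
    void enter() {
        if (mtx.try_lock()) { ++enters; return; }   // fast path, no wait
        uint64_t start = __rdtsc();
        mtx.lock();                                 // contended path
        clocks += __rdtsc() - start;
        ++waits; ++enters;
    }
    void leave() { mtx.unlock(); }

    // Approximate report, read without locking; fine for a status printout.
    void report(const char* name) const {
        std::printf("%s: %llu enters, %llu waits, %llu clocks waiting\n",
                    name, (unsigned long long)enters,
                    (unsigned long long)waits, (unsigned long long)clocks);
    }
};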
We are looking for exceptional talent to implement some of the hardest stuff in the industry. This ranges from new approaches to query optimization; to parallel execution (both scale-up and scale-out); to elastic cloud deployments and self-managing, self-tuning, fault-tolerant databases. We are best known to the RDF world, but we also have full SQL support, and the present work will serve both use cases equally.
We are best known in the realms of high-performance database connectivity middleware and massively-scalable Linked-Data-oriented graph-model DBMS technology.
We have the basics -- SQL and SPARQL, column store, vectored execution, cost based optimization, parallel execution (local and cluster), and so forth. In short, we have everything you would expect from a DBMS. We do transactions as well as analytics, but the greater challenges at present are on the analytics side.
You will be working with my team covering:
Adaptive query optimization -- interleaving execution and optimization, so as to always make the correct plan choices based on actual data characteristics
Self-managing cloud deployments for elastic big data -- clusters that can grow themselves and redistribute load, recover from failures, etc.
Developing and analyzing new benchmarks for RDF and graph databases
Embedding complex geospatial reasoning inside the database engine. We have the basic R-tree and the OGC geometry data types; now we need to go beyond this
Every type of SQL optimizer and execution engine trick that serves to optimize for TPC-H and TPC-DS.
What do I mean by really good? It boils down to being a smart and fast programmer. We have over the years talked to people, including many who have worked on DBMS programming, and found that they actually know next to nothing of database science. For example, they might not know what a hash join is. Or they might not know that interprocess latency is in the tens of microseconds even within one box, and that in that time one can do tens of index lookups. Or they might not know that blocking on a mutex kills.
If you do core database work, we want you to know how many CPU cache misses you will have in flight at any point of the algorithm, and how many clocks will be spent waiting for them at which points. The same goes for distributed execution: the only way a cluster can perform is to have the maximum number of messages, each with maximum payload, in flight at all times.
These are things that can be learned. So I do not necessarily expect that you have in-depth experience of these, especially since most developer jobs are concerned with something else. You may have to unlearn the bad habit of putting interfaces where they do not belong, for example. Or to learn that if there is an interface, then it must pass as much data as possible in one go.
Talent is the key. You need to be a self-starter with a passion for technology and have competitive drive. These can be found in many guises, so we place very few limits on the rest. If you show you can learn and code fast, we don't necessarily care about academic or career histories. You can be located anywhere in the world, and you can work from home. There may be some travel but not very much.
In the context of EU FP7 projects, we are working with some of the best minds in database, including Peter Boncz of CWI and VU Amsterdam (MonetDB, VectorWise) and Thomas Neumann of the Technical University of Munich (RDF3X, HyPer). This is an extra guarantee that you will be working on the most relevant problems in database, informed by the results of the very best work to date.
For more background, please see the IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, Special Issue on Column Store Systems.
All articles and references therein are relevant for the job. Be sure to read the CWI work on run time optimization (ROX), cracking, and recycling. Do not miss the many papers on architecture-conscious, cache-optimized algorithms; see the VectorWise and MonetDB articles in the bulletin for extensive references.
If you are interested in an opportunity with us, we will ask you to do a little exercise in multithreaded, performance-critical coding, to be detailed in a blog post in a few days. If you have done similar work in research or industry, we can substitute the exercise with a suitable sample of this, but only if this is core database code.
There is a dual message: The challenges will be the toughest a very tough race can offer. On the other hand, I do not want to scare you away prematurely. Nobody knows this stuff, except for the handful of people who actually do core database work. So we are not limiting this call to this small crowd and will teach you on the job if you just come with an aptitude to think in algorithms and code fast. Experience has pros and cons so we do not put formal bounds on this. "Just out of high school" may be good enough, if you are otherwise exceptional. Prior work in RDF or semantic web is not a factor. Sponsorship of your M.Sc. or Ph.D. thesis, if the topic is in our line of work and implementation can be done in our environment, is a further possibility. Seasoned pros are also welcome and will know the nature of the gig from the reading list.
We are aiming to fill the position(s) between now and October.
Resumes and inquiries can be sent to Hugh Williams, hwilliams@openlinksw.com. We will contact applicants for interviews.
Abstract:
We discuss applying column store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics in the context of the OpenLink Virtuoso DBMS. In so doing, we need to obtain the excellent memory efficiency, locality and bulk read throughput that are the hallmark of column stores while retaining low-latency random reads and updates, under serializable isolation.
DBLP BibTeX Record 'journals/debu/Erling12' (XML)
@article{DBLP:journals/debu/Erling12,
  author    = {Orri Erling},
  title     = {Virtuoso, a Hybrid RDBMS/Graph Column Store},
  journal   = {IEEE Data Eng. Bull.},
  volume    = {35},
  number    = {1},
  year      = {2012},
  pages     = {3-8},
  ee        = {http://sites.computer.org/debull/A12mar/vicol.pdf},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
This is the thrust of what was said, noted from memory. My comments follow after the synopsis.
Jeremy Kepner: When Java was new, we saw it as the coming thing and figured that in HPC we should find space for it. When MapReduce and Hadoop came along, we saw this as a sea change in parallel programming models. This was so simple that literally anybody could make parallel algorithms, whereas this was not so with MPI. Even parallel distributed arrays are harder. So MapReduce was a game changer, together with the cloud where anybody can get a cluster. Hardly a week passes without me having to explain to somebody in government what MapReduce and Hadoop are about. We have a lot of arrays and a custom database for them. But the arrays are sparse, so this is in fact a triple store. Our users like to work in MATLAB, and any data management must run together with that.
Of course, MapReduce is not a real scheduler, and Hadoop is not a real file system. For deployment, we must integrate real schedulers and make HDFS look like a file system to applications. The abstraction of a file system is something people like. Being able to skip the time-consuming data-ingestion step a database requires is an advantage of file-based paradigms like Hadoop. If this is enhanced with the right scheduling features, it can be a good component in the HPC toolbox.
Michael Stonebraker: Users of the data use math packages like R, MATLAB, SAS, SPSS, or similar. If business intelligence is about AVG, MIN, MAX, COUNT, and GROUP BY, science applications are much more diverse in their analytics. All science algorithms have an inner loop that resembles linear algebra operations like matrix multiplication. Data is more often than not a large array. There are some graphs in biology and chemistry, but the world is primarily rectangular. Relational databases can emulate sparse arrays but are 20x slower than a custom-made array database for dense arrays. And I will not finish without picking on MapReduce: I know of 2000-node MapReduce clusters. The work they do is maybe that of a 100-node parallel database. So if 2000 nodes is what you want to operate, be my guest.
Science database is a zero billion dollar business. We do not expect to make money from the science market with SciDB, which by now works and has commercial services supplied by Paradigm 4, while the code itself is open source, which is a must for the science community. The real business opportunity is in the analytics needed by insurance and financial services in general, which are next to identical with the science use cases SciDB tackles. This makes the vendors pay attention.
Alex Szalay: The way astronomy is done today is through surveys: a telescope scans the sky and produces data. We have now operated the Sloan Digital Sky Survey for 10 years and kept the data online. We have all the data, and complete query logs, available for anyone interested. When we set out to do this with Jim Gray, everybody found it a crazy idea, but it has worked out.
Anastasia Ailamaki: We do not use SciDB. We find a lot of spatial use cases. Researchers need access to simulation results, which are usually over a spatial model, as in earthquake simulations and the brain. Off-the-shelf techniques like R-trees do not work -- the objects overlap too much -- so we have made our own spatial indexing. We make custom software when necessary, and are not tied to vendors. In geospatial applications, we can create meshes of different shapes -- tetrahedra or cubes for earthquakes, cylinders for the brain -- and index these in a geospatial index. Since an R-tree is inefficient when objects overlap this much, we use the index only to find one object; then, because each object is reachable from its neighbors, we walk the mesh to get all the objects in the area of interest.
* * *
This is obviously a diverse field. Probably the message that we can synthesize out of this is that flexibility and parallel programming models are what we need to pay attention to. There is a need to go beyond what one can do in SQL while continuing to stay close to the data. Also, allowing for plug-in data types and index structures may be useful; we sometimes get requests for such anyway.
The continuing argument around MapReduce and Hadoop is a lasting feature of the landscape. A parallel DB will beat MapReduce any day at joining across partitions; the problem is to overcome the mindset that sees Hadoop as the always-first answer to anything parallel. People will likely have to fail with this before they do anything else. For us, the matter is about having database-resident logic for extract-transform-load (ETL) that can do data-integration-style transformations, and maybe iterative graph algorithms that constantly join across partitions, better than a MapReduce job, while still allowing application logic to be written in Java.
Teaching sem-web-heads to write SQL procedures and to know about join order, join type, and partition locality has proven to be difficult. People do not understand latency, whether in client-server or cluster settings. This is why they do not see the point of stored procedures or of shipping functions to data. This sounds like a terrible indictment, like saying that people do not understand why rivers flow downhill. Yet it is true. This is also why MapReduce is maybe the only parallel programming paradigm that can be successfully deployed in the absence of this understanding: it is actually quite latency-tolerant, having no synchronous cross-partition operations except for the succession of the map and reduce steps themselves.
Maybe it is so that the database guys see MapReduce as an insult to their intelligence and the rest of the world sees it as the only understandable way of running grep and sed (Unix commands for string search/replace) in parallel, with the super bonus of letting you reshuffle the outputs so that you can compare everything to everything else, which grep alone never let you do.
* * *
Making a database that does not need data loading seems a nice idea, and CWI has actually done something in this direction in "Here are my Data Files. Here are my Queries. Where are my Results?" There is also a product called Algebra Data that claims to take in data without loading and to optimize storage based on access. We do not have immediate plans in this direction. Bulk load is already quite fast (100 GB of TPC-H loads in about 70 minutes), but faster is always possible.
Graph DB and RDF/Linked Data are distinct, if neighboring, disciplines. On one hand, graph problems predate Linked Data; the RDF/Linked Data world is a web artifact, which graphs as such are not, so a slightly different cultural derivation also keeps them apart. Besides, graphs may imply schema-first, whereas Linked Data basically cannot. Another differentiation comes from edges not really being first-class citizens in RDF, except via reification -- and the RDF reification vocabulary is miserably inadequate for this, as pointed out before.
RDF is being driven by the web-style publishing of Linked Open Data (LOD), with some standardization and uptake by publishers; Graph DB is not standardized but driven by diverse graph-analytics use cases.
There is no necessary reason why these could not converge, but it will be indefinitely long before any standards come to cover this, so best not hold one's breath. Communities are jealous of their borders, so if the neighbor does something similar one tends to emphasize the differences and not the commonalities.
So for some things, one could warehouse the original RDF of the web microformats and LOD, and then ETL into some other graph model for specific tasks, or just do these in RDF. Of course, then RDF systems need to offer suitable capabilities. These seem to be about very fast edge traversal within a rather local working set, and about accommodating large, iteratively-updated intermediate results, e.g., edge weights.
Judging by the benchmarks paper at the GDM workshop (Benchmarking Traversal Operations over Graph Databases, by Marek Ciglan, Alex Averbuch, and Ladislav Hluchy; slides (ppt), paper (pdf)), the state of benchmarking in graph databases is even worse than in RDF, where the state is bad enough. The paper's premise was flawed from the start, using application logic to do JOINs instead of doing them in the DBMS. In this way, latency comes to dominate, and only the most blatant differences are seen. There is nothing like this style of benchmarking to make an industry look bad. The supercomputer Graph 500 benchmark, on the other hand, lets the contestants make their own implementations on a diversity of architectures, with random traversal as well as loading and generating large intermediate results. It is somewhat limited, but still broader than the graph database benchmarks paper at the GDM workshop.
Returning to graphs, there were some papers on similarity search and clique detection. As players in this space, beyond just RDF, we might as well consider implementing necessary features for efficient expression of such problems. The algorithms discussed were expressed in procedural code against memory-based data structures; there is usually no query language or parallel/distributed processing involved.
MapReduce has become the default way in which people would tackle such problems at scale; in fact, people do not consider anything else, as far as I can tell. Well, they certainly do not consider MPI for example as a first choice. The parallel array things in Fortran do not at first sight seem very graphy, so this is likely not something that crosses one's mind either.
We should try some of the similarity search and clustering in SQL with a parallel programming model. We have excellent expression-evaluation speed from vectoring and unrestricted recursion between partitions, and none of the file system latencies of a MapReduce job.
Having some sort-of-agreed-upon benchmark for these workloads would make this more worthwhile. Again, we will see what emerges.
Bryan Thompson of Systap (the Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done: introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multi-column result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string, and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable-length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable-length column only in the O, but could of course have one also in S, and even in P and G. The column store would still compress the same as long as reified values did not occur. These values, on the other hand, would be unlikely to compress very well, but run-length and dictionary compression would always work.
So, we could do it like Bigdata®, or we could add a "quad ID" column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG->R.
Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of a reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one, and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter, and the sort of thing that discredits the semantic web in the eyes of anyone from the database world.
I heard from Bryan that the new W3C RDF Working Group had declared provenance out of scope, unfortunately. The word on the street, on the other hand, is that provenance is increasingly found to be an issue. This is confirmed by the active work of the W3C Provenance Working Group.
Being involved with at least one of these, and being in the audience, I felt obligated to comment. The fact is, neither OpenLink's LOD Cloud Cache nor Sindice is a business, and there is no business model that could justify keeping the web crawls they contain up to date. Doing so is easy enough, if there is a good enough reason.
The talk did make a couple of worthwhile points: The data does change; and if one queries entities, one encounters large variation in change-frequency across entities and their attributes.
The authors suggested having a piece of middleware decide what can safely be retrieved from a copy and what has to be retrieved from the source. Not much is in fact known about the change frequency of the data, except that it changes, as the authors pointed out.
The crux of the matter is that the thing that ought to know this best is the query processor at the LOD warehouse. For client-side middleware to split the query, it needs access to statistics that it must get from the warehouse or keep by itself. Of course, in concrete application scenarios, you go to the source if you ask about the weather or traffic jams, and otherwise go to the warehouse based on application-level knowledge.
But for actual business intelligence, one needs histories, so a search engine with only the present is not so interesting. At any rate, refreshing the data should leave a trail of past states. Exposing this for online query would just triple the price, so we forget about that for now. Just keeping an append-only table of history is not too much of a problem. One may make extracts from this table into a relational form for specific business questions. There is no point doing such analytics in RDF itself. One would have to just try to see if there is anything remotely exploitable in such histories. Making a history table is easy enough. Maybe I will add one.
Let us now see what it would take to operate a web crawl cache that would be properly provisioned, kept fresh, and managed. We base this on the Sindice crawl sizes and our experiments on these; the non-web-crawl LOD Cloud Cache is not included.
From previous experience we know the sizing: 144 GB RAM per 5 Gt (billion triples). Today's best price point is on 24-DIMM E5 boards, so 192 GB RAM, or 6.67 Gt. A unit like that (8 TB HDD, 0.5 TB SSD, 192 GB RAM, 12-core E5, InfiniBand) costs about $6800.
The Sindice crawl is now about 20 Gt, so $28K of gear (768 GB RAM) is enough. Let us count this 4 times: 2x for anticipated growth, and 2x for running two copies -- one for online, and one for batch jobs. This is 3 TB RAM. Power is 16 x 500 W = 8 kW, which we could round to 80 A at 110 V. Colocation comes to $500 per month for the space and $1200 per month for power; make it $2500 per month with traffic included.
At this rate, 3 year TCO is $120K + ( 36 * $2.5K ) = $210K. This takes one person half time to operate, so this is another $50K per year.
We do not count software development in this, except some scripting that should be included in the yearly $50K DBA bill.
Under what circumstances is such a thing profitable? Or can such a thing be seen as a marketing demo, to be paid for by license or service sales?
A third party can operate a system of this sort, but then the cost will be dominated by software licenses if running on Virtuoso cluster.
For comparison, a TB of RAM at EC2 costs ((( 16 * $2 ) * 24 ) * 31 ) = $23,808 per month. With reserved instances, it is ( 16 * ( $2192 + ((( 0.7 * 24 ) * 365 ) * 3 ))) / 36 = $8938 per month for a 3-year term. Counting at 3 TB, the 3-year TCO is $965K at EC2. AWS has volume discounts, but they start higher than this: the reserved-host premium, ( 3 * ( 16 * $2K )) = $96K, is under $250K. So if you do not even reach their first volume-discount threshold, it does not look likely you can cut a special deal with AWS.
(The AWS prices are calculated with the high memory instances, approximately 64GB usable RAM each. The slightly better CC2 instance is a bit more expensive.)
Yet another experiment to make is whether a system as outlined will even run at anywhere close to the performance of physical equipment. This is uncertain; clouds are not for speed, based on what we have seen. They make the most sense when the monthly bill is negligible in relation to the cost of a couple of days of human time.
The experiment is loading Sindice web crawls. The platform is 2 x Xeon 5520 with 144 GB RAM. The initial load rate is 180-200 Kt/s, dropping to 100 Kt/s at 5 Gt because of I/O. The system is the Virtuoso column store, configured to run as 4 processes and 32 partitions, all on the same box. After 5 Gt, we see just more I/O, and going further is not relevant; one runs CPU-bound or not at all.
We use 4 Crucial SSDs in the setup. The hot structures like the RDF quad indices are on SSD, and the cold ones are on hard disk. A cold structure is a write-only index like the dictionary of literals (id to lit).
For bulk load, SSDs turn out not to be particularly useful. For a cold start on the other hand, SSDs cut warmup time of 144G RAM from over half an hour to a couple of minutes. It is possible that Intel SSDs would also help with bulk load, but this has not been tried. The SSD problem during bulk load is that these do not write very fast, and while there are writes in queue, read latency goes up; so under a constant write load, the SSD's famous instantaneous random read no longer works.
The fragment considered in the example is 4.95 Gt: 8.1M pages worth of quads, 12.7M pages of literals and IRIs, and 4.71M pages of full-text index. A page is 8KB. The files on disk contain empty pages, but these do not matter since they do not take up RAM. The quad indices take 13.4 bytes/quad. The row-wise equivalent used to be around 38 bytes/quad with similar data. Two-thirds of the IRI and literal string data could benefit from column-wise stream compression. (This was not used, but if it were, we could count on a 50% drop in size for the data affected; so instead of 12.7M pages, we could maybe get 8.5M on a good day. This could be worth doing but is not a priority.) The system was configured to have 12M database pages in RAM, so a little under half the database pages of the set fit in RAM at one time; thus one cannot call this a memory-only setup. Given what locality there is in this unusually non-local data, this is as far as secondary storage can reach without becoming an over-2x slowdown. In practice, under 1% of the rows accessed come from secondary storage, but that alone means half throughput.
We note that this data set represents the worst that we have seen. It has 129M distinct graphs, i.e., 38 triples per graph. Regular data like the synthetic benchmark sets takes half the space per quad. This is about a third of a Sindice crawl; the other two-thirds look the same, as far as we looked.
So if you are interested in hosting data like this, you can budget 144GB RAM for every 5Gt. Do not try it with anything less. Budgeting double this is wise, so that you have space to cook the data; this is important since in order to do things with it, one needs to at least copy things for materializing transformations.
If you are budget-constrained and hosting very regular content like UniProt, you can budget maybe 144GB RAM for every 10Gt.
As for CPU, this does not matter as much, as long as you do not go to disk. Just for load speed, DBpedia is loaded in 300s on a cluster of eight (8) dual AMD 2378 boxes at 2.6GHz (total 8 cores per host, so 64 cores in the cluster), and in 945s on one (1) dual Xeon 5520 box at 2.26GHz (total 8 cores in the host). Intel makes much better CPUs, as we see. Both scenarios are 100% in RAM. For even more regular data, the load rates are a bit higher: 1.3 Mt/s for the AMD cluster, and 300 Kt/s for the Xeon host.
The interconnect for the AMD cluster is 1 x gigE but this does not matter for load. For CPU-bound cross-partition JOINs, 1 or 2 x gigE is insufficient; 4 x gigE might barely make it; InfiniBand should be safe. When running cross-partition JOINs, a single 8-core Xeon box generates about 300MB/s of interconnect traffic; a gigE connection can maybe take 50MB/s with some luck.
Intel E5 is not dramatically better than Nehalem, but this is something we will see in a while, when we make measurements on real equipment. Prior to the E5 release, we tried Amazon EC2 CC2 ("Cluster Compute Eight Extra Large Instance" -- 2 x 8-core E5, 2.66GHz). The results were inconclusive; it never did more than 1.9x better than the Xeon 5520, even on a pure CPU workload (a recursive Fibonacci function in SQL -- no cache misses, no I/O). With a database JOIN, 1.3x better is the best we saw. But this must be the fault of Amazon and not of the E5.
We also tried AMD "Magny-Cours", but with 32 cores against 8 it never did over 2x better -- more like 1.4x often enough -- and single-thread speed was 50% worse, so not a good buy. We did not find a Bulldozer to try, and did not feel like buying one, since the reviews did not promise more per-core speed than the Magny-Cours.
It seems that especially with Column Store, we are truly CPU-bound and not memory-latency- or bandwidth-bound. This is based on the observation that a Xeon 5620 with 2 of 3 memory channels populated loads BSBM data only 10% faster than the same with 1 of 3 channels populated, with CPU affinity set on a dual socket system.
So, if you have a choice between a $2K processor (E5-2690) and a $600 processor (E5-2630), buy the cheaper one and get RAM with the money saved. $1440 buys 128G in $90 8G DIMMs. Then buy E5 boards with 24 DIMMs -- one for every 7Gt of web crawl data. If your software licenses are priced per core, getting higher-clock 4-core E5’s might make sense.
While on the subject of bytes and quads/triples, we note that Bigdata®'s recent announcement says up to 50 billion triples per single server. Franz loaded at a good 800+ Kt/s rate, up to a trillion triples. One is led to think from the spec that this ran at less than full CPU but still with highly local data, considering that at 1.5 bytes per triple anything else would hit very heavy I/O. Their statement to the effect of LUBM-like data corroborates this, so we are not talking about exactly the same thing.
So if you compare the claims, I am talking about running CPU-bound on the worst data there is. Franz and Bigdata® do not specify, so it is hard to compare. LOD2 should in principle publish actual metrics with at least Bigdata®; Franz is not participating in these races.
We may publish some more detailed measurements with more varied configurations later. The thing to remember is minimum 144GB RAM for every 5Gt of web crawls, if you want to load and refresh in RAM.
The value is unquestionable, both to Virtuoso users in the short term, and to the state of science and to all RDF users and vendors in the mid term.
The LOD2 claim of "linking the universe" (my words) will be tested soon enough, after we first put the universe in a bucket. This refers to a real-time quad store of Sindice crawls, plus a warehouse of the LOD data sets.
This effort raises a few questions that I will treat in a number of posts to follow, such as --
What is done now is under-provisioned and not kept up to date. We are talking about all the RDF on the web in near real time with arbitrary queries. This is very far from the "billion triples" data sets or vertical portals, which are both easy by comparison.
Before this, we did, as promised, get the column store and vectored execution capabilities of Virtuoso 7 Single-Server Edition extended to Virtuoso 7 Cluster Edition. More interesting still, we decoupled storage from the database server process, so now database files can migrate between server processes. This means that clusters are now elastic, i.e., new servers can be added to a cluster and the load can be redistributed without reloading the data.
These things were long planned, but now are done. Measurements will be published in some weeks, as part of CWI's continued running of RDF store benchmarks, per the LOD2 plan.
Doing the column store and elastic cluster is work enough, so I do not in general participate in support or consultancy or the like. This has some pros and cons. On the plus side, there is a relative lack of noise and a very clear focus. Of course, this work is highly applied, thus always informed by use cases, so forgetting what ought to be done out there is not the problem. Rather, the problem is forgetting how things are in fact done, as opposed to how they could or should be done.
To cut a long story short, it has become clear to me that the DBMS must tell the application developer what to do. Of course, the application developer could also look at performance metrics, but they do not, and explaining these metrics is too much work and yields no lasting benefit. Developers will produce all kinds of performance diagnostic traces if requested, but going through this song and dance can also be avoided by the right automation.
So, I will introduce two new product features called Wazzup? and Saywhat?
Wazzup? is answered by a mood line, like "Heavily disk bound: 100G more memory will give 10x speedup" or "Network bound: Processing in larger batches will give 5x more throughput" and Saywhat? is answered by some commentary on the user's last action, for example "there is no ?order with o_totalprice < 0" or "there is no property O_misspelledtotallprrice."
Wazzup? is about overall system state, and Saywhat? is about the user session, specifically query plans. But an explanation of a query plan is not understandable, so this will just point out some salient facts, like the reason why the answer comes out empty.
The other thing that came to my attention is the fact that a user has no instinctive feel for ETL. A database person takes it for a self-evident truth that data is loaded in bulk, but the application developer does not think of that. Likewise, the line between warehousing and federating is not instinctively felt; actually the question is not even posed in these terms. So one will find Web protocols and end-points and glue code on the app server when one ought to have ETL and adequate hardware for running the consolidated database.
Further, under-provisioning of equipment is endemic with semanticists. The Semantic Web gets a needlessly bad rap just because we find too much data on too little equipment. For example, I was surprised to learn that the Linked Geodata demo ran on only 16 GB RAM and 6 processor cores with 2 billion triples and 350 million points in a geo index. Now, even with our greatest space efficiency advances, there is no way this will run from memory.
It is not that the Web 2.0 stack is necessarily efficient (we hear the wildest stories of lack of database understanding from that side too), but at least there is a culture of running with enough equipment. Surely when the web-scale data gear (e.g., Google Bigtable, Yahoo PNUTS, Amazon Dynamo) was new, by the operators' own admission there was no way for this to be particularly efficient, database-wise. Not if your eventual consistency is a client application over sharded MySQL back-ends. For a lookup or single-record-update workload, who cares, when there is enough hardware? For analytics, there is the de facto impossibility of doing big joins, but MapReduce is for that, all offline. The big web houses have always known how to deal with data; it is the smaller Web 2.0 guys who patch systems together with duct tape and memcache. Even so, the online experience gets created.
Semanticism has no part of this outlook, except maybe for Freebase, but then they are from California and now have been inside Google for a while.
We quite understand that when one needs to get big data online, one makes a key-value store as a point solution, because this way one owns what one operates, and the time to market is a lot shorter than if one tried building all this inside a general-purpose DBMS. Besides, the people who can in fact do this almost do not exist, and even if one had a whole army of this rare breed, development is not very scalable in a tightly-integrated system like a high-performance DBMS. Still further, to even start, one needs to own the DBMS, meaning that the initial platform must be known through and through. This is an issue even though open source platforms exist.
The graph data, semdata, schema-last, RDF, linked data enterprise -- whatever one calls it -- makes the bold proposition of bringing complex-query-at-scale to heterogeneous data. This is a database claim.
In the meantime, test deployments are made in defiance of database best practices. This is a bit like test driving a race car in reverse gear and steering by looking in the rear-view mirror.
There is also no short-term scalable way to educate people. At the LOD2 review, one comment was that an integrated project ought to clearly indicate how to set up the tool chain for good performance, especially as concerns interfaces between the tools. This is very true. Experience shows that developers of tools cannot accurately anticipate what usage patterns will emerge in the field. Therefore, we propose to do better than just documentation; we will make the server recognize the common sources of inefficiency and point the user to the right action.
Imagine the following conversation:
DBMS: Your application does single-triple INSERTs over client-server protocol all day, from a single client. 57% of real time goes in client server latency, 40% in cluster interconnect latency, 2% in compiling the statements, and 1% in doing the work. Use array parameters or bulk load from a file.
Operator: My developers use industry-standard Java class libraries with a service-oriented architecture and strictly enforced interfaces. This is called software engineering. Watch out ere you raise your voice against the canon.
[Some weeks later, after the load job has gone on for 10 days and gotten a third of the way, developers have discovered that JDBC has array parameters and are trying these.]
DBMS: 60% of real time goes into waiting for locks. 10% of transactions get aborted for deadlock. Transactions consist of an average of 10 client-server operations. Use stored procedures; acquire locks in predictable order; do SELECT FOR UPDATE. Throughput will be 4x higher if client-server operations are merged into a single operation. The transactions only INSERT; hence consider bulk load instead.
Operator: We are using an enterprise-class three-tier architecture. It has "enterprise" in the name and all the big guys are using it, so it must be scalable. Besides, it is distributed transactions, and distributed computing is the wave of the future. You are a cluster yourself, so the pot's got no business calling the kettle black.
[After a while, the data gets loaded with bulk load, but now on a single stream.]
DBMS: CPU is at 400% for an INSERT workload; adding more parallel threads will get 4.5x better throughput.
[Some time has elapsed and there are Ajax client apps out there trying to use the data.]
DBMS: Will you really not give me another 140 GB RAM and 16 more cores?
Operator: No, on general principles I will not, shut up.
DBMS: Do you know that your page impression takes 3 seconds and anything over 0.25 seconds is visibly slow? 300 GB worth of distinct pages have been accessed in the last 24 hours for 160 GB of RAM. Latency will drop 10x by using SSD; 50x by increasing RAM.
Operator: No dice, bucket. Shut up, besides, when I scroll through the data I always use for testing, I get it fast enough, you are just doing this out of greed and self-importance. You are a server among many, just like the mail server; you databases are just pretentious.
Currently addressing any of the above sorts of issues takes a long time and involves mostly-avoidable support communication. Questions of this sort do occur. We can probably produce commentary like the above based on logging some 50 numbers, and making some 15 regularly-run reports over these. The patterns to watch out for are well known. No, we will not make a Zippy the Pinhead office assistant; a computer should not try to be cute. This one will talk only in terms of gains from adjusting the deployment or usage patterns.
Now, suppose the operator said yes to the request for more cores and memory; then it would be up to the DBMS to deliver. This entails a capacity to redistribute itself automatically, and to give a quantitative report on the success of this measure. This means usage-based repartitioning of the data to equalize load over a cluster. The relevant metric in the above case is the drop in response time. On the other hand, the DBMS should also notice if there is clearly unused capacity.
This all will be presented as a line in the status report, so there is no extra wizard or workload analyzer that one must remember to run. For programmatic use there are SQL views for the relevant reports.
As for ETL, even if the DBMS can detect that it is not being done right, this does not mean that the user will know what to do. Therefore, for all the Web harvesting we support, as well as any import from local file system or Web services, with some RDF-ization, we will simply implement a proper ETL utility that will do things right. Wazzup? can just point the user to that if the workload looks like loading. This will have its own status report giving a load and transform rate and will point out what takes the longest, after everything is duly parallelized and made asynchronous.
Beyond these lessons, there is more to say about the review and plenary; we will get to that a bit later. We did promise a new edition of the LOD cache in a couple of months, now on the clustered column-store platform. Look for advances in data discoverability.
The Semantic Technology Institute (STI) is organizing a meeting around the question of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of what kind of inference will actually be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries, but what is the exact requirement for transformation and alignment of schemas and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement, but it is time to look closely at the details.
Databases and knowledge representation both have decades of history, but to date the exchange of ideas and techniques between these disciplines has been limited. The intuition that there would be value in greater cooperation has not failed to occur to researchers on either side; after all, both sides deal with data. From this, we have seen deductive databases emerge, as well as more recently "database friendly" profiles of OWL.
In this position paper we will examine what, in the most concrete terms, is needed in order to bring leading edge database technology together with expressive querying and reasoning. This draws on our experience in building Virtuoso, one of today's leading graph data stores. Following this, we argue for the creation of benchmarks and challenges that in fact do reflect reality and facilitate open and fair comparison of products and technologies.
Data integration is often mentioned as the motivating use case for GDB, commonly popularized today as RDF. Database research has over the past few years produced great advances for business intelligence (i.e., complex queries and read-mostly workloads). These advances are typified by compressed columnar storage and architecture-conscious execution models, mostly based on the idea of always processing multiple sets of values in each operation (vectoring). With these techniques, raw performance with relatively simple schemas and regular data (e.g., TPC-H) is no longer a barrier to extracting value from data.
A similar breakthrough has not been seen on the semantics side. Data integration still requires manual labor. Publishing GDB datasets is a good and necessary intermediate stage, but producing these datasets from diverse sources is not fundamentally different from doing the same work without GDB or RDF. Even so, GDB and RDF serve as a catalyst for a culture of publishing datasets.
GDB, as a base model for integration, offers the following benefits over a purely relational result format:
Obtaining this flexibility on a relational basis would simply require moving to a graph-like representation with essentially one row per attribute. Indeed, we see key-value stores being used in online applications with high volatility of schema (e.g., social networks, search); and we also see relational applications making provisions for post-hoc addition of per-entity attributes (i.e., associating a bag of mixed non-first-normal-form data with entities). The benefits of a schema-last approach are recognized in many places.
GDB seems a priori a fit for all these requirements, thus how will it claim its place as a solution?
The first part of the answer lies in learning all the relevant database lessons. The second part lies in eliminating the impedance mismatch between querying and reasoning. The third and most important part consists of substantiating these claims in a manner that is understandable to the relevant publics, finally leading to the creation of a semantics-aware segment of the database industry. We will address each of these aspects in turn.
The problem is divided into storage format, execution, and query optimization. For the first two, Daniel Abadi's renowned Ph.D. thesis holds most of the keys. Space efficiency is especially important for Linked Data, since data is often voluminous, and many datasets have to be brought together for integration. Access patterns are also unpredictable, with indexed random access predominating, as opposed to RDB BI workloads where sequential scans and hash joins represent the bulk of the work. However, we find that a sorted, column-wise compressed representation of Linked Data with a single quad table for all statements gives excellent space efficiency and good random access, as well as good random insert speed. The space efficiency is close to par with the equivalent column-wise relational format, since three of the four columns of the quad table compress to almost nothing. As many sort orders as are necessary may be maintained, but we find that two are enough, with some extra data structures for dealing with queries where the predicate is unspecified. The details are found in the VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data. Since GDB/RDF is a model typed at run time, the engine must support an "ANY" data type for columns and query variables, where values on successive rows may be of different types. This is a straightforward enhancement.
Vectored execution is traditionally associated with column stores because the per-row access cost is relatively high, so many nearby rows must be accessed at a time in order to amortize the overhead. Aside from this, vectored execution provides many opportunities for parallelism, from the instruction level all the way to threading and distributed execution on clusters; thus some form of execution over large numbers of concurrent query states is needed for RDF stores, just as it is needed for RDBMSs.
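A minimal sketch of what vectoring means at the operator level, under assumed names and a made-up vector size: the operator receives a whole vector of values and produces a selection vector of qualifying positions, so interpretation overhead is paid once per vector and the inner loop stays tight and branch-free.

    #include <stdint.h>
    #include <stddef.h>

    #define VEC_SIZE 1024            /* illustrative vector (batch) size */

    /* Range filter over one vector of values: fills a selection vector with the
       positions whose value is in [lo, hi] and returns how many qualified. */
    size_t
    vec_range_filter (const int64_t vals[VEC_SIZE], int64_t lo, int64_t hi,
                      uint32_t sel_out[VEC_SIZE])
    {
      size_t n_out = 0;
      size_t i;
      for (i = 0; i < VEC_SIZE; i++)
        {
          sel_out[n_out] = (uint32_t) i;                       /* candidate position */
          n_out += (size_t) (vals[i] >= lo && vals[i] <= hi);  /* keep it or not */
        }
      return n_out;
    }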
Query optimization for GDBMS is similar to that for RDBMS, except that the statistics can no longer be collected by column and table, but must rather apply to individual entities and ranges of a single quad table. This can be provided through run-time sampling of the database based on constants in the query being optimized. This may take into account trivial inference such as expanding properties into the set of their sub-properties and the like. Beyond this, interleaving execution and optimization (as in ROX) seems to offer limitless possibilities, especially when inference is introduced, making optimizer statistics less predictive.
In summary, starting with an RDBMS and going to GDB entails changes to all parts of the engine, but these changes are not fundamental. One does, however, need to own the engine; otherwise the expertise for efficiently implementing these changes will not exist. Essentially any DBMS technique may be translated to a GDB use case, if its application can be decided at run time. GDB may be schema-less, yet most datasets have fairly regular structure; the question is simply to reconstruct the needed statistics and schema information from the data on an as-you-go basis. Techniques with a high up-front cost, like constructing specially ordered materializations for optimizing specific queries, are harder to deploy but still conceivable for GDB.
Compared to the straightforwardly performance-oriented world of database engines, the contours of the landscape become less defined when moving to inference. Databases, whether relational or schema-less, all perform roughly the same functions, but inference is more diverse. We include here also techniques like machine learning and meta-reasoning for guiding reasoning, although these might not strictly fit the definition.
As we posit that data integration is the motivating use case for GDB as opposed to RDB (Relational Database Model), we must ask which modes of inference are actually required for data integration. Further, we need to ask whether these inferences ought to be applied as a preprocessing step (ETL or forward chaining), or as needed (backward chaining). Some low-hanging fruit can be collected by simply constructing class or property hierarchies; e.g., in the data at hand, the following properties have the meaning of company name, and the following classes have the meaning of company. We have found that such techniques can be efficiently supported at run-time, without materialization, if the support is simply built into the engine, which is in itself straightforward as long as one controls the engine. The same applies to trivial identity resolution, such as owl:sameAs or resolution of identity based on sharing an inverse-functional property value. These things take longer at run-time, but if one caches and reuses the result, one can get around materialization.
We do not believe in weak statements of identity, as in X is similar to Y, since the meaning of similarity is entirely contextual. X and Y may or may not be interchangeable depending on the application; thus the statement on identity needs to be strong, but it must be easy to modify the grounds on which such a statement is made. This is a further argument for why one should not automatically materialize consequences of identity, particularly if dealing with web data where identity is especially problematic.
Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling ("address as many columns in customer table" vs. "address normalized away under a contact entity"), normalization ("first name" and "last name" as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world!
If data exists, the conversion questions are often answerable but their answer depends on context -- e.g., date of transaction for currency exchange rate; source of data for the definition of blood pressure.
Alongside these, there remain issues of identity, e.g., depending on the perspective, a national subsidiary is or is not the same entity as the parent company, companies with the same name can be entirely unrelated in different jurisdictions.
It appears that we may need a multi-level approach, combining different techniques for different phases of the integration process. We do not a priori believe that using SQL VIEWs for unit and modeling conversion, and then OWL for unifying terminology on top of this, would be the whole solution. Even if this were the solution, the pipeline from the relational sources to SPARQL and OWL needs to be optimized for real-world BI data volumes, and the query language needs to be able to express the business questions and to interface with the reporting tools the analyst has come to expect.
Our answer so far consists of a SPARQL extension with non-recursive rules, roughly equivalent to SQL VIEWs in expressive power, tightly integrated to the query engine. There is also limited support for recursion through transitive subqueries; thus one can compactly express things like "all parts of all assemblies and subassemblies must satisfy applicable safety requirements, where the requirements depend on the type of the part in question."
This is only an intermediate step. We believe that part of the answer is a database-scale generic inference engine with at least Datalog power, plus second-order extensions like computed predicates, executing inside the DBMS and benefiting from the whole array of optimizations database science expects of execution engines.
This will not relieve the analyst of having to consider that the currency rates in effect at the time of conversion must be taken into account when calculating profits, but this will at least make expressing this and similar pieces of context more compact.
We note that time-to-answer has historically won over raw performance. This was also the case for RDBMS when these were the fresh challenger to the CODASYL incumbents, just as was the case with the adoption of high-level languages. The key is that the raw performance must be sufficient for the real world task. With the adoption of the database lessons outlined in the previous section, we believe this to be the case for GDB (and thus, RDF).
Benchmarks have a stellar record for improving any metric they measure. The question is, how can we make a metric that measures GDB's ability to deliver on its claim to fame -- time-to-answer for big data -- with all the integration and other complexities this entails?
So far, GDB benchmarks have consisted of workloads where RDBMS are clearly better (e.g., LUBM, or the Berlin SPARQL Benchmark). This does not remove their usefulness for GDB, but does not constitute a GDB selling point, either.
We suggest a dual approach. The first part is demonstrating that GDB is scalable for BI: We take the industry standard decision support benchmark TPC-H, which is very favorable to RDB and quite unfavorable to GDB, and show that we can tackle the workload at reasonable cost. If TPC-H is all one wants, an RDBMS will stay a better fit, but then this benchmark does not capture any of the heterogeneity, schema evolution, or other such requirements faced by real-world data warehouses. This is still a qualification test, not the selling point.
The issue of benchmark is inextricably tied to the issue of messaging. There must be a compelling story, with which the IT community can identify. Further, the benchmark must capture real-world challenges in the area of interest. With all this, the benchmark should not be too expensive to run. Here too, a multistage approach suggests itself.
Our tentative answer to this question is the Social Intelligence Benchmark (SIB), developed together with CWI and other partners in the LOD2 consortium. This simulates a social network and combines an online workload with complex analytics. This benchmark should cover all of the target areas of the LOD2 project, so that the project itself generates its own metric of success. The project has clear data integration targets, especially as applies to Web and Linked Data. Questions of integration with enterprise sources need to be further developed; for example, comparing CRM data with extractions from the online conversation space for market research.
Data integration will invariably involve human effort, and the area cannot be satisfactorily covered with metrics of scale and throughput alone. Development time, accuracy of results, and cost of maintenance are all factors. Furthermore, the task being modeled must correspond to reality, still without being too domain-specific or prohibitively time-consuming to implement.
The data driven world will increase rewards for efficiency in data integration. We believe that such efficiency crucially depends on semantics. Real world requirements just might throw the database and AI communities together with enough heat and pressure for fusion to ignite, allegorically speaking. Without a clear and present need, the geek world analog of electrostatic repulsion will keep the communities separate, as has been the case thus far, and no new, qualitatively-different element will arise.
Efforts such as this STI Summit and the LOD2 Project are needed for setting directions and communicating the requirement to the research world. In our fusion analogy, this is the field which directs the nuclei to collide.
Once there is an actual reaction that produces more than it consumes by a sufficient margin, regular business dynamics will take over, and we will have an industry with several products of comparable capability, as well as a set of metrics, all to the benefit of the end user.
TPC-H results pages
Daniel Abadi's Ph.D. Thesis, Query Execution in Column-Oriented Database Systems ( PDF )
Our VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data ( HTML | PDF )
CWI's ROX: Run-time Optimization of XQueries ( PDF )
This is substantially about the intersection of AI, knowledge representation, and databases. As we have said before, the database side has not been very prominent in these meetings in the past, but this time we had Peter Boncz of CWI, of MonetDB and VectorWise fame, attending the proceedings.
Will DB and AI finally meet? Well, they have met, but how do they get along? Before I try to answer this, let us look at some background.
At present, CWI and OpenLink are working together in the LOD2 EU FP7 project, around the general topic of bringing the best of Relational Database (RDB) science to the Graph Database (GDB) world. Virtuoso has for a few months had a column store capability (which is about to be made available for public preview). CWI has a long history of column store work, with MonetDB and Ingres VectorWise as results. OpenLink's column store implementation is separate in terms of code but is of course influenced by the work at CWI and other published column store results. The plan is to transplant the applicable CWI innovations into the graph context within Virtuoso. These improvements naturally also benefit Virtuoso RDB (SQL), but the LOD2 project is primarily concerned with GDB applications. The RDB yardstick for much of this work is TPC-H, of which we have made a GDB translation. CWI is uniquely qualified as concerns this in light of VectorWise holding some of the top places in the TPC-H charts.
Even now, we do in fact run the 22 TPC-H queries in SPARQL against the Virtuoso column store. True, these run faster in SQL against relational tables, but we have established a beachhead. From this initial position, we can incrementally improve the GDB/SPARQL and RDB/SQL functions, and see how close to SQL we get with SPARQL. I will make a separate post commenting on the differences between SQL and SPARQL.
So let's get back to Riga. Mark Greaves said in his opening comments that he would be sick if he once again heard complaints about how bad and unscalable the tools were. From all the talks, I did get the overall impression that just better databasing for Graph Data is still needed. OK, we have 1-1/2 years of unreleased work just for that about to hit the street; advances are substantial. Along these lines, the people from Bio2RDF pointed out that there still is a cost to publishing query services, especially for complex queries. Well, this cost will be substantially reduced.
The takeaway from the meeting is that the most useful thing, for both our public and ourselves, is simply to keep advancing database tech for graph data. In the first instance, this is about launching what we already have; in the second, about going through the CWI record of innovation and adapting this to GDB.
The thinking is that once query-answering on some tens-of-billions of triples is easily interactive no matter what question one asks, a tipping point will be reached, and GDB can efficiently play the role of data-melting-pot that has been envisioned for it.
This is just a beginning, though. Michael Brodie has on a number of occasions pointed out that (relational) database guys are only about performance, with little or no regard to meaning or even questions of the applicability of the relational model. Peter Boncz then comments back that it can well be that the bulk of IT expenditure worldwide in fact goes into data integration. However, data integration is an "AI-complete" problem with infinite variety and consequent difficulty of measurement. So, making better database engines stands a much greater chance of success and has the nicety of relatively unambiguous metrics.
Quite so. We are somewhere in the middle. I'd say that GDB is still at the stage where making better databases is a matter of make-or-break and not a matter of cutting already vanishingly-short response times just for the sake of it. We will have progress if we just keep at it; for now, performance is still a basic need and not a luxury.
Now that there is all this potentially integrable data published as graphs (most commonly as RDF serializations), what do we do? Someone at the Riga meeting suggested we take a look across the tracks to the RDB world to see what is being done there for data integration. The question is raised, what does GDB have for data integration? The automatic answer that GDB and RDF have OWL is not adequate, as was rightly pointed out by many. Having schema-last, global identifiers, and some culture of vocabulary reuse is nice, but this is only a start. To cite an example, owl:sameAs will not work when entities simply do not align: one database models a product as a parts hierarchy; another does the same, but based on the materials used in the parts. One tree just has a node that is not in the other. Besides, things like string matching (as in extracting area codes from phone numbers) are common, and OWL specifically excludes any such functions.
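A hedged sketch of the kind of matching OWL rules out but a query can express, with made-up crm: and web: vocabularies: align contacts from two sources on the area code embedded in their phone numbers.

PREFIX crm: <http://example.org/crm#>
PREFIX web: <http://example.org/webdata#>

# Assume phone numbers are strings like "+1-617-5551234"; compare the
# three digits of the area code only. This is ordinary string matching,
# which owl:sameAs and OWL axioms cannot do for us.
SELECT ?crmContact ?webPerson
WHERE {
  ?crmContact crm:phone ?phone1 .
  ?webPerson  web:tel   ?phone2 .
  FILTER ( SUBSTR(STR(?phone1), 4, 3) = SUBSTR(STR(?phone2), 4, 3) )
}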
It is now time to look at what will come after all the database advances. In my talk I outlined some things that have or are about to get solutions:
Database technology: Applying advances from RDB (specifically columns, vectoring, and some adaptive query execution) will make GDB a possibility for data warehousing at some scale.
Benchmarks: These advances will be demonstrable through benchmarking. There is a better suite of benchmarks, with many variations of BSBM, a GDB-modified TPC-H, and the upcoming Social Intelligence Benchmark (SIB) with actual graph data. There are the beginnings of an auditing process for result publishing, and a fair chance the semdata world will get its analog of the TPC.
After these basics are more or less in hand, we have a vista of more diverse questions:
What to do about inference? We do not want OWL or RIF for their own sake; instead we want whatever will declaratively facilitate making sense of data. This is an entirely use-case-driven question. If this can have a reasonably generic answer, we will build it into the engine.
Data integration is highly diverse, and tool sets like IBM Infosphere have thousands of modules and functions for different aspects of the problem. To what degree does it make sense to put DI-oriented capabilities into a DBMS?
Is it the case that SQL or SPARQL, plus or minus a few details, is as powerful as a language can be while staying application domain-agnostic? In other words, if more powerful reasoning is built into the query language, will the requirements vary so much between application domains that the work is not generally applicable? Datalog is general enough, but can we demonstrate substantially reduced time to answer with big data if this is built into the engine? Berkeley Orders Of Magnitude claims this, even though their claim is not exactly in a database context. We need use cases to refine the actual requirement for inference.
In all these questions, we of necessity turn to the user community. In fact we do not follow the usage of these technologies as much as we ought to. One outcome of the Riga summit is a set of public challenges, to be released soon, that will hopefully ameliorate this state of affairs.
The general feeling was that there is more going on on the data side than the AI side. The LOD movement proceeds and lightweight everything predominates, also for knowledge representation. There was some discussion about "pay as you go" integration. On the one hand, there is no up-front integration of information systems just for its own sake, so pay-as-you-go is the only kind that exists, system by system, as the need becomes sufficient. On the other hand, each such integration is a process with its own distinct steps and maintenance, planned within itself, and thus pre-paid, so to speak. We need more work with the data itself to better understand the matter. Open government data should offer a playground for this, and there will be a special challenge around it.
Schema.org and Microdata got their share of discussion. As we see it, it is good that search engines make their pre-competitive data open. This is better than, for example, Google wanting retailers to put their catalogs in Google Base. We do not care about the specific syntax in which data is embedded; we support them all. Microdata converts easily to triples, and if one wants to make a tabular extraction for use with relational tools, this too is simple enough. Applications will have to do their own entity resolution, but this is independent of data publication format.
All in all, the mood was positive. Mark Greaves noted in his closing remarks that there has been a 1000x increase in published GDB data over a few years. There is in fact a large quantity of technology for tackling almost any aspect of the LOD value chain, but people do not necessarily know about this nor is it easy to integrate. Still there would be great value in integration. Getting software to interoperate in a meaningful way is manual labor, so it might make sense to organize hackathons around this. While the STI Summit is for the senior people, there could be a parallel track of events for bringing the coders together to actually practice tool integration and interoperation.
In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well-defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues.
INSERT and DELETE as defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the DELETE FROM table WHERE key IN ( SELECT key1 FROM other_table ) expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table.
The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.
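As a hedged illustration of the delete set not being a subset of the read set, the following SPARQL 1.1 Update sketch (with a made-up ex: vocabulary) deletes a status triple for every subject flagged obsolete, whether or not any such status triple was read:

PREFIX ex: <http://example.org/>

# Delete the "active" status of every subject flagged obsolete.
# The WHERE clause never reads ex:status triples, yet the operation
# may delete them: the delete set is not a subset of the read set.
DELETE { ?s ex:status "active" }
WHERE  { ?s ex:obsolete true }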
Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject 1 or would have the three just inserted.
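Written out as SPARQL 1.1 Update, the two operations might look as follows; the ex:n1 ... ex:n5 IRIs are hypothetical stand-ins for the numbers above, and in reality the two operations arrive as separate, concurrent requests rather than as one request with two statements:

PREFIX ex: <http://example.org/id#>

# Client 1: delete all triples with subject 1
DELETE WHERE { ex:n1 ?p ?o } ;

# Client 2 (really a separate, concurrent request): insert three triples with subject 1
INSERT DATA {
  ex:n1 ex:n2 ex:n4 .
  ex:n1 ex:n2 ex:n3 .
  ex:n1 ex:n3 ex:n5 .
}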
Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE.
We would not get the desired end-state of either all the inserted triples or no triples with subject 1 if the read of the DELETE were not serializable.
Furthermore, if a DELETE template produced a triple that did not exist in the pre-image, the DELETE semantics still imply that this also does not exist in the after-image, which implies serializability.
Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.
The initial state is
a balance 10
b balance 10
We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9.
A relational database needs REPEATABLE READ isolation for this.
With RDF, txn1 reads that a has a balance of 10. At the same time, txn2 attempts to read the balance of a. txn2 waits because the read of txn1 is exclusive. txn1 proceeds and reads the balance of b. It then updates the balances of a and b.
All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF's schema-less model.
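As a sketch of what txn1's update of a's balance looks like in SPARQL 1.1 Update (with an assumed ex: vocabulary; the symmetric operation on b would be part of the same transaction), the delete-plus-insert is:

PREFIX ex: <http://example.org/>

# txn1: transfer 1 from a to b; a's balance goes from 10 to 9.
# There is no update-in-place, so the change is a delete plus an insert.
DELETE { ex:a ex:balance ?old }
INSERT { ex:a ex:balance ?new }
WHERE  {
  ex:a ex:balance ?old .
  BIND (?old - 1 AS ?new)
}

As the following paragraphs argue, the read of ?old in the WHERE clause must be serializable for the combined end state to come out right.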
We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in an RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could here argue that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case.
Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }.
To implement TPC-C, I would update the tables in descending order of cardinality, highest first. In this way, the locks with the highest likelihood of contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, but they do become fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.
This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.
RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system, and applying row-level locking to this does not give the semantics one expects of a relational row.
I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply to data that is read and written by a transaction; they do not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level, and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.
It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.
We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.
To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.
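For concreteness, the guarantee concerns a query of the following shape (hypothetical ex: vocabulary): run twice inside one serializable transaction, the count must not change, which is exactly what the gap lock before the first matching row enforces.

PREFIX ex: <http://example.org/>

# Count all balance triples; under SERIALIZABLE isolation this count is
# stable within the transaction even if others try to insert new balances.
SELECT (COUNT(*) AS ?n)
WHERE { ?account ex:balance ?amount }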
Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.
The issue of optimistic vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency, since updates happen immediately and transaction aborts occur whenever the writes of one transaction intersect the reads or writes of another, regardless of the order in which these were made.
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.
In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization.
Web scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.
Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.
Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed quads (i.e., DELETEd + INSERTed, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.
As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.
We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: what used to be REPEATABLE READ becomes SERIALIZABLE, and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.
Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.
Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?
There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.
At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.
If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.
There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).
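To give one hedged example of such a criterion expressed as a plain query rather than as OWL machinery (ex:spouse is an assumed property), a cardinality check could be run as:

PREFIX ex: <http://example.org/>

# Find resources that violate a "one spouse at a time" restriction.
SELECT ?person (COUNT(?spouse) AS ?spouses)
WHERE  { ?person ex:spouse ?spouse }
GROUP BY ?person
HAVING (COUNT(?spouse) > 1)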
If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.
There is of course the OWL side, where consistency is important but is defined in ways complex enough that it is again not a good fit for an RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.
Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.
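A minimal sketch of the seat-reservation case, with an assumed ex: vocabulary: the update succeeds only if the seat is still free, and it is the transactional isolation around the read in the WHERE clause that keeps two concurrent requests from both seeing the seat as free.

PREFIX ex: <http://example.org/>

# Reserve seat 42 for customer 7 only if it is currently free.
DELETE { ex:seat42 ex:status ex:free }
INSERT { ex:seat42 ex:status ex:reserved .
         ex:seat42 ex:reservedBy ex:customer7 }
WHERE  { ex:seat42 ex:status ex:free }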
Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.
Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.
As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.
As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.
This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.
As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.
Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.
With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.
If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, if these are under a certain size (normally 10000 quads), is guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there always was a possibility of having an INSERT or DELETE fail because of deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is the INSERT or DELETE of a small graph.
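The kind of small-graph transaction meant here is, for example, storing one crawled document's triples into a graph of their own; a hedged sketch with made-up IRIs:

PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# One crawl result, well under the 10000-quad threshold, stored as a unit.
INSERT DATA {
  GRAPH <http://example.org/crawl/doc123> {
    ex:doc123 ex:title   "Example page" ;
              ex:fetched "2011-03-20T12:00:00Z"^^xsd:dateTime .
  }
}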
Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.
In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.
Drill-down mode - For queries that have a product type as parameter, the test driver will invoke the query multiple times, each time with a random subtype of the product type of the previous invocation. The starting point of the drill-down is a random type from a settable level in the hierarchy. The rationale for the drill-down mode is that, depending on the parameter choice, there can be 1000x differences in query run time. Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy.
New metrics - The BI Power is the geometric mean of query run times scaled to queries per hour and multiplied by the scale factor, where 100 Mt is considered the unit scale. The BI Throughput is the arithmetic mean of the run times scaled to QPH and adjusted to scale as with the Power metric. These are analogous to the TPC-H Power and Throughput metrics.
The Power is defined as
(scale_factor / 284826) * 3600 / ((t1 * t2 * ... * tn) ^ (1 / n))
The Throughput is defined as
(scale_factor / 284826) * 3600 / ((t1 + t2 + ... + tn) / n)
The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt). We consider this "scale one." The reason for the multiplication is that scores at different scales should get similar numbers, otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.
We also show the percentage each query represents from the total time the test driver waits for responses.
Deadlock retry - When running update mixes, it is possible that a transaction gets aborted by a deadlock. We have made a retry logic for this.
Cluster mode - Cluster databases may have multiple interchangeable HTTP listeners. With this mode, one can specify multiple end-points so a multi-user workload can divide itself evenly over these.
Identifying matter - A version number was added to test driver output. Use of the new switches is also indicated in the test driver output.
SUT CPU - In comparing results it is crucial to differentiate between in-memory runs and IO-bound runs. To make this easier, we have added an option to report server CPU times over the timed portion (excluding warm-ups). A pluggable script determines the CPU times for the system; thus clusters can be handled, too. The time is given as a sum of the time the server processes have aged during the run and as a percentage of the wall-clock time.
These changes will soon be available as a diff and as a source tree. This version is labeled BSBM Test Driver 1.1-opl; the -opl signifies OpenLink additions.
We invite FU Berlin to include these enhancements into their Source Forge repository of the BSBM test driver. There is more precise documentation of these options in the README file in the above distribution.
The next planned upgrade of the test driver concerns adding support for "RDF-H", the RDF adaptation of the industry standard TPC-H decision support benchmark for RDBMS.
Our intent here is to look at whether the metric works, and to see what results will look like in general. We are as much testing the benchmark as we are testing the system-under-test (SUT). The results shown here will likely not be comparable with future ones because we will most likely change the composition of the workload since it seems a bit out of balance. Anyway, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so the interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted.
Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.
Below we show samples of test driver output; the whole output is downloadable.
100 Mt Single User
bsbm/testdriver -runs 1 -w 0 -idir /bs/1 -drill \
-ucf bsbm/usecases/businessIntelligence/sparql.txt \
-dg http://bsbm.org http://localhost:8604/sparql
0: 43348.14ms, total: 43440ms
Scale factor: 284826
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Seed: 808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime: 43.3481s / 43.3481s
Elapsed runtime: 43.348 seconds
QMpH: 83.049 query mixes per hour
CQET: 43.348 seconds average runtime of query mix
CQET (geom.): 43.348 seconds geometric mean runtime of query mix
AQET (geom.): 0.492 seconds geometric mean runtime of query
Throughput: 1494.874 BSBM-BI throughput: qph*scale
BI Power: 7309.820 BSBM-BI Power: qph*scale (geom)
100 Mt 8 User
Thread 6: query mix 3: 195793.09ms, total: 196086.18ms
Thread 8: query mix 0: 197843.84ms, total: 198010.50ms
Thread 7: query mix 4: 201806.28ms, total: 201996.26ms
Thread 2: query mix 5: 221983.93ms, total: 222105.96ms
Thread 4: query mix 7: 225127.55ms, total: 225317.49ms
Thread 3: query mix 6: 225860.49ms, total: 226050.17ms
Thread 5: query mix 2: 230884.93ms, total: 231067.61ms
Thread 1: query mix 1: 237836.61ms, total: 237959.11ms
Benchmark run completed in 237.985427s
Scale factor: 284826
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Number of clients: 8
Seed: 808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime: 195.7931s / 237.8366s
Total runtime (sum): 1737.137 seconds
Elapsed runtime: 1737.137 seconds
QMpH: 121.016 query mixes per hour
CQET: 217.142 seconds average runtime of query mix
CQET (geom.): 216.603 seconds geometric mean runtime of query mix
AQET (geom.): 2.156 seconds geometric mean runtime of query
Throughput: 2178.285 BSBM-BI throughput: qph*scale
BI Power: 1669.745 BSBM-BI Power: qph*scale (geom)
1000 Mt Single User
0: 608707.03ms, total: 608768ms
Scale factor: 2848260
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Seed: 808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime: 608.7070s / 608.7070s
Elapsed runtime: 608.707 seconds
QMpH: 5.914 query mixes per hour
CQET: 608.707 seconds average runtime of query mix
CQET (geom.): 608.707 seconds geometric mean runtime of query mix
AQET (geom.): 5.167 seconds geometric mean runtime of query
Throughput: 1064.552 BSBM-BI throughput: qph*scale
BI Power: 6967.325 BSBM-BI Power: qph*scale (geom)
1000 Mt 8 User
bsbm/testdriver -runs 8 -mt 8 -w 0 -idir /bs/10 -drill \
-ucf bsbm/usecases/businessIntelligence/sparql.txt \
-dg http://bsbm.org http://localhost:8604/sparql
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms
Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms
Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms
Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms
Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms
Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms
Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms
Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms
Benchmark run completed in 2889.302566s
Scale factor: 2848260
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Number of clients: 8
Seed: 808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime: 2211.2753s / 2889.1992s
Total runtime (sum): 20481.895 seconds
Elapsed runtime: 20481.895 seconds
QMpH: 9.968 query mixes per hour
CQET: 2560.237 seconds average runtime of query mix
CQET (geom.): 2544.284 seconds geometric mean runtime of query mix
AQET (geom.): 13.556 seconds geometric mean runtime of query
Throughput: 1794.205 BSBM-BI throughput: qph*scale
BI Power: 2655.678 BSBM-BI Power: qph*scale (geom)
Metrics for Query: 1
Count: 8 times executed in whole run
Time share 2.120884% of total execution time
AQET: 54.299656 seconds (arithmetic mean)
AQET(geom.): 34.607302 seconds (geometric mean)
QPS: 0.13 Queries per second
minQET/maxQET: 11.71547600s / 148.65379700s
Metrics for Query: 2
Count: 8 times executed in whole run
Time share 0.207382% of total execution time
AQET: 5.309462 seconds (arithmetic mean)
AQET(geom.): 2.737696 seconds (geometric mean)
QPS: 1.34 Queries per second
minQET/maxQET: 0.78729800s / 25.80948200s
Metrics for Query: 3
Count: 8 times executed in whole run
Time share 17.650472% of total execution time
AQET: 451.893890 seconds (arithmetic mean)
AQET(geom.): 410.481088 seconds (geometric mean)
QPS: 0.02 Queries per second
minQET/maxQET: 171.07262500s / 721.72939200s
Metrics for Query: 5
Count: 32 times executed in whole run
Time share 6.196565% of total execution time
AQET: 39.661685 seconds (arithmetic mean)
AQET(geom.): 6.849882 seconds (geometric mean)
QPS: 0.18 Queries per second
minQET/maxQET: 0.15696500s / 189.00906200s
Metrics for Query: 6
Count: 8 times executed in whole run
Time share 0.119916% of total execution time
AQET: 3.070136 seconds (arithmetic mean)
AQET(geom.): 2.056059 seconds (geometric mean)
QPS: 2.31 Queries per second
minQET/maxQET: 0.41524400s / 7.55655300s
Metrics for Query: 7
Count: 40 times executed in whole run
Time share 1.577963% of total execution time
AQET: 8.079921 seconds (arithmetic mean)
AQET(geom.): 1.342079 seconds (geometric mean)
QPS: 0.88 Queries per second
minQET/maxQET: 0.02205800s / 40.27761500s
Metrics for Query: 8
Count: 40 times executed in whole run
Time share 72.126818% of total execution time
AQET: 369.323481 seconds (arithmetic mean)
AQET(geom.): 114.431863 seconds (geometric mean)
QPS: 0.02 Queries per second
minQET/maxQET: 5.94377300s / 1824.57867400s
The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization.
With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit. If the single-user run was at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact. Core multi-threading does not seem to hurt, at the very least. Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good.
For reference, we show a run with Virtuoso 6 at 100Mt.
0: 424754.40ms, total: 424829ms
Scale factor: 284826
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Seed: 808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime: 424.7544s / 424.7544s
Elapsed runtime: 424.754 seconds
QMpH: 8.475 query mixes per hour
CQET: 424.754 seconds average runtime of query mix
CQET (geom.): 424.754 seconds geometric mean runtime of query mix
AQET (geom.): 1.097 seconds geometric mean runtime of query
Throughput: 152.559 BSBM-BI throughput: qph*scale
BI Power: 3281.150 BSBM-BI Power: qph*scale (geom)
and 8 user
Thread 5: query mix 3: 616997.86ms, total: 617042.83ms
Thread 7: query mix 4: 625522.18ms, total: 625559.09ms
Thread 3: query mix 7: 626247.62ms, total: 626304.96ms
Thread 1: query mix 0: 629675.17ms, total: 629724.98ms
Thread 4: query mix 6: 667633.36ms, total: 667670.07ms
Thread 8: query mix 2: 674206.07ms, total: 674256.72ms
Thread 6: query mix 5: 695020.21ms, total: 695052.29ms
Thread 2: query mix 1: 701824.67ms, total: 701864.91ms
Benchmark run completed in 701.909341s
Scale factor: 284826
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Number of clients: 8
Seed: 808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime: 616.9979s / 701.8247s
Total runtime (sum): 5237.127 seconds
Elapsed runtime: 5237.127 seconds
QMpH: 41.031 query mixes per hour
CQET: 654.641 seconds average runtime of query mix
CQET (geom.): 653.873 seconds geometric mean runtime of query mix
AQET (geom.): 2.557 seconds geometric mean runtime of query
Throughput: 738.557 BSBM-BI throughput: qph*scale
BI Power: 1408.133 BSBM-BI Power: qph*scale (geom)
Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput.
100 Mt: sqrt(7771 * 2178) = 4114
1000 Mt: sqrt(6967 * 1794) = 3535
Scaling seems to work; the results are in the same general ballpark. The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable.
Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.
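A hedged sketch of what such a reworked Q2 could look like, with vocabulary and instance IRIs after the usual BSBM namespaces (the leaf product type below is a placeholder parameter, not an actual benchmark identifier): take a leaf category, find its 20 most common features, and count how many of the category's products carry each.

PREFIX bsbm:      <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# bsbm-inst:ProductType123 stands for the leaf product type parameter.
SELECT ?feature (COUNT(?product) AS ?productsWithFeature)
WHERE {
  ?product rdf:type            bsbm-inst:ProductType123 ;
           bsbm:productFeature ?feature .
}
GROUP BY ?feature
ORDER BY DESC(?productsWithFeature)
LIMIT 20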
Also there should be more queries.
At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people.
Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an IN into a JOIN with the IN subquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimized HASH JOIN, which besides must be partitioned at the larger scales.
Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.
In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.
BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a JOIN and the cardinality of a GROUP BY; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get reasonable JOIN order, since we always reorder the query regardless of the source formulation.
BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different JOIN orders for different parameter values. I have not looked into whether this really makes a difference, though.
There are places in BSBM-BI where using a HASH JOIN makes sense. We do not use HASH JOINs with RDF because there is an index for everything, and making a HASH JOIN in the wrong place can have a large up-front cost, so one is more robust against cost-model errors if one does not do HASH JOINs. This said, a HASH JOIN in the right place is a lot better than an index lookup. With TPC-H Q13, our best HASH JOIN is over 2x better than the best INDEX-based JOIN, both being well tuned. For questions like "count the hairballs made in Germany reviewed by Japanese Hello Kitty fans," where the two ends of a JOIN path are fairly selective, doing the other end as a HASH JOIN is good. This can, if the JOIN is always cardinality-reducing, even be merged inside an INDEX lookup. We have such capabilities, since we have been for a while gearing up for the relational races, but are not using any of these with BSBM-BI, although they would be useful.
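As a rough illustration of that "selective at both ends" shape, here is a sketch in SPARQL with property names loosely modeled on the BSBM vocabulary (the exact IRIs are assumptions, not a query from the benchmark): products from German producers, counted by the reviews they received from Japanese reviewers.

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rev:  <http://purl.org/stuff/rev#>
PREFIX iso:  <http://downlode.org/rdf/iso-3166/countries#>

# Both ends (German producers, Japanese reviewers) are selective; a hash
# table built over one end and probed from the other beats index lookups.
SELECT ?product (COUNT(?review) AS ?reviews)
WHERE {
  ?producer bsbm:country   iso:DE .
  ?product  bsbm:producer  ?producer .
  ?review   bsbm:reviewFor ?product ;
            rev:reviewer   ?reviewer .
  ?reviewer bsbm:country   iso:JP .
}
GROUP BY ?product
ORDER BY DESC(?reviews)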
Let us see the profile for a single user 100 Mt run.
The database activity summary is --
select db_activity (0, 'http');
161.3M rnd 210.2M seq 0 same seg 104.5M same pg 45.08M same par 0 disk 0 spec disk 0B / 0 messages 2.393K fork
See the post "What Does BSBM Explore Measure" for an explanation of the numbers. We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get --
172.4M rnd 220.8M seq 0 same seg 149.6M same pg 10.99M same par 21 disk 861 spec disk 0B / 0 messages 754 fork
The throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to correctly tune the self-adjust logic, and we have again clear gains.
Let us now go back to the first run with vector size 10000.
The top of the CPU oprofile is as follows:
722309 15.4507 cmpf_iri64n_iri64n
434791 9.3005 cmpf_iri64n_iri64n_anyn_iri64n
294712 6.3041 itc_next_set
273488 5.8501 itc_vec_split_search
203970 4.3631 itc_dive_transit
199687 4.2714 itc_page_rcf_search
181614 3.8848 dc_itc_append_any
173043 3.7015 itc_bm_vec_row_check
146727 3.1386 cmpf_int64n
128224 2.7428 itc_vec_row_check
113515 2.4282 dk_alloc
97296 2.0812 page_wait_access
62523 1.3374 qst_vec_get_int64
59014 1.2623 itc_next_set_parent
53589 1.1463 sslr_qst_get
48003 1.0268 ds_add
46641 0.9977 dk_free_tree
44551 0.9530 kc_var_col
43650 0.9337 page_col_cmp_1
35297 0.7550 cmpf_iri64n_iri64n_anyn_gt_lt
34589 0.7399 dv_compare
25864 0.5532 cmpf_iri64n_anyn_iri64n_iri64n_lte
23088 0.4939 dk_free
The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with P and S given. The one after that is with all parts given, corresponding to an existence test. The existence tests could probably be converted to HASH JOIN lookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile, GROUP BY seems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns.
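The kind of query meant here might look like the following sketch (names again loosely after BSBM and assumed, not taken from the benchmark text): one group per product means a very large number of groups, and if the offers arrive ordered by product, the aggregation can be streamed.

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>

# One group per product: millions of groups at larger scales.
SELECT ?product (COUNT(?offer) AS ?offers) (AVG(?price) AS ?avgPrice)
WHERE {
  ?offer bsbm:product ?product ;
         bsbm:price   ?price .
}
GROUP BY ?product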
A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM data set. Some code sections in the queries with conditional execution and costly tests inside ANDs and ORs would be good. TPC-H has such in Q21 and Q19. An OR with existences, where there would be gain from good guesses of a subquery's selectivity, would be appropriate. Also, there should be conditional expressions somewhere with a lot of data, like the CASE-WHEN in TPC-H Q12.
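In SPARQL, the Q12-style conditional aggregate would use IF() inside the aggregates; a hedged sketch, again with assumed BSBM-style property names:

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>

# Split review counts into high and low ratings in a single pass,
# the SPARQL analog of the CASE-WHEN aggregates in TPC-H Q12.
SELECT ?product
       (SUM(IF(?rating >= 4, 1, 0)) AS ?highRatings)
       (SUM(IF(?rating <  4, 1, 0)) AS ?lowRatings)
WHERE {
  ?review bsbm:reviewFor ?product ;
          bsbm:rating1   ?rating .
}
GROUP BY ?product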
We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from HASH JOIN, both small and large. There should be such places in the workload already, so this is a matter of just playing a bit more.
This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.
We will publish results according to the definitions given here and recommend that any interested parties do likewise. The rationales are given in the text.
We have removed Q4 from the mix because it is quadratic to the scale factor. The other queries are roughly n * log(n).
All queries that take a product type as parameter are run in flights of several query invocations where the product type goes from broader to more specific. The initial product type specifies either the root product type or an immediate subtype of this, and the last in the drill-down is a leaf type.
The rationale for this is that the choice of product type may make several orders of magnitude difference in the run time of a query. In order to make consecutive query mixes roughly comparable in execution time, all mixes should have a predictable number of query invocations with product types of each level.
In the BI mix, when running multiple concurrent clients, each query mix is submitted in a random order. Queries which do drill-downs always have the steps of the drill-down as consecutive in the session, but the query templates are permuted. This is done so as to make it less likely that two concurrent queries access exactly the same data. In this way, scans cannot be trivially shared between queries -- but there are still opportunities for reuse of results and adapting execution to working set, e.g., starting with what is in memory.
We use a TPC-H-like metric. This metric consists of a single-user part and a multi-user part, called respectively Power and Throughput. The Power metric is a geometric mean of query run-time. The Throughput is the total run-time divided by the number of queries completed. After taking the mean, the time is converted into queries-per-hour. This time is then multiplied by the scale factor divided by the scale factor for 100 Mt. In other words, we consider the 100 Mt data set as the unit scale.
The Power is defined as
( scale_factor / 284826 ) * 3600 / ( ( t1 * t2 * ... * tn ) ^ ( 1 / n ) )
The Throughput is defined as
( scale_factor / 284826 ) * 3600 / ( ( t1 + t2 + ... + tn ) / n )
The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt). We consider this scale "one". The reason for the multiplication is that scores at different scales should get similar numbers; otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.
The Composite metric is the geometric mean of the Power and Throughput metrics. A complete report shows both Power and Throughput metrics, as well as individual query times for all queries. The rationale for using a geometric mean is to give an equal importance to long and short queries: halving the execution time of either a long query or a short query will have the same effect on the metric. This is good for encouraging research into all aspects of query processing. On the other hand, real-life users are more interested in halving the time of queries that take one hour than of queries that take one second; therefore, the Throughput metric uses plain run times, which weight the long queries more heavily.
Taking the geometric mean of the two metrics gives more weight to the lower of the two than an arithmetic mean, hence we pay more attention to the worse of the two.
Single-user and multi-user metrics are separate because of the relative importance of intra-query parallelization in BI workloads: There may not be large numbers of concurrent users, yet queries are still complex, and it is important to have maximum parallelization. Therefore the metric rewards single-user performance.
In the next post we will look at the use of this metric and the actual content of BSBM BI.
A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at TPC-H and similar workloads, and therefore there is no need for RDF to go there. It is, as it were, somebody else's problem; besides, it is a solved one.
On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item.
BSBM seems to have been adopted as the de facto RDF benchmark, as there indeed is almost nothing else. But we should not lose sight of the fact that this is in fact a relational schema and workload that has just been straightforwardly transformed to RDF. BSBM was made, after all, in part for measuring RDB-to-RDF mapping. Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be. TPC-H is, however, a bit more difficult, if also a better-thought-out benchmark, than the BSBM BI mix proposal. But I do not expect an RDF audience to have any enthusiasm for this, as it is by now a very tough race, and besides one in which RDB and SQL will keep some advantage. However, using this as a validation test is meaningful, as there exists a validation dataset and queries that we already have RDF-ized. We could publish these and call this "RDF-H".
In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark. The points are in part based on discussions with Peter Boncz of CWI.
The Social Network Intelligence Benchmark (SNIB) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM. In LOD2, CWI is presently working on this.
The data includes DBpedia as a base component used for providing conversation topics, information about geographical locales of simulated users, etc. DBpedia is not very large, around 200M-300M triples, but it is diverse enough.
The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere.
The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth. The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from DBpedia articles. Since there is such correlation, NLP techniques like entity and relationship extraction can be used with the data even though this is not the primary thrust of SNIB.
There is variation in frequency of online interaction, and this interaction consists of sessions. For example, one could analyze user behavior per time of day for online ad placement.
The data probably should include propagating memes, fashions, and trends that travel on the social network. With this, one could query about their origin and speed of propagation.
There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries.
Ragged data with half-filled profiles and misspelled identifiers like person and place names are a natural part of the social web use case. The data generator should take this into account.
Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted.
The dataset should be predictably scalable. For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale.
For example some queries are logarithmic to data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated). Making a single metric from such parts may not be meaningful. Therefore, SNIB might be structured into different workloads.
The first would be an online mix with typically short lookups and updates, around O(log(n)).
The Business Intelligence mix would be composed of queries around O(n log(n)). Even so, with real data, the choice of parameters will produce dramatic changes in query run-time. Therefore a run should be specified to have a predictable distribution of "hard" and "easy" parameter choices. In the BSBM BI mix modification, I did this by defining some queries to be drill-downs from a more general to a more specific level of a hierarchy. This could be done here too in some cases; other cases would have to be defined with buckets of values.
Both the real world and LOD2 are largely concerned with data integration. The SNIB workload can have aspects of this, for example, in resolving duplicate identities. These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data.
One characteristic of these operations is the production of sometimes large intermediate results that need to be materialized. Doing them in practice requires procedural control. Further, running network analytics (e.g., PageRank, centrality, etc.) involves aggregation of intermediate results that is not well expressible in a query language. Some basic graph operations like shortest path are expressible, but not in unextended SPARQL 1.1, as they would for example involve returning paths, which are explicitly excluded from the spec.
These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload.
We find that such a workload will have procedural sections either in application code or stored procedures. Map-reduce is sometimes used for scaling these. As one would expect, many cluster databases have their own version of these control structures. Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations. We might here touch base with the LarKC map-reduce work to see if it could be applied to SNIB workloads.
We see a three-level structure emerging. There is an Online mix which is a bit like the BSBM Explore mix, and an Analytics mix which is on the same order of complexity as TPC-H. These may have a more-or-less fixed query formulation and test driver. Beyond these, yet working on the same data, we have a set of Predefined Tasks which the test sponsor may implement in a manner of their choice.
We would finally get to the "raging conflict" between the "declarativists" and the "map reductionists." Last year's VLDB had a lot of map-reduce papers. I know of comparisons between Vertica and map reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data.
We might even interest some of the cluster RDBMS players (Teradata, Vertica, Greenplum, Oracle Exadata, ParAccel, and/or Aster Data, to name a few) in running this workload using their map-reduce analogs.
We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth.
There is not, nor ought there to be, a sheltered, RDF-only enclave. RDF will have to justify itself in a world of alternatives.
This must be reflected in our benchmark development. Relational BI is not irrelevant; in fact, it is what everybody does, and RDF cannot afford to be a total failure at it, even if it is not RDF's claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB.
Here I will talk about how this could be organized in a way that is tractable and takes vendor and end-user interests into account. These are my views on the subject and do not represent a consensus of the LOD2 members, but they have been discussed in the consortium.
My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water! But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking. Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating le chef d'oeuvre culinaire ("the culinary masterpiece") create it. Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values. Indeed, an intimate knowledge de la vie secrete du canard ("the secret life of duck") is required in order to liberate the aroma that it might take flight and soar. In the previous, I have shed some light on how we prepare le canard, and if le canard be such then la dinde (turkey) might in some ways be analogous; who is to say?
In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice. In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained. This is the TPC (Transaction Processing Performance Council) model.
Another culture of doing benchmarks is the periodic challenge model used in TREC, the Billion Triples Challenge, the Semantic Search Challenge and others. In this model, vendors prepare the benchmark submission and agree to joint publication.
A third party performing benchmarks by itself is uncommon in databases. Licenses even often explicitly prohibit this, for understandable reasons.
The LOD2 project has an outreach activity called Publink where we offer to help owners of data to publish it as Linked Data. Similarly, since FP7 projects are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing RDF store benchmarks.
One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results. The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison.
Isn't this the very truth? Let the chefs mix their own spices.
This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import.
In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question. Increasing the scale remains a stated objective. LOD2 even promised to run things with a trillion triples in another 3 years.
Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off? Or would this on the contrary combine strict Justice with edifying Charity? Who indeed is in greater need of the light of objective evaluation than the vendor whose very nature makes a being of bias and prejudice?
Even better, CWI, with its stellar database pedigree, agreed in principle to audit RDF benchmarks in LOD2.
In this way one could get a stamp of approval for one's results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs. On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here. I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes.
We could even do this unilaterally -- just publish Virtuoso results according to a predefined reporting and verification format. If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings. This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble. It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason.
Then there is the matter of the BSBM Business Intelligence (BI) mix. At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer. This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions. Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around. The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well. There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it. If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner.
(I will talk about the BI mix in more detail in part 13 and part 14 of this series.)
Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit.
Of course, this could be done even before then, but the content of the mix might not be settled. We likely need to check it on a few implementations first.
For equipment, people can use their own, or LOD2 partners might on a case-by-case basis make some equipment available for running on the same hardware on which say the Virtuoso results were obtained. For example, FU Berlin could give people a login to get their recently published results fixed. Now this might or might not happen, so I will not hold my breath waiting for this but instead close with a proposal.
As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment. They can set up and tune their systems, and perform the runs. We will just watch. As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data. Like this, both parties get to see the others' technology with proper tuning and installation. What, if anything, is reported about this activity is up to the owner of the technology being tested. We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these. This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user. If you wish to take advantage of this offer, you may contact Hugh Williams at OpenLink Software, and we will see how this can be arranged in practice.
The next post will talk about the actual content of benchmarks. The milestone after this will be when we publish the measurement and reporting protocols.
At first sight, the BSBM Explore mix appears very cluster-unfriendly, as it contains short queries that access data at random. There is every opportunity for latency and few opportunities for parallelism.
For this reason we had not even run the BSBM mix with Virtuoso Cluster. We were not surprised to learn that Garlik hadn't run BSBM either. We have understood from Systap that their Bigdata BSBM experiments were on a single-process configuration.
But the 4Store results in the recent Berlin report were with a distributed setup, as 4Store always runs a multiprocess configuration, even on a single server, so it seemed interesting to compare Virtuoso Cluster with Virtuoso Single on this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable.
The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its HTTP and SQL listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process which is coordinating the query. The interconnect is Unix domain sockets since all 8 processes are on the same box.
6 Cluster - Load Rates and Times

| Scale | Rate (quads per second) | Load time (seconds) | Checkpoint time (seconds) |
|---|---|---|---|
| 100 Mt | 119,204 | 749 | 89 |
| 200 Mt | 121,607 | 1486 | 157 |
| 1000 Mt | 102,694 | 8737 | 979 |

6 Single - Load Rates and Times

| Scale | Rate (quads per second) | Load time (seconds) | Checkpoint time (seconds) |
|---|---|---|---|
| 100 Mt | 74,713 | 1192 | 145 |
The 6 Cluster load rates and times are systematically better than those of 6 Single. They are also not bad compared to the 7 Single vectored load rate of 220 Kt/s or so. We note that loading is a cluster-friendly operation, going at a steady 1400+% CPU utilization with an aggregate message throughput of 40 MB/s. 7 Single is faster because of vectoring at the index level, not because the cluster was hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box.
Throughput is as follows:
6 Cluster - Throughput (QMpH, query mixes per hour)

| Scale | Single User | 16 User |
|---|---|---|
| 100 Mt | 7318 | 43120 |
| 200 Mt | 6222 | 29981 |
| 1000 Mt | 2526 | 11156 |

6 Single - Throughput (QMpH, query mixes per hour)

| Scale | Single User | 16 User |
|---|---|---|
| 100 Mt | 7641 | 29433 |
| 200 Mt | 6017 | 13335 |
| 1000 Mt | 1770 | 2487 |
Below is a snapshot of status during the 6 Cluster 100 Mt run.
Cluster 8 nodes, 15 s.
25784 m/s 25682 KB/s 1160% cpu 0% read 740% clw threads 18r 0w 10i buffers 1133459 12 d 4 w 0 pfs
cl 1: 10851 m/s 3911 KB/s 597% cpu 0% read 668% clw threads 17r 0w 10i buffers 143992 4 d 0 w 0 pfs
cl 2: 2194 m/s 7959 KB/s 107% cpu 0% read 9% clw threads 1r 0w 0i buffers 143616 3 d 2 w 0 pfs
cl 3: 2186 m/s 7818 KB/s 107% cpu 0% read 9% clw threads 0r 0w 0i buffers 140787 0 d 0 w 0 pfs
cl 4: 2174 m/s 2804 KB/s 77% cpu 0% read 10% clw threads 0r 0w 0i buffers 140654 0 d 2 w 0 pfs
cl 5: 2127 m/s 1612 KB/s 71% cpu 0% read 9% clw threads 0r 0w 0i buffers 140949 1 d 0 w 0 pfs
cl 6: 2060 m/s 544 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141295 2 d 0 w 0 pfs
cl 7: 2072 m/s 517 KB/s 65% cpu 0% read 11% clw threads 0r 0w 0i buffers 141111 1 d 0 w 0 pfs
cl 8: 2105 m/s 522 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141055 1 d 0 w 0 pfs
The main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes.
We note that CPU utilization is highly uneven and messages are short, about 1K on the average, compared to about 100K during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. We changed the test driver to round-robin requests between multiple end points. The work does then get evenly divided, but the speed is not affected. Also, this does not improve the message sizes since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, the round-robin would be essential for CPU and especially for interconnect throughput.
Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu.
This is a lot better than the figures with 6 Single, due to lower contention on the index tree, as discussed in A Benchmarking Story. Also Single User throughput on 6 Cluster outperforms 6 Single, due to the natural parallelism of doing the Q5 joins in parallel in each partition. The larger the scale, the more weight this has in the metric. We see this also in the average message size, i.e., the KB/s throughput is almost double while the messages/s is a bit under a third.
The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 in 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of the 6 Single. This is as one might expect; longer queries are favored, and single row lookups are penalized.
Looking further at the 6 Cluster status, we see the cluster wait (clw) at 740%. With 16 Users, i.e., roughly 16 worker threads, 740% out of a possible 1600% means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution of work between partitions; a low figure means even distribution. This is as expected, since many queries are concerned with just one S and its related objects.
We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes.
A transaction benchmark ought to have something to say about this. The SPARUL (also known as SPARQL/Update) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability.
We begin by running Virtuoso 7 Single, with Single User and 16 Users, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is the default, meaning SERIALIZABLE isolation between INSERTs and DELETEs, and READ COMMITTED isolation between READs and any UPDATE transaction. (Figures for Virtuoso 6 will also be presented here in the near future, as it is the currently shipping production version.)
Virtuoso 7 Single, Full ACID (QMpH, query mixes per hour)

| Scale | Single User | 16 User |
|---|---|---|
| 100 Mt | 9,969 | 65,537 |
| 200 Mt | 8,646 | 40,527 |
| 1000 Mt | 5,512 | 17,293 |

Virtuoso 6 Cluster, Full ACID (QMpH, query mixes per hour)

| Scale | Single User | 16 User |
|---|---|---|
| 100 Mt | 5604.520 | 34079.019 |
| 1000 Mt | 2866.616 | 10028.325 |

Virtuoso 6 Single, Full ACID (QMpH, query mixes per hour)

| Scale | Single User | 16 User |
|---|---|---|
| 100 Mt | 7,152 | 21,065 |
| 200 Mt | 5,862 | 16,895 |
| 1000 Mt | 1,542 | 4,548 |
Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm cache; see previous post on read-ahead for details. All runs do 1000 Explore and Update mixes. The initial database is in the state following the Explore only runs.
The results are in line with the Explore results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability.
In the following we will look at transactions and at how the definition of the workload and reporting could be made complete.
Full ACID means serializable semantics for concurrent insert and delete of the same quad. Non-transactional means that on concurrent insert and delete of overlapping sets of quads, the result is undefined. Further, if one logged such "transactions," the replay would be serialized even though the initial execution was not, further confusing the issue. Considering the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail: if the quad already exists, the insert does nothing; if it does not, it is created. The same applies to deletes alone. If a delete and an insert overlap, serialization would be needed, but the semantics implicit in the use case make this improbable.
Read-only transactions (i.e., the Explore mix in the Explore and Update scenario) will be run as READ COMMITTED. These do not see uncommitted data and never block on lock waits. The reads may not be repeatable.
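For reference, the isolation level can also be set explicitly per session in SQL. A minimal sketch, assuming Virtuoso's `SET ISOLATION` statement with its usual values ('uncommitted', 'committed', 'repeatable', 'serializable'):

```sql
-- Hedged sketch: run a session's read-only Explore queries at READ COMMITTED.
SET ISOLATION = 'committed';

-- A session doing the Update mix could instead ask for the strict setting:
-- SET ISOLATION = 'serializable';
```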
Our first port of call is to determine the cost of ACID. We run 1000 mixes of Explore and Update at 1000 Mt. The throughput is 19,214 QMpH after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale.
We look at the pertinent statistics:
SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC;
KEY_TABLE INDEX_NAME LOCKS WAITS WAIT_PCT DEADLOCKS LOCK_ESC WAIT_MSECS
=============== ============= ====== ===== ======== ========= ======== ==========
DB.DBA.RDF_QUAD RDF_QUAD_POGS 179205 934 0 0 0 35164
DB.DBA.RDF_IRI RDF_IRI 20752 217 1 0 0 16445
DB.DBA.RDF_QUAD RDF_QUAD_SP 9244 3 0 0 0 235
We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run was 187 seconds, real time. The lock wait time is not real time since this is the total elapsed wait time summed over all threads. The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions.
We note that we do not get deadlocks since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case since the inserts are a few hundred triples at the maximum. The waits concentrate on POGS, because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row level locking also for this index. This is to be seen. The column store would otherwise tend to have higher cost per random insert.
Considering these results it does not seem crucial to "drop ACID," though doing so would save some time. We will now run measurements for all scales with 16 Users and ACID.
Let us now see what the benchmark writes:
SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC;
KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS
=========================== ============================ ========= ======= ======== ======= =========
DB.DBA.RDF_QUAD RDF_QUAD_POGS 763846891 237436 0 58040 228606
DB.DBA.RDF_QUAD RDF_QUAD 213282706 1991836 0 30226 1940280
DB.DBA.RDF_OBJ RO_VAL 15474 17837 115 13438 17431
DB.DBA.RO_START RO_START 10573 11195 105 10228 11227
DB.DBA.RDF_IRI RDF_IRI 61902 125711 203 7705 121300
DB.DBA.RDF_OBJ RDF_OBJ 23809053 3205963 13 636 3072517
DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 3237687 504486 15 340 488797
DB.DBA.RDF_QUAD RDF_QUAD_SP 89995 70446 78 99 68340
DB.DBA.RDF_QUAD RDF_QUAD_OP 19440 47541 244 66 45583
DB.DBA.VTLOG_DB_DBA_RDF_OBJ VTLOG_DB_DBA_RDF_OBJ 3014 1 0 11 11
DB.DBA.RDF_QUAD RDF_QUAD_GS 1261 801 63 10 751
DB.DBA.RDF_PREFIX RDF_PREFIX 14 168 1120 1 153
DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1807 200 11 1 200
The most dirty pages are on the POGS index, which is reasonable; the values are spread out at random. After this we have the PSOG index, likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next, led by the index from the leading string or hash of the literal to its id, as one would expect, again because the values are distributed at random. After this come the IRIs. The distribution of updates is generally as one would expect.
* * *
Going back to BSBM, at least the following aspects of the benchmark have to be further specified:
Disclosure of ACID properties. If the benchmark required full ACID, many would not run it at all. Besides, full ACID is not necessarily an absolute requirement given the hypothetical usage scenario of the benchmark. However, when publishing numbers, the guarantees that go with them must be made explicit. This includes logging, checkpoint frequency or equivalent, etc.
Steady state. The working set of the Update mix is different from that of the Explore mixes. This touches more indices than Explore. The Explore warm-up is in part good but does not represent steady state.
Checkpoint and sustained throughput. Benchmarks involving updates generally have rules for checkpointing the state and for sustained throughput. Specifically, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed over a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no fewer than 3 completed checkpoints in the interval and at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used, it needs to be described.
Memory and warm-up issues. We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also, the test driver should allow running updates in timed and non-timed (warm-up) mode.
With an update benchmark, many more things need to be defined, and the set-up becomes more system specific, than with a read-only workload. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user.
We will here look at statistics from the running database for BSBM at different scales. Finally, we look at CPU profiles.
But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have:
SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;
KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS
================= ============================ ========== ======= ======== ======= =========
DB.DBA.RDF_OBJ RDF_OBJ 114105938 3302150 2 0 3171275
DB.DBA.RDF_QUAD RDF_QUAD 977426773 2041156 0 0 1970712
DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 8250414 509239 6 15 491631
DB.DBA.RDF_QUAD RDF_QUAD_POGS 3677233812 183860 0 0 175386
DB.DBA.RDF_IRI RDF_IRI 32 99710 302151 5 95353
DB.DBA.RDF_QUAD RDF_QUAD_OP 30597 51593 168 0 48941
DB.DBA.RDF_QUAD RDF_QUAD_SP 265474 47210 17 0 46078
DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 6020 212 3 0 212
DB.DBA.RDF_PREFIX RDF_PREFIX 0 167 16700 0 157
The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table data structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the PSOG index for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. The index from object value to subject is used the most, but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison.
Now let us reset the counts and see what the steady state I/O profile is.
SELECT key_stat (key_table, name_part (key_name, 2), 'reset') FROM sys_keys WHERE key_migrate_to IS NULL;
SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;
KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS
================= ============================ ========== ======= ======== ======= =========
DB.DBA.RDF_OBJ RDF_OBJ 30155789 79659 0 0 3191391
DB.DBA.RDF_QUAD RDF_QUAD 259008064 8904 0 0 1948707
DB.DBA.RDF_QUAD RDF_QUAD_SP 68002 7730 11 0 53360
DB.DBA.RDF_IRI RDF_IRI 12 5415 41653 6 98804
DB.DBA.RDF_QUAD RDF_QUAD_POGS 975147136 1597 0 0 173459
DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 2213525 1286 0 17 485093
DB.DBA.RDF_QUAD RDF_QUAD_OP 7999 904 11 0 48568
DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1494 1 0 0 213
Literal strings dominate. The SP index is used only for situations where the P is not specified, i.e., the DESCRIBE query; based on this, I/O seems to be attributable mostly to DESCRIBE. The first RDF_IRI row represents translations from string to IRI id; the second represents translations from IRI id to string. The touch count for the first RDF_IRI is not properly recorded, hence the miss % is out of line. We see SP missing the cache the most, since its use is infrequent in the mix.
We will next look at query processing statistics. For this we introduce a new meter.
The `db_activity` SQL function provides a session-by-session cumulative statistic of activity. The fields are:

- `rnd` - Count of random index lookups. Each first row of a select or insert counts as one, regardless of whether something was found.
- `seq` - Count of sequential rows. Every move to the next row on a cursor counts as 1, regardless of whether conditions match.
- `same seg` - For the column store only; counts how many times the next row in a vectored join using an index falls in the same segment as the previous random access. A segment is the stretch of rows between entries in the sparse top-level index on the column projection.
- `same pg` - Counts how many times a vectored index join finds the next match on the same page as the previous one.
- `same par` - Counts how many times the next lookup in a vectored index join falls on a different page than the previous one but still under the same parent.
- `disk` - Counts how many disk reads were made, including any speculative reads initiated.
- `spec disk` - Counts speculative disk reads.
- `messages` - Counts cluster interconnect messages.
- `B (KB, MB, GB)` - The total length of the cluster interconnect messages.
- `fork` - Counts how many times a thread was forked (started) for query parallelization.

The numbers are given with 4 significant digits and a scale suffix: G is 10^9 (1,000,000,000); M is 10^6 (1,000,000); K is 10^3 (1,000).
We run 2000 query mixes with 16 Users. The special `http` account keeps a cumulative account of all activity on web server threads.
SELECT db_activity (2, 'http');
1.674G rnd 3.223G seq 0 same seg 1.286G same pg 314.8M same par 6.186M disk 6.461M spec disk 0B / 0 messages 298.6K fork
We see that random access dominates. The `seq` number is about twice the `rnd` number, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, `same seg` is 0; `same pg` indicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one.
There are more speculative reads than disk reads which is an artifact of counting some concurrently speculated reads twice. This does indicate that speculative reads dominate. This is because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes.
Now let us look at the same reading after 2000 mixes with 16 Users at 100 Mt.
234.3M rnd 420.5M seq 0 same seg 188.8M same pg 29.09M same par 808.9K disk 919.9K spec disk 0B / 0 messages 76K fork
We note that the ratios between the random, sequential, and same page/parent counts are about the same. The sequential number is, if anything, a bit smaller in proportion. The count of random accesses for the 100 Mt run is 14% of the count for the 1000 Mt run. The count of query parallelization threads is also much lower, since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that, according to the cost model guess, the thread must have at least 5 ms worth of work.
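Schematically, the parallelization decision could be pictured as below. This is an illustrative sketch, not Virtuoso's actual code; all names (`estimated_cost_ms`, `launch_worker`, and so on) are hypothetical.

```c
/* Illustrative sketch of the "at least 5 ms of work per forked thread" rule. */
#include <stdbool.h>

#define MIN_WORK_MS 5.0              /* below this, forking costs more than it saves */

typedef struct fragment fragment_t;   /* a parallelizable piece of a query plan */

extern double estimated_cost_ms (fragment_t *frag);   /* cost-model guess */
extern bool   worker_available  (void);               /* a pool thread is free */
extern void   launch_worker     (fragment_t *frag);   /* counted in the 'fork' statistic */
extern void   run_inline        (fragment_t *frag);   /* stay on the current thread */

void
maybe_parallelize (fragment_t *frag)
{
  if (estimated_cost_ms (frag) >= MIN_WORK_MS && worker_available ())
    launch_worker (frag);
  else
    run_inline (frag);
}
```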
We note that the 100 Mt throughput is a little over three times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven times faster instead, given that it does that much less work.
We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor per se. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there.
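As a rough check of those comparison counts, viewing a one-off lookup as a binary search over the row count:

```latex
\log_2(10^8) \approx 26.6 \qquad \log_2(10^9) \approx 29.9
```

so a tenfold increase in rows adds only about three comparisons to a single random lookup.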
To elucidate this last question, we look at the CPU profiles. We take an oprofile of 100 Single User mixes at both scales.
For 100 Mt:
61161 10.1723 cmpf_iri64n_iri64n_anyn_gt_lt
31321 5.2093 box_equal
19027 3.1646 sqlo_parse_tree_has_node
15905 2.6453 dk_alloc
15647 2.6024 itc_next_set_neq
12702 2.1126 itc_vec_split_search
12487 2.0768 itc_dive_transit
11450 1.9044 itc_bm_vec_row_check
10646 1.7706 itc_page_rcf_search
9223 1.5340 id_hash_get
9215 1.5326 gen_qsort
8867 1.4748 sqlo_key_part_best
8807 1.4648 itc_param_cmp
8062 1.3409 cmpf_iri64n_iri64n
6820 1.1343 sqlo_in_list
6005 0.9987 dc_iri_id_cmp
5905 0.9821 dk_free_tree
5801 0.9648 box_hash
5509 0.9163 dks_esc_write
5444 0.9054 sql_tree_hash_1
For 1000 Mt:
754331 31.4149 cmpf_iri64n_iri64n_anyn_gt_lt
146165 6.0872 itc_vec_split_search
144795 6.0301 itc_next_set_neq
131671 5.4836 itc_dive_transit
110870 4.6173 itc_page_rcf_search
66780 2.7811 gen_qsort
66434 2.7667 itc_param_cmp
58450 2.4342 itc_bm_vec_row_check
55213 2.2994 dk_alloc
47793 1.9904 cmpf_iri64n_iri64n
44277 1.8440 dc_iri_id_cmp
39489 1.6446 cmpf_int64n
36880 1.5359 dc_append_bytes
36601 1.5243 dv_compare
31286 1.3029 dc_any_value_prefetch
25457 1.0602 itc_next_set
20852 0.8684 box_equal
19895 0.8285 dk_free_tree
19698 0.8203 itc_page_insert_search
19367 0.8066 dc_copy
The top function in both is the compare for an equality of two leading IRIs and a range for the trailing any. This corresponds to the range check in Q5. At the larger scale this is three times more important. At the smaller scale, the share of query optimization is about 6.5 times greater. The top function in this category is box_equal, with 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile.
From this sample, it appears that ten times the scale means seven times the database operations. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. From these numbers, we could say that Virtuoso is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt.
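A back-of-the-envelope check of that claim from the two `db_activity` samples above, assuming both cover comparable 2000-mix runs:

```latex
\frac{\mathrm{rnd}_{1000\,\mathrm{Mt}}}{\mathrm{rnd}_{100\,\mathrm{Mt}}}
  = \frac{1.674\,\mathrm{G}}{234.3\,\mathrm{M}} \approx 7.1,
\qquad
\frac{\mathrm{QMpH}_{100\,\mathrm{Mt}}}{\mathrm{QMpH}_{1000\,\mathrm{Mt}}} \approx 3,
\qquad
\frac{7.1}{3} \approx 2.4
```

i.e., the 1000 Mt configuration pushes roughly twice as many database operations per unit of real time.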
We may conclude that different BSBM scales measure different things. The TPC workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales.
This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix.
Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks.
So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read ahead would require more temporal locality before kicking in.
Still, the scheme was not really good since the rest of the extent would go for background-read and the triggering read would be done right then, leading to extra seeks. Well, this is good for latency but bad for throughput. So I changed this too, going to an "elevator only" scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality with its attendant 10+ ms penalty.
We keep in mind that the test we target is BSBM warm-up time, which is purely a throughput business. One could have timeouts and could penalize queries that sacrificed too much latency to throughput.
We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run.
We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since at least one thread's random I/Os get to go to multiple threads. The benefit is less in multiuser situations where disks are randomly busy anyhow.
In the previous I/O experiments, we saw that with vectored read ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily-triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset.
In Virtuoso 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the cache quite fast, but was useless after the cache was full. The extent read ahead first implemented in 6 was less aggressive, but would continue working with full cache and did in fact help with shifts in the working set.
The next logical step is to combine the vector and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to this extent within the time window, we just add all the uncached allocated pages of the extent to the batch.
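A minimal sketch of that combined policy follows. The data structures and helper names are hypothetical, not Virtuoso internals; the point is only the shape of the decision: for each page we know we will need, either speculate on its whole extent or queue just that page.

```c
/* Hypothetical sketch of combined vectored + extent read-ahead. */
#include <stddef.h>
#include <time.h>

#define EXTENT_PAGES      256   /* a 2MB extent holds 256 8K pages */
#define REVISIT_WINDOW_S  20    /* "been to this extent recently" window */

typedef struct extent extent_t;

extern extent_t *extent_of      (long page_no);
extern time_t    last_touch     (extent_t *ext);
extern int       page_cached    (long page_no);
extern int       page_allocated (extent_t *ext, int slot);
extern long      extent_page    (extent_t *ext, int slot);
extern void      queue_read     (long page_no);   /* goes to the elevator queue */

void
schedule_read_ahead (const long *needed_pages, size_t n)
{
  time_t now = time (NULL);
  for (size_t i = 0; i < n; i++)
    {
      extent_t *ext = extent_of (needed_pages[i]);
      if (now - last_touch (ext) <= REVISIT_WINDOW_S)
        {
          /* Hot extent: add every allocated, uncached page of it to the batch. */
          for (int s = 0; s < EXTENT_PAGES; s++)
            {
              long pg = extent_page (ext, s);
              if (page_allocated (ext, s) && !page_cached (pg))
                queue_read (pg);
            }
        }
      else if (!page_cached (needed_pages[i]))
        queue_read (needed_pages[i]);   /* cold extent: only the page we need */
    }
}
```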
With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5000 to 20000 pages. The SSD starting time drops to about 120 seconds from cold start to reach 1200% CPU. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates around 14 MB/s per drive, mostly reading chunks of an average of seventy-one (71) 8K pages.
The BSBM workload does not offer better possibilities for optimization, short of pre-reading the whole database, which is not practical at large scales.
First we start from cold disk, with and without mandatory read of the whole extent on the touch.
Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes:
0: 151560.82 ms, total: 151718 ms
1: 179589.08 ms, total: 179648 ms
2: 71974.49 ms, total: 72017 ms
3: 102701.73 ms, total: 102729 ms
4: 58834.41 ms, total: 58856 ms
5: 65926.34 ms, total: 65944 ms
6: 68244.69 ms, total: 68274 ms
7: 39197.15 ms, total: 39215 ms
8: 45654.93 ms, total: 45674 ms
9: 34850.30 ms, total: 34878 ms
10: 100061.30 ms, total: 100079 ms
The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read was 16 ms.
With vectored read-ahead and full extents only, i.e., max speculation:
0: 178854.23 ms, total: 179034 ms
1: 110826.68 ms, total: 110887 ms
2: 19896.11 ms, total: 19941 ms
3: 36724.43 ms, total: 36753 ms
4: 21253.70 ms, total: 21285 ms
5: 18417.73 ms, total: 18439 ms
6: 21668.92 ms, total: 21690 ms
7: 12236.49 ms, total: 12267 ms
8: 14922.74 ms, total: 14945 ms
9: 11502.96 ms, total: 11523 ms
10: 15762.34 ms, total: 15792 ms
...
90: 1747.62 ms, total: 1761 ms
91: 1701.01 ms, total: 1714 ms
92: 1300.62 ms, total: 1318 ms
93: 1873.15 ms, total: 1886 ms
94: 1508.24 ms, total: 1524 ms
95: 1748.15 ms, total: 1761 ms
96: 2076.92 ms, total: 2090 ms
97: 2199.38 ms, total: 2212 ms
98: 2305.75 ms, total: 2319 ms
99: 1771.91 ms, total: 1784 ms
Scale factor: 2848260
Number of warmup runs: 0
Seed: 808080
Number of query mix runs
(without warmups): 100 times
min/max Querymix runtime: 1.3006s / 178.8542s
Elapsed runtime: 872.993 seconds
QMpH: 412.374 query mixes per hour
The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%.
We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better.
Then the same with cold SSDs. First with no speculation:
0: 5177.68 ms, total: 5302 ms
1: 2570.16 ms, total: 2614 ms
2: 1353.06 ms, total: 1391 ms
3: 1957.63 ms, total: 1978 ms
4: 1371.13 ms, total: 1386 ms
5: 1765.55 ms, total: 1781 ms
6: 1658.23 ms, total: 1673 ms
7: 1273.87 ms, total: 1289 ms
8: 1355.19 ms, total: 1380 ms
9: 1152.78 ms, total: 1167 ms
10: 1787.91 ms, total: 1802 ms
...
90: 1116.25 ms, total: 1128 ms
91: 989.50 ms, total: 1001 ms
92: 833.24 ms, total: 844 ms
93: 1137.83 ms, total: 1150 ms
94: 969.47 ms, total: 982 ms
95: 1138.04 ms, total: 1149 ms
96: 1155.98 ms, total: 1168 ms
97: 1178.15 ms, total: 1193 ms
98: 1120.18 ms, total: 1132 ms
99: 1013.16 ms, total: 1025 ms
Scale factor: 2848260
Number of warmup runs: 0
Seed: 808080
Number of query mix runs
(without warmups): 100 times
min/max Querymix runtime: 0.8201s / 5.1777s
Elapsed runtime: 127.555 seconds
QMpH: 2822.321 query mixes per hour
The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%.
Now, SSDs with max speculation.
0: 44670.34 ms, total: 44809 ms
1: 18490.44 ms, total: 18548 ms
2: 7306.12 ms, total: 7353 ms
3: 9452.66 ms, total: 9485 ms
4: 5648.56 ms, total: 5668 ms
5: 5493.21 ms, total: 5511 ms
6: 5951.48 ms, total: 5970 ms
7: 3815.59 ms, total: 3834 ms
8: 4560.71 ms, total: 4579 ms
9: 3523.74 ms, total: 3543 ms
10: 4724.04 ms, total: 4741 ms
...
90: 673.53 ms, total: 685 ms
91: 534.62 ms, total: 545 ms
92: 730.81 ms, total: 742 ms
93: 1358.14 ms, total: 1370 ms
94: 1098.64 ms, total: 1110 ms
95: 1232.20 ms, total: 1243 ms
96: 1259.57 ms, total: 1273 ms
97: 1298.95 ms, total: 1310 ms
98: 1156.01 ms, total: 1166 ms
99: 1025.45 ms, total: 1034 ms
Scale factor: 2848260
Number of warmup runs: 0
Seed: 808080
Number of query mix runs
(without warmups): 100 times
min/max Querymix runtime: 0.4725s / 44.6703s
Elapsed runtime: 269.323 seconds
QMpH: 1336.683 query mixes per hour
The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%.
The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merged reads; the differences were small. The max I/O was 353 MB/s, and the average 173 MB/s; average CPU was 113%.
We see that the start latency is quite a bit longer than without speculation and the CPU % is lower due to higher latency of individual I/O. The I/O rate is fair. We would expect more throughput however.
We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive requests of 1 page, does not make a lot of difference. The peak I/O is a bit higher; overall throughput is a bit lower.
We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K 4KB random I/Os promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s sequential read with `cat file > /dev/null` and two drives busy. With 4 drives busy, this does not get better. The best 30-second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with the `cat` to `/dev/null` figure. We will later test with 8 SSDs on better controllers.
Note that the average I/O and CPU are averages over 30 second measurement windows; thus for short running tests, there is some error from the window during which the activity ended.
Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max speculation read-ahead filled the cache in 17 minutes, with an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead; one which in fact never gets triggered with the BSBM Explore in steady state. The vectored read-ahead is kept on since this by itself does not read pages that are not needed. However, the vectored read-ahead does not run either, because the data that is accessed in larger batches is already in memory. Thus there remains a trickle of an average 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving RDF literal strings, presumably on behalf of the DESCRIBE query in the mix. This query touches things no other query touches and does so one subject at a time, in a way that can neither be anticipated nor optimized.
The Virtuoso 7 column store will deal with this better because it is more space efficient overall. If we apply stream compression to literals, these will go in under half the space, while quads will go in maybe one-quarter the space. Thus 3000 Mt all from memory should be possible with 72 GB RAM. 1000 Mt row-wise did fit in 72 GB RAM, except for the random literals accessed by the DESCRIBE. This alone drops throughput to under a third of the memory-only throughput if using HDDs. SSDs, on the other hand, can largely neutralize this effect.
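As a rough arithmetic check of that estimate (the blended compression factor is an assumption; the exact quad/literal mix is not given here): if quads shrink to about one-quarter and literals to about one-half of their row-wise size, the column-wise footprint is very roughly one-third of the row-wise one, so

```latex
3000\,\mathrm{Mt} \times \tfrac{1}{3} \;\approx\; 1000\,\mathrm{Mt}\ \text{row-wise equivalent} \;\approx\; 72\,\mathrm{GB}
```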
We have looked at basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks without need for reconfiguration or complex I/O policies. We have been able to get a decent read rate under conditions of server warm-up or shift of working set even with HDDs.
More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns.
As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store simply must pre-read eagerly. This is not hard to do and can be quite useful.
There are two approaches:
- run twice, or otherwise make sure one runs from memory, and forget about I/O; or
- make rules and metrics for warm-up.
We will see if the second is possible with BSBM.
From this starting point, we look at various ways of scheduling I/O in Virtuoso, using a 1000 Mt BSBM database on sets of HDDs (hard disk drives) and SSDs (solid-state drives). We will see that SSDs in this specific application can make a significant difference.
In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.
Storage Arrays

| Type | Quantity | Maker | Size | Speed | Interface speed | Controller | Drive Cache | RAID |
|---|---|---|---|---|---|---|---|---|
| SSD | 4 | Crucial | 128 GB | N/A | 6Gbit SATA | RocketRaid 640 | 128 MB | None |
| HDD | 4 | Samsung | 1000 GB | 7200 RPM | 3Gbit SATA | Intel ICH on Supermicro motherboard | 16 MB | None |
We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file > /dev/null`.
The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.
Two different read-ahead schemes are used:
With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.
With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.
In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.
There are a few different possibilities for the physical I/O:
- Using a separate read system call for each page: a thread finds it needs a page and reads it. There may be several open file descriptors on a file, so that many such calls can proceed concurrently on different threads; the OS will order the operations.
- Using Unix asynchronous I/O (`aio.h`), with the `aio_*` and `lio_listio` functions.
- Using single read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.
The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.
There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.
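To illustrate the `aio.h` option above, here is a self-contained sketch that submits a batch of 8K page reads with `lio_listio` and waits for completion. The file name and page numbers are placeholders, error handling is trimmed, and on Linux the program links with -lrt; this is an illustration, not Virtuoso's I/O code.

```c
/* Sketch: batch-read a set of 8K pages with POSIX AIO (lio_listio). */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 8192
#define N_PAGES 4

int
main (void)
{
  int fd = open ("stripe-1.db", O_RDONLY);          /* placeholder stripe file */
  if (fd < 0) { perror ("open"); return 1; }

  long pages[N_PAGES] = { 10, 11, 12, 260 };         /* sorted, elevator order */
  struct aiocb  cbs[N_PAGES];
  struct aiocb *list[N_PAGES];
  char *bufs = malloc ((size_t) N_PAGES * PAGE_SZ);

  for (int i = 0; i < N_PAGES; i++)
    {
      memset (&cbs[i], 0, sizeof cbs[i]);
      cbs[i].aio_fildes     = fd;
      cbs[i].aio_offset     = (off_t) pages[i] * PAGE_SZ;
      cbs[i].aio_buf        = bufs + (size_t) i * PAGE_SZ;
      cbs[i].aio_nbytes     = PAGE_SZ;
      cbs[i].aio_lio_opcode = LIO_READ;
      list[i] = &cbs[i];
    }

  /* Submit the whole batch and wait until every read has completed. */
  if (lio_listio (LIO_WAIT, list, N_PAGES, NULL) != 0)
    perror ("lio_listio");

  for (int i = 0; i < N_PAGES; i++)
    printf ("page %ld: %zd bytes\n", pages[i], aio_return (&cbs[i]));

  free (bufs);
  return 0;
}
```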
While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.
The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.
This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.
The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.
The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.
The database space allocation gives each index a number of 2MB extents, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle the pages of an all-dirty subtree before writing it out, so that physical order matches key order. We will look at some tricks in this vein with the column store.
For the sake of simplicity we only run 7 Single with the 1000 Mt scale.
The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.
The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.
The result with HDDs and vectored read-ahead would be worse, since vectored read-ahead leads to smaller read-ahead batches and less contiguous read patterns. The individual read times here are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.
There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.
We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so.
Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want.
So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see.
So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.
The 4 HDDs were in all cases a combined 700% busy. But with the new setting we get longer requests, most often full extents, which gives a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So the HDDs have 1.7 concurrent operations pending each, but the batch size drops, dropping the throughput.
Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more.
We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum.
Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.
A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.
Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.
Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty pages], thread pools [e.g., web server], or anything else thread-related)? We will run with 8 and 32 cores, so if there are settings controlling the number of read/write (R/W) locks, mutexes, or the like used for serializing diverse things, these should be set accordingly to minimize contention.
The following three settings are all in the `[Parameters]` section of the virtuoso.ini file.

- `AsyncQueueMaxThreads` controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better.
- `ThreadsPerQuery` is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better.
- `IndexTreeMaps` is the number of mutexes over which control for buffering an index tree is split. This can generally be left at the default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting it to 64, 128, or 512 may be beneficial.
A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.
In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients, a lower value may be better, in which case excess requests wait for a thread to become available.
Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads; and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count.
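For example, a sketch of matching thread pools, sized for a test with a few tens of concurrent SPARQL clients (the specific numbers are illustrative assumptions):

[Parameters]
ServerThreads = 100 ; total thread pool; keep this larger than the HTTP server thread count

[HTTPServer]
ServerThreads = 64 ; maximum concurrent SPARQL protocol requests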
File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.
It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.

In the [Parameters] section of the virtuoso.ini file, set FDsPerFile to (the number of concurrent threads * 1.5) ÷ the number of distinct database files.
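A sketch of the corresponding virtuoso.ini fragment for 4 independent disks follows. The segment size, file paths, and queue names are illustrative assumptions; check the striping section of the Virtuoso documentation for the exact syntax of your version.

[Database]
Striping = 1

[Striping]
; one stripe file per disk, each with its own I/O queue thread (q1..q4)
Segment1 = 100M, /disk1/virt/db-seg1.db = q1, /disk2/virt/db-seg2.db = q2, /disk3/virt/db-seg3.db = q3, /disk4/virt/db-seg4.db = q4

[Parameters]
; e.g., 24 concurrent threads * 1.5 / 4 database files = 9
FDsPerFile = 9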
There are no SSD specific settings.
Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?
Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.
Use the built-in bulk load facility, i.e.,

ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');

For example,

SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');

Then run rdf_loader_run () on enough connections. For example, the shell command

isql <port> <user> <password> exec="rdf_loader_run ();" &

starts one loader in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see the isql documentation for details).
See the BSBM disclosure report for an example load script.
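Putting the pieces together, a minimal load script for 8 streams might look like the sketch below. The port, credentials, data directory, and graph IRI are placeholders, and this is not the exact script used for the runs reported here.

#!/bin/sh
PORT=1111 ; USER=dba ; PASS=dba

# register the files to be loaded
isql $PORT $USER $PASS exec="ld_dir ('/data/bsbm', '*.n3', 'http://example.org/bsbm');"

# start 8 parallel loader streams, one per core
for i in 1 2 3 4 5 6 7 8 ; do
    isql $PORT $USER $PASS exec="rdf_loader_run ();" &
done
wait

# make the loaded state durable
isql $PORT $USER $PASS exec="checkpoint;"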
What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.
Execute

CHECKPOINT;

through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol.
What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.
No special settings are needed for the trickle load; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

DefaultIsolation = 4

Transaction isolation cannot be set over the SPARQL protocol.
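For reference, DefaultIsolation takes a numeric code. To the best of our understanding the mapping is as sketched below; verify it against the documentation for your version.

; assumed [Parameters] DefaultIsolation codes
; 1 = READ UNCOMMITTED, 2 = READ COMMITTED, 4 = REPEATABLE READ, 8 = SERIALIZABLE
DefaultIsolation = 4 ; REPEATABLE READ, the default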
NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.
What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.
In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.
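For the 192 GB test machine, a back-of-the-envelope sizing along these lines might be as follows. The MaxDirtyBuffers value is a common rule of thumb of roughly 3/4 of NumberOfBuffers, included here as an assumption rather than a measured recommendation.

; ~2/3 of 192 GB is 128 GB; 128 GB / ~10 KB per cached page ≈ 13,000,000 buffers
NumberOfBuffers = 13000000
MaxDirtyBuffers = 10000000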
What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache?
In an isql session, execute

status ();

The second result paragraph gives counts of total, used, and dirty buffers. If the used-buffer count is steady and less than the total, and if the disk read count on the line below it does not increase, the system is running from memory. The cached format is the same as the disk-based format.
What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.
Execute in an isql session:

CHECKPOINT;
SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;

The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, and RDF_OBJ are for the dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for the text index.
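To turn the page counts into an overall size figure, the same view can be aggregated. A minimal sketch, relying only on the iss_pages column mentioned above:

SELECT SUM (iss_pages) AS total_pages, SUM (iss_pages) * 8 / 1024 AS total_mb FROM sys_index_space_stats;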
If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph.
The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute

STATISTICS DB.DBA.RDF_QUAD;
For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index?

The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order.
For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?
In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of 10000 to 50000 or so if this is seen to be useful.

To change this on the fly, the following can be issued through an isql session:

cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );
The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.

STATUS ('cluster');    -- whole cluster
STATUS ('cluster_d');  -- process-by-process
Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?
For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

StopCompilerWhenXOverRun = 1
When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

DefaultIsolation = 2
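As a per-session alternative, something along these lines can be issued over a SQL connection before running the query mix. This is a sketch; check the accepted value strings against the SQL reference for your version.

SET ISOLATION = 'committed'; -- READ COMMITTED for this session only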
The load time in the recent Berlin report was measured with the wrong function and, so far as we can tell, without multiple threads. The intermediate cut of Virtuoso they tested also had broken SPARQL/Update (also known as SPARUL) features. We have since fixed this, and give here the right numbers.
In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:
6 Single is the generally available single server configuration of Virtuoso. Whether this is open source or not does not make a difference.
6 Cluster is the generally available commercial only cluster-capable Virtuoso.
7 Single is the next generation single server Virtuoso, about to be released as a preview.
To understand the numbers, we must explain how these differ from each other in execution:
6 Single has one thread-per-query, and operates on one state of the query at a time.
6 Cluster has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states. Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together.
7 Single has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states. This means, for example, that index lookups get large numbers of parameters which are then sorted to give an ascending search pattern that benefits from locality, so the n * log(n) index access for the batch becomes closer to linear if the data accessed has any locality. Furthermore, if there are many operands to an operator, these can be split over multiple threads. Also, scans of consecutive rows can be split before the scan onto multiple threads, each doing a range of the scan. These features are called vectored execution and query parallelization. These techniques will also be applied to the cluster variant in due time.
The version 6 and 7 variants discussed here use the same physical storage layout with row-wise key compression. Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space. This column store option is not used here because it still has some problems with random order inserts.
We will first consider loading. Below are the load times and rates for 7 at each scale.
7 Single

Scale | Rate (quads per second) | Load time (seconds) | Checkpoint time (seconds) |
---|---|---|---|
100 Mt | 261,366 | 301 | 82 |
200 Mt | 216,000 | 802 | 123 |
1000 Mt | 130,378 | 6641 | 1012 |
In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale.
We also loaded the smallest data set with 6 Single using the same load script.
6 Single

Scale | Rate (quads per second) | Load time (seconds) | Checkpoint time (seconds) |
---|---|---|---|
100 Mt | 74,713 | 1192 | 145 |
CPU time with 6 Single was 8047 seconds. We compare this to 4453 seconds of CPU for the same load on 7 Single. The CPU% during the run was on either side of 700% for 6 Single and 1300% for 7 Single. Note that high percentages involve core threads, not real cores.
The difference is mostly attributable to vectoring and the introduction of a non-transactional insert. The 6 Single inserts transactionally but makes very frequent commits and writes no log, resulting in de facto non-transactional behavior, but there is still a lock and commit cycle. Inserts in RDF load usually exhibit locality on all of SPOG. Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go. Contention on page read-write locks is less because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possible transaction locks for each row.
Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed. In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful. Writes are all in-place, and no delta-merge mechanism is involved. For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block. Repeatable and serializable readers would block before an uncommitted insert.
Now for the run (larger numbers indicate more queries executed, and are therefore better):
6 Single Throughput (QMpH, query mixes per hour)

Scale | Single User | 16 User |
---|---|---|
100 Mt | 7641 | 29433 |
200 Mt | 6017 | 13335 |
1000 Mt | 1770 | 2487 |
7 Single Throughput (QMpH, query mixes per hour)

Scale | Single User | 16 User |
---|---|---|
100 Mt | 11742 | 72278 |
200 Mt | 10225 | 60951 |
1000 Mt | 6262 | 24672 |
The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state. Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed. For the memory-only scales, we run 500 mixes twice, and take the timing of the second run.
Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%. This also explains why adding clients gives a larger boost at the smaller scale.
Now let us look at the relative effects of parallelizing and vectoring in 7 Single. We run 50 mixes of Single User Explore: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread. Then we set the vector size to 1, meaning that the query pipeline runs one row at a time. This gets us 1319 QMpH which is a bit worse than 6 Single. This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps.
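For readers who want to repeat this kind of comparison, the two effects can be isolated with settings along the following lines. ThreadsPerQuery is described in the tuning answers earlier in this series; VectorSize is assumed here to be the [Parameters] knob for the vector size, and whether exactly these switches were used for the quoted numbers is not stated, so treat this as a sketch.

[Parameters]
ThreadsPerQuery = 1 ; disable intra-query parallelism
VectorSize = 1 ; run the query pipeline one row at a time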
The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least. The reason for the latter is covered in detail in A Benchmarking Story. We note that while vectoring is primarily geared to better single-thread speed and better cache hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt.
In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring. When moving to more complex workloads, the benefits become more pronounced. For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring. These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later.
The full run details will be supplied at the end of this blog series.
This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks.
We begin right after the publication of the recent Berlin report. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole BSBM matter and the general benchmarking question in forthcoming posts; for now, let's talk about specifics.
In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:
6 Single is the generally available single-instance-server configuration of Virtuoso. Whether this is open source or not does not make a difference.
6 Cluster is the generally available, commercial-only, cluster-capable Virtuoso.
7 Single is the next-generation single-instance-server Virtuoso, about to be released as a preview.
We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM Explore mix at one scale got better throughput as we added more clients, approximately as one would expect based on CPU usage and number of cores, while at another scale this was not so.
At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU going from 200% with 1 client to 1400% with 16 clients but throughput increased by less than 20%.
When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload.
See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous.
Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Quite right, when building with the mutex meter enabled, counting how many times each mutex is acquired and how many times this results in a wait, we found a mutex which gets acquired 600M times in the run, of which an insane 450M result in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex.
Waiting for a mutex is a real train wreck. We have tried spinning a few times before it, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better.
Now, the mutex in question happens to serialize the buffer cache for one specific page of data, one level down from the root of the index for RDF PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties wend their different ways based on the index root.
One might here ask why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only for the first level down. One might also ask why have a mutex in the first place. Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. Same for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have all in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing this with the readers.
Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time.
So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously "good" throughput of 40K QMpH at 100 Mt went to 66K QMpH.
Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. Each partition had a page getting the hits but since the partitioning was on S and S was about-evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue.
We see how detailed analysis of benchmarks can lead to almost an order of magnitude improvements in a short time. This analysis is however both difficult and tedious. It is not readily delegable; one needs real knowledge of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too but application optimization tends to stop at a point where the application is usable. Also stored procedures and specially-tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development but then this falls under benchmarking and evaluation.
So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios.
Correct misleading information about us in the recent Berlin report: the load rate is off the wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance.
Discuss configuration options for Virtuoso.
Tell a story about multithreading and its perils and how vectoring and scale-out can save us.
Analyze the run time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.
Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general.
Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Whether to do things a la TPC or a la TREC? We will hopefully try a bit of both, at least so I have proposed to our partners in LOD2, the EU FP7 that also funded the recent Berlin report.
Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload, the Social Intelligence Benchmark.
Talk about BSBM in specific. What does it measure?
Discuss some experiments with the BI use case of BSBM.
Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure.
The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and RDBMS.
Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of stuff for benchmarking, when at least in some cases licenses prohibit unauthorized publishing of benchmark results, might be seen to conflict with the spirit of the license if not its letter. We will see.
For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor's permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.
In the LOD2 proposal we also in so many words say that we will stretch the limits of the state of the art. This stretching is surely not limited to the project's own products but should also include the general benchmarking aspect. I will say with confidence that running single server benchmarks at a max 200 Mtriples of data is not stretching anything.
So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go for example to Giovanni's cluster at DERI and do 10 and 20 billion triples, this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS who by now should have their own lab. Even Amazon EC2 might be an option, although not the preferred one.
So I asked everybody about config instructions, which produced a certain amount of dismay as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative though since Ontotext and Garlik provided some information. We will look into these this and next week. We will not publish any information without asking first.
In this series of posts I will only talk about OpenLink Software.
I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.
At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogenous data type and possible sort order of the column. Vectored execution means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.
So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso's relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented HASH JOIN and GROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.
At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.
A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time-typing turns out to be small, since data in practice tends to be of homogenous type and with locality of reference in values. Virtuoso's column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogenous types.
The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side.
The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.
LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the "RDF tax," by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.
So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.
For now, our priority is to release the substantial gains that have already been accomplished.
After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper.
Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.
The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.
Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed.
The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?
Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market.
These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of Datalog, is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start.
Stay tuned for more news later this month!
Feature | Description | Benefit |
---|---|---|
Automatic Deployment | Linked Data Pages are now automatically published for every Virtuoso Data Object; users need only load their data into the RDF Quad Store. | Handcrafted URL-Rewrite Rules are no longer necessary. |
HTTP Metadata Enhancements | HTTP Link: header is used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents. | Enables HTTP-oriented tools to work with such relationships and other metadata. |
HTML Metadata Embedding | HTML resource <head /> and <link /> elements and their @rel attributes are used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents. | Enables HTML-oriented tools to work with such relationships and other metadata. |
Hammer Stack Auto-Discovery Patterns | HTML resource <head /> section and <link /> elements, the HTTP Link: header, and XRD-based "host-meta" resources collectively provide structured metadata about Virtuoso hosts, associated Linked Data Spaces, and specific Data Items (Entities). | Enables humans and machines to easily distinguish between Descriptor Resources and their Subjects, irrespective of URI scheme. |
Feature | Description | Benefit |
---|---|---|
New Sponger Cartridges | New cartridges (data access and transformation drivers) for Twitter, Facebook, Amazon, eBay, and others. | Enable users and user agents to deal with the Sponged data spaces as though they were named graphs in a quad store, or tables in an RDBMS. |
New Descriptor Pages | HTML-based descriptor pages are automatically generated. | Descriptor subjects, and the constellation of navigable attribute-and-value pairs that constitute their descriptive representation, are clearly identified. |
Automatic Subject Identifier Generation | De-referenceable data object identifiers are automatically created. | Removes tedium and risk of error associated with nuance-laced manual construction of identifiers. |
Support for OData, JSON, RDFa | Additional data representation and serialization formats associated with Linked Data. | Increases flexibility and interoperability. |
Feature | Description | Benefit |
---|---|---|
Materialized RDF Views | RDF Views over ODBC/JDBC Data Sources can now (optionally) keep the Quad Store in sync with the RDBMS data source. | Enables high-performance Faceted Browsing while remaining sensitive to changes in the RDBMS data sources. |
CSV-to-RDF Transformation | Wizard-based generation of RDF Linked Data from CSV files. | Speeds deployment of data which may only exist in CSV form as Linked Data. |
Transparent Data Access Binding | SPASQL (SPARQL Query Language integrated into SQL) is usable over ODBC, JDBC, ADO.NET, OLEDB, or XMLA connections. | Enables Desktop Productivity Tools to transparently work with any blend of RDBMS and RDF data sources. |
Feature | Description | Benefit |
---|---|---|
Quad Store to Quad Store Replication | High-fidelity graph-data replication between one or more database instances. | Enables a wide variety of deployment topologies. |
Delta Engine | Automated generation of deltas at the named-graph-level, matches transactional replication offered by the Virtuoso SQL engine. | Brings RDF replication on par with SQL replication. |
| Deep integration within Quad Store as an optional mechanism for shipping deltas. | Enables push-based data replication across a variety of topologies. |
Feature | Description | Benefit |
---|---|---|
| Use | Enables application of sophisticated security and data access policies to Web Services (e.g., SPARQL endpoint) and actual DBMS objects. |
Webfinger | Supports using mailto: and acct: URIs in the context of | Enables more intuitive identification of people and organizations. |
Fingerpoint | Similar to Webfinger but does not require XRDS resources; instead, it works directly with SPARQL endpoints exposed using auto-discovery patterns in the <head /> section of HTML documents. | Enables more intuitive identification of people and organizations. |
I will here recap some of the points discussed, since these can be of broader interest.
Why is there no single dominant vendor?

The field is young. We can take the relational database industry as a historical precedent. From the inception of the relational database around 1970, it took 15 years for the relational model to become mainstream. "Mainstream" here does not mean dominant in installed base, but does mean something that one tends to include as a component in new systems. The figure of 15 years might repeat with RDF, from around 2000 for the first beginnings to 2015 for routine inclusion in new systems, where applicable.
This does not necessarily mean that the RDF graph data model (or more properly, EAV+CR; Entity-Attribute-Value + Classes and Relationships) will take the place of the RDBMS as the preferred data backbone. This could mean that RDF model serialization formats will be supported as data exchange mechanisms, and that systems will integrate data extracted by semantic technology from unstructured sources. Some degree of EAV storage is likely to be common, but on-line transactional data is guaranteed to stay pure relational, as EAV is suboptimal for OLTP. Analytics will see EAV alongside relational especially in applications where in-house data is being combined with large numbers of outside structured sources or with other open sources such as information extracted from the web.
EAV offerings will become integrated by major DBMS vendors, as is already the case with Oracle. Specialized vendors will exist alongside these, just as is the case with relational databases.
Can there be a positive reinforcement cycle (e.g., building cars creates a need for road construction, and better roads drive demand for more cars)? Or is this an up-front infrastructure investment that governments make for some future payoff or because of science-funding policies?
The Document Web did not start as a government infrastructure initiative. The infrastructure was already built, albeit first originating with the US defense establishment. The Internet became ubiquitous through the adoption of the Web. The general public's adoption of the Web was bootstrapped by all major business and media adopting the Web. They did not adopt the web because they particularly liked it, as it was essentially a threat to the position of media and to the market dominance of big players who could afford massive advertising in this same media. Adopting the web became necessary because of the prohibitive opportunity cost of not adopting it.
A similar process may take place with open data. For example, in E-commerce, vendors do not necessarily welcome easy-and-automatic machine-based comparison of their offerings against those of their competitors. Publishing data will however be necessary in order to be listed at all. Also, in social networks, we have the identity portability movement which strives to open the big social network silos. Data exchange via RDF serializations, as already supported in many places, is the natural enabling technology for this.
Will the web of structured data parallel the development of web 2.0?
Web 2.0 was about the blogosphere, exposure of web site service APIs, creation of affiliate programs, and so forth. If the Document Web was like a universal printing press, where anybody could publish at will, Web 2.0 was a newspaper, bringing the democratization of journalism, creating the blogger, the citizen journalist. The Data Web will create the Citizen Analyst, the Mini Media Mogul (e.g., social-network-driven coops comprised of citizen journalists, analysts, and other content providers such as video and audio producers and publishers). As the blogosphere became an alternative news source to the big media, the web of data may create an ecosystem of alternative data products. Analytics is no longer a government or big business only proposition.
Is there a specifically semantic market or business model, or will semantic technology be exploited under established business models and merged as a component technology into existing offerings?
We have seen a migration from capital expenses to operating expenses in the IT sector in general, as exemplified by cloud computing's Platform as a Service (PaaS) and Software as a Service (SaaS). It is reasonable to anticipate that this trend will continue to Data as a Service (DaaS). Microsoft Odata and Dallas are early examples of this and go towards legitimizing the data as service concept. DaaS is not related to semantic technology per se, but since this will involve integration of data, RDF serializations will be attractive, especially given the takeoff of linked data in general. The data models in Odata are also much like RDF, as both stem from EAV+CR, which makes for easy translation and a degree of inherent interoperability.
The integration of semantic technology into existing web properties and business applications will manifest to the end user as increased serendipity. The systems will be able to provide more relevant and better contextualized data for the user's situation. This applies equally to the consumer and business user cases.
Identity virtualization in the forms of WebID and Webfinger — making first-class de-referenceable identifiers of mailto: and acct: schemes — is emerging as a new way to open social network and Web 2.0 data silos.
On the software production side, especially as concerns data integration, the increased schema- and inference-flexibility of EAV will lead to a quicker time to answer in many situations. The more complex the task or the more diverse the data, the higher the potential payoff. Data in cyberspace is mirroring the complexity and diversity of the real world, where heterogeneity and disparity are simply facts of life, and such flexibility is becoming an inescapable necessity.
Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.
I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.
One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table?
The answer is that if one accesses a lot of these literals, the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there do in fact exist reports in bioinformatics with 100,000 strings. Now take the worst abuse of SPARQL: a regexp over all literals in a property of a given class. With a column store, this is a scan of the column; with RDF, it is a three-table join. The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.
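To make the contrast concrete, here is a sketch of the two formulations in SPARQL as accepted by Virtuoso. The class and property choices are placeholders, and bif:contains assumes a text index over the literals, as discussed above.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# regexp scan over all labels of a class -- the slow formulation
SELECT ?s ?o
WHERE { ?s a <http://example.org/Product> ; rdfs:label ?o .
        FILTER ( regex ( ?o, "camera", "i" ) ) }

# the same intent via the full text index, using the Virtuoso-specific bif:contains
SELECT ?s ?o
WHERE { ?s a <http://example.org/Product> ; rdfs:label ?o .
        ?o bif:contains 'camera' }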
Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code.
Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogenous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.
Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.
With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.
Of the academic papers, SpiderStore (paper not yet available at the time of writing, but it should be soon) and Webpie should be specially noted.
Let us talk about SpiderStore first.
The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.
According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.
This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple, plus should one add a graph to the mix one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag.
We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.
But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.
SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, specially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.
We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.
If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.
Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.
Webpie from VU Amsterdam and the LarKC EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.
Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.
The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT … SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.
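As an illustration of the fixed-point idea, one materialization step for rdfs:subClassOf might look like the sketch below (SPARQL 1.1 syntax; the target graph IRI is a placeholder). A driver would repeat the statement until it reports that no new triples were inserted.

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT { GRAPH <urn:example:inferred> { ?x rdf:type ?super } }
WHERE  { ?x rdf:type ?sub .
         ?sub rdfs:subClassOf ?super .
         FILTER NOT EXISTS { GRAPH <urn:example:inferred> { ?x rdf:type ?super } } }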
We have suggested such an experiment to the LarKC people. We will see.
The queries touch very little data, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of RDF.
Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.
An update stream would make the workload more realistic.
We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics.
So I am publishing the below as a starting point for discussion.
Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation cost is between linear and n * log(n) in the data size. The TPC-H rules can be used for a power (single-user) metric and a throughput (multi-user) metric, where each user submits queries from the mix with different parameters and in a different order. The TPC-H score formula and executive summary formats are directly applicable.
This can be a separate metric from the "restricted" BSBM score. Restricted means "without a full scan with regexp" which will dominate the whole metric at larger scales.
Vendor specific variations in syntax will occur, hence these are allowed but disclosure of specific query text should accompany results. Hints for JOIN
order and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible.
The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after backup post initial load is allowed, resulting in semi-warm cache. Steady-state rules will be defined later, seeing the characteristics of the actual workload.
For each country, list the top 10 product categories, ordered by the count of reviews from the country. (A possible SPARQL shape for this question is sketched after this list.)
Product with the most reviews during its first month on the market
10 products most similar to X, with similarity score based on the count of features in common
Top 10 reviewers of category X (see the SPARQL sketch after this list)
Product with largest increase in reviews in month X compared to month X-minus-1.
Product of category X with largest change in mean price in the last month
Most active American reviewer of Japanese cameras last year
Correlation of price and average review
Features with greatest impact on price — for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature
Country with greatest popularity of products in category X — reviews of category X from country Y divided by total reviews
Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers
Fans of manufacturer — find top reviewers who score manufacturer above their mean score
Products sold only in country X
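To give the flavor, here is a rough SPARQL sketch of the "Top 10 reviewers of category X" question. The property names and the category IRI only approximate the BSBM vocabulary and are not taken from the benchmark definition.

    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
    PREFIX rev:  <http://purl.org/stuff/rev#>

    SELECT ?reviewer (COUNT(?review) AS ?reviewCount)
    WHERE {
      ?review  bsbm:reviewFor  ?product ;
               rev:reviewer    ?reviewer .
      ?product rdf:type        <http://example.org/ProductTypeX> .
    }
    GROUP BY ?reviewer
    ORDER BY DESC(?reviewCount)
    LIMIT 10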
Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries.
For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.
Q6 from the original mix, now allowing use of text index.
Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg (see the sketch after this list).
The same as the previous, but additionally specifying the review author; the intent is that the structured criteria are here more selective than the text.
Difference in the frequency of use of "awesome", "super", and "suck(s)" by American vs. European vs. Asian review authors.
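As an illustration of the intended style for the feature-name query above, here is a sketch using Virtuoso's bif:contains full-text predicate; other stores would use their own text-search extension, and the property names again only approximate the BSBM vocabulary.

    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
    PREFIX rev:  <http://purl.org/stuff/rev#>

    SELECT ?review ?text
    WHERE {
      ?product rdf:type       <http://example.org/ProductTypeMP3Player> .
      ?review  bsbm:reviewFor ?product ;
               rev:text       ?text .
      # Virtuoso-specific full-text match; the search-string syntax may vary by store.
      ?text    bif:contains   '"mp4" AND "ogg"' .
    }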
For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this.
The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.
The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.
For supporting the IR mix, reviews should, in addition to random text, contain the following:
For each feature in the product concerned, add the label of said feature to 60% of the reviews.
Add the names of review author, product, product category, and manufacturer.
The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms.
Skew the review scores so that comparatively expensive products have a smaller chance for a bad review.
During the benchmark run:
1% of products are added;
3% of initial offers are deleted and 3% are added; and
5% of reviews are added.
Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.
The initial bulk load does not have to be transactional in any way.
Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in READ COMMITTED isolation, so that half-inserted products or offers are not seen.
Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.
The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.
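Purely as an illustration, a fragment of such an update-stream file could look like the following; the boundary-marker comment is hypothetical, and the offer properties only approximate the BSBM vocabulary.

    @prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    # ---- possible transaction boundary ----
    <http://example.org/offer/4711>
        a             bsbm:Offer ;
        bsbm:product  <http://example.org/product/99> ;
        bsbm:price    "129.95"^^xsd:double .
    # ---- possible transaction boundary ----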
The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.
The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.
As concerns OpenLink specifically, we have two short-term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and RDF benchmarks in February.
The LOD2 repository is a fusion of the OpenLink LOD Cloud Cache (which includes data from URIBurner and PingTheSemanticWeb) and Sindice, both hosted at DERI. The value-add compared to Sindice or the Virtuoso-based LOD Cloud Cache alone is the merger of the timeliness and ping-driven crawling of Sindice with the SPARQL capability of Virtuoso.
Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to a much better working set, as the data is many times more compact than with the present row-wise key compression.
Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage.
As for benchmarks, I just compiled a draft of suggested extensions to BSBM (the Berlin SPARQL Benchmark). I talked about this with Peter Boncz and Chris Bizer, to the effect that some extensions of BSBM could be done but that the time was a bit short for making an RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational schema and that RDF offers no fundamental edge for the workload.
There was a graph benchmark talk at the TPC workshop at VLDB 2010. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of this is not yet set but LOD2 will in time publish results from such.
We did informally talk about a process for publishing with our colleagues from Franz and Ontotext at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware.
Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-H in SQL and SPARQL. The SQL will be Virtuoso, MonetDB, and possibly VectorWise and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end.
In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it.
LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at CWI to RDF use cases.
This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, in specific queries with Datalog-like rules and recursion.
LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all and end-all of this. Of course, a parallel database like Virtuoso, Greenplum, or Vertica can do map-reduce-style operations under control of the SQL engine. After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the Berkeley Orders Of Magnitude (BOOM) distributed Datalog implementation (Overlog, Dedalus, BLOOM) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed.
From our viewpoint, the project's gains include:
Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility.
Ready to use toolbox for data integration, including schema alignment and resolution of coreference.
Data discovery, summarization and visualization
Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and linked data. In this respect the integration of results may be stronger than often seen in European large scale integrating projects.
The use cases fit the development profile well:
Wolters Kluwer will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology.
Exalead will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources.
The Open Knowledge Foundation will create a portal of all government published data for easy access by citizens.
In all these cases, the integration requirements of schema alignment, resolution of identity, information extraction, and efficient storage and retrieval play a significant role. The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.
Well, Perseus wasn't blogging or checking his email either, when he went to fetch the Gorgon's head. As Joseph Campbell puts it, the hero breaks into a world separate from the ordinary in order to bring back a blessing which will revitalize the community.
Thus, I deliberately withdrew from the public conversation, in faith that it would take care of itself and that I would still not be altogether forgotten. As it happens, I was confirmed in this when recently invited to submit a talk for the Semdata workshop at VLDB 2010.
Great deeds are not only personal accomplishments but also play a role in a broader context. The quest may appear remote and difficult to execute but its outcome can be quite tangible: Andromeda needed no elaborate sales pitch to convince her of the advantages of not being eaten by the sea serpent.
Thus right after the meeting in Sofia last March, I followed the vertical treasure map into the realm of first principles. As Perseus received advice from Athena, so was I informed by the Platonic ideas of locality and concurrency.
The great quests have an outer and inner aspect. Likewise here, bringing the ideas to physical reality gave me a great deal of material on cognitive function itself. For human and computer alike, it appears that the main reason why anything at all works is cache. Locality and parallelism again. Maybe I will say something more about memory, attention, interface, and paradigm some other time. On the other hand, such material is bound to be unpopular even if valid.
By now, you may ask yourself what I am talking about.
We remember that Andromeda's fix was due to her mother, Cassiopeia, having claimed greater beauty than the daughters of the sea-god Poseidon. To transpose the archetype into the present, it is like Tim B-L saying that OWLs (by the way sacred to Athena) are more semantic than Codd's brainchild. Yet the relational community sees RDF as something not quite serious. A matter of scale(s) — just think of the sea serpent.
So, I am talking about what I alluded to in the 2010 New Year's statement on this blog: RDF as a viable alternative to relational for big data. This means that RDF is no longer a specialty niche where, due to the hopeless task of bringing everything into a relational model, the fact of everything taking several times both the time and space is tolerated because there is no real alternative.
The value proposition is that for any current RDF user, the present assets will go four times farther than before with the next release of Virtuoso. For a prospective RDF user, the cost of keeping an ETLed RDF integration warehouse is now in the same ballpark as the relational cost, except that schema is now flexible, and the time to integrate and answer is accordingly shorter. For users of analytics-oriented RDBMS, the next Virtuoso is a full cluster-capable SQL column store. Its merits compared to others in this space will be published later with benchmarks like TPC-H. As an extra bonus for such users, Virtuoso brings SQL federation and a growth path to RDF, should this become interesting.
This is accomplished by introducing a new column-wise compressed-storage engine with corresponding changes to query execution. The general principles are explained in Daniel Abadi's famous Ph.D. thesis. The compression is tuned by the data itself, without user intervention. Further, our implementation remains capable of run-time-typing, thus the column-store advantages to RDF are obtained without going to a task-specific schema. But since data types, even if determined at run-time, are still in practice repetitive, the advantages of running on homogenous vectors are not lost.
When storing an RDF extraction of TPC-H data, we get a storage usage of 6.3 bytes per quad. If you do not care about queries where the predicate is unspecified, the storage requirement drops to 4.7 bytes per quad. Whether storing the data as RDF quads or as Vertica-style multicolumn projections, the working set is about the same. Since having enough of the data in memory is the sine qua non prerequisite of flexible querying, the point is made. QED.
In Virtuoso also, relational remains a bit faster, but a penalty of about 1.3x for RDF is quite tolerable, considering that an a priori schema is no longer needed.
This means that we are coming into an age where the warehouse becomes an ad hoc asset, to be filled with RDF, without the need to develop an a priori universal schema for all data one may ever wish to integrate, now or in the future. The data can be stored as RDF and projected from there into any form that may be needed at any time, whether the target format is more RDF or a task-specific relational schema.
Availability is planned for late 2010, first as a Virtuoso Open Source preview.
The paper shows how we store TPC-H data as RDF with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in Virtuoso.
A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases.
The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics RDBMS, point solutions done with map-reduce or the like, and lastly in some cases from key-value stores with relaxed schema but limited querying.
The benefits of RDF are the ever expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability — you can run an RDF database on a cluster — but a question of relative cost as opposed to alternatives.
The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store, and gets all the compression and locality advantages traditionally associated with such. After memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption.
I do not need to talk here about the benefits of linked data and more or less ad hoc integration per se. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso.
But to realize the ultimate promise of RDF/Linked Data/Semdata, however we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer so much running against the RDBMS, but against map-reduce and key-value stores.
Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest.
The key-value store has some of the same appeal, as it is the DBMS laid bare, so to say, made understandable, without the again intractably-complex questions of fancy query planning and distributed ACID transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard-code a point solution for the business question if one really must — maybe even in map-reduce.
Besides, for some things that go beyond SQL (for example, with graph structures), there really isn't a good solution.
Now, enter Vertica, Greenplum, VectorWise (a MonetDB project derivative from Ingres) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible.
Here we find the next frontier of Semdata. Take Joe Hellerstein et al's work on declarative logic for the data centric data center.
We have heard it many times — when the data is big, the logic must go to it. We can take declarative, location-conscious rules, à la BOOM and BLOOM, and combine these with the declarative query, well-defined semantics, and parallel-database capability of the leading RDF stores. Merge this with the locality, compression, and throughput of the best analytics DBMSs.
Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability.
Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, à la MonetDB, are applicable with minimal if any adaptation.
Such a hybrid database-fusion frontier is relevant because it addresses heterogenous, large-scale data, with operations that are not easy to reduce to SQL, still without loss of the advantages of SQL. Apply this to anything from enhancing the business intelligence process by faster integration, including integration with linked open data to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data.
In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power.
Last week I was at the LOD2 kick off and a LarKC meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.
This post discusses the technical specifics of how we accomplish smooth transactional operation in a database server cluster under different failure conditions. (A higher-level short version was posted last week.) The reader is expected to be familiar with the basics of distributed transactions.
Someone on a cloud computing discussion list called two-phase commit (2PC) the "anti-availability protocol." There is indeed a certain anti-SQL and anti-2PC sentiment out there, with key-value stores and "eventual consistency" being talked about a lot. Indeed, if we are talking about wide-area replication over high-latency connections, then 2PC with synchronously-sharp transaction boundaries over all copies is not really workable.
For multi-site operations, a level of eventual consistency is indeed quite unavoidable. Exactly what the requirements are depends on the application, so I will focus here on operations inside one site.
The key-value store culture seems to focus on workloads where a record is relatively self-contained. The record can be quite long, with repeating fields, different selections of fields in consecutive records, and so forth. Such a record would typically be split over many tables of a relational schema. In the RDF world, such a record would be split even wider, with the information needed to reconstitute the full record almost invariably split over many servers. This comes from the mapping between the text of URIs and their internal IDs being partitioned in one way, and the many indices on the RDF quads each in yet another way.
So it comes to pass that in the data models we are most interested in, the application-level entity (e.g., a user account in a social network) is not a contiguous unit with a single global identifier. The social network user account, that the key-value store would consider a unit of replication mastering and eventual consistency, will be in RDF or SQL a set of maybe hundreds of tuples, each with more than one index, nearly invariably spanning multiple nodes of the database cluster.
So, before we can talk about wide-area replication and eventual consistency with application-level semantics, we need a database that can run on a fair-sized cluster and have cast-iron consistency within its bounds. If such a cluster is to be large and is to operate continuously, it must have some form of redundancy to cover for hardware failures, software upgrades, reboots, etc., without interruption of service.
This is the point of the design space we are tackling here.
There are two basic modes of operation we cover: bulk load, and online transactions.
In the case of bulk load, we start with a consistent image of the database; load data; and finish by making another consistent image. If there is a failure during load, we lose the whole load, and restart from the initial consistent image. This is quite simple and is not properly transactional. It is quicker for filling a warehouse but is not to be used for anything else. In the remainder, we will only talk about online transactions.
When all cluster nodes are online, operation is relatively simple. Each entry of each index belongs to a partition that is determined by the values of one or more partitioning columns of said index. There are no tables separate from indices; the relational row is on the index leaf of its primary key. Secondary indices reference the row by including the primary key. Blobs are in the same partition as the row which contains the blob. Each partition is then stored on a "cluster node." In non fault-tolerant operations, each such cluster node is a single process with exclusive access to its own permanent storage, consisting of database files and logs; i.e., each node is a single server instance. It does not matter if the storage is local or on a SAN, the cluster node is still the only one accessing it.
When things are not fault tolerant, transactions work as follows:
When there are updates, two-phase commit is used to guarantee a consistent result. Each transaction is coordinated by one cluster node, which issues the updates in parallel to all cluster nodes concerned. Sending two update messages instead of one does not significantly impact latency. The coordinator of each transaction is the primary authority for the transaction's outcome. If the coordinator of the transaction dies between the phases of the commit, the transaction branches stay in the prepared state until the coordinator is recovered and can be asked again about the outcome of the transaction. Likewise, if a non-coordinating cluster node with a transaction branch dies between the phases, it will do a roll-forward and ask the coordinator for the outcome of the transaction.
If cluster nodes occasionally crash and then recover relatively quickly, without ever losing transaction logs or database files, this is resilient enough. Everything is symmetrical; there are no cluster nodes with special functions, except for one master node that has the added task of resolving distributed deadlocks.
I suppose our anti-SQL person called 2PC "anti-availability" because in the above situation we have the following problems: if any one cluster node is offline, it is quite likely that no transaction can be committed. This is so unless the data is partitioned on a key with application semantics, and all data touched by a transaction usually stays within a single partition. Then operations could proceed on most of the data while one cluster node was recovering. But, especially with RDF, this is never the case, since keys are partitioned in ways that have nothing to do with application semantics. Further, if one uses XA or Microsoft DTC with the monitor on a single box, this box can become a bottleneck and/or a single point of failure. (Among other considerations, this is why Virtuoso does not rely on any such monitor.) Further, if a cluster node dies never to be heard of again, leaving prepared but uncommitted transaction branches, the rest of the system has no way of telling what to do with them, again unless relying on a monitor that is itself liable to fail.
If transactions have a real world counterpart, it is possible, at least in theory, to check the outcome against the real world state: One can ask a customer if an order was actually placed or a shipment delivered. But when a transaction has to do with internal identifiers of things, for example whether mailto://plaidskirt@hotdate.com has internal ID 0xacebabe, such a check against external reality is not possible.
In a fault tolerant setting, we introduce the following extra elements: Cluster nodes are comprised of "quorums" of mutually-mirroring server instances. Each such quorum holds a partition of the data. Such a quorum typically consists of two server instances, but may have three for extra safety. If all server instances in the quorum are offline, then the cluster node is offline, and the cluster is not fully operational. If at least one server instance in a quorum is online, then the cluster node is online, and the cluster is operational and can process new transactions.
We designate one cluster node (i.e., one quorum of 2 or 3 server instances) to act as a master node, and we set an order of precedence among its member instances. In addition to arbitrating distributed deadlocks, the master instance on duty will handle reports of server instance failures, and answer questions about any transactions left hanging in prepared state by a dead transaction coordinator. If the master on duty fails, the next master in line will either notice this itself in the line of normal business or get a complaint from another server instance about not being able to contact the previous master.
There is no global heartbeat messaging per se, but since connections between server instances are reused long-term, a dropped connection will be noticed and the master on duty will be notified. If all masters are unavailable, that entire quorum (i.e., the master node) is offline and thus (as with any entire node going offline) most operations will fail anyway, unless by chance they do not hit any data managed by that failed quorum.
When it receives a notice of unavailability, the master instance on duty tries to contact the unavailable server instance and if it fails, it will notify all remaining instances that that server instance is removed from the cluster. The effect is that the remaining server instances will stop attempting to access the failed instance. Updates to the partitions managed by the failed server instance are no longer sent to it, which results in updates to this data succeeding, as they are made against the other server instances in that quorum. Updates to the data of the failed server instance will fail in the window of time between the actual failure and the removal, which is typically well under a second. The removal of a failed server instance is delegated to a central authority in order not to have everybody get in each other's way when trying to effect the removal.
If the failed server instance left prepared uncommitted transactions behind, the server instances having such branches will in due order contact the transaction coordinator to ask what should be done. This is a normal procedure for dealing with possibly dropped commit or rollback messages. When they discover that the coordinator has been removed, the master on duty will be contacted instead. Each prepare message of a transaction lists all the server instances participating in the transaction; thus the master can check whether each has received the prepare. If all have the prepare and none has an abort, the transaction is committed. The dead coordinator may not know this or may indeed not have the transaction logged, since it sends the prepares before logging its own prepare. The recovery will handle this though. We note that of the remaining branches, there is at least one copy of the branch with the failed server instance, or else we would have a whole quorum failed. In cases where there are branches participating in an unresolved transaction where all the quorum members have failed, the system cannot decide the outcome, and will periodically retry until at least one member of the failed quorum becomes available.
The most complex part of the protocol is the recovery of a failed server instance. The recovery starts with a normal roll forward from the local transaction log. After this, the server instance will contact the master on duty to ask for its state. Typically, the master will reply that the recovering server instance had been removed and is out of date. When this is established, the recovering server instance will contact a live member of its quorum and ask for sync.
The failed server instance has an approximate timestamp of its last received transaction. It knows this from the roll forward, where time markers are interspersed now and then between transaction records. The live partner then sends its transaction log(s) covering the time from a few seconds before the last transaction of the failed partner up to the present. A few transactions may get rolled forward twice, but this does no harm, since these records have absolute values and no deltas and the second insert of a key is simply ignored.
When the sender of the log reaches its last committed log entry, it asks the recovering server instance to confirm successful replay of the log so far. Having the confirmation, the sender will abort all unprepared transactions affecting it and will not accept any new ones until the sync is completed. If new transactions were committed between sending the last of the log and killing the uncommitted new transactions, these too are shipped to the recovering server instance in their committed or prepared state. When these are also confirmed replayed, the recovering server instance is in exact sync up to the latest transaction. The sender then notifies the rest of the cluster that the sync is complete and that the recovered server instance will be included in any updates of its slice of the data. The time between freeze and re-enable of transactions is the time to replay what came in between the first sync and finishing the freeze. Typically nothing came in, so the time is in milliseconds. If an application got its transaction killed in this maneuver, it will be seen as a deadlock.
If the recovering server instance received transactions in prepared state, it will ask about their outcome as a part of the periodic sweep through pending transactions. One of these transactions could have been one originally prepared by itself, where the prepares had gone out before it had time to log the transaction. Thus, this eventuality too is covered and has a consistent outcome. Failures can interrupt the recovery process. The recovering server instance will have logged as far as it got, and will pick up from this point onward. Real time clocks on the host nodes of the cluster will have to be in approximate sync, within a margin of a minute or so. This is not a problem in a closely connected network.
For simultaneous failure of an entire quorum of server instances (i.e., a set of mutually-mirroring partners; a cluster node), the rule is that the last one to fail must be the first to come back up. In order to have uninterrupted service across arbitrary double failures, one must store things in triplicate; statistically, however, most double failures will not hit cluster nodes of the same group.
The protocol for recovery of failed server instances of the master quorum (i.e., the master cluster node) is identical, except that a recovering master will have to ask the other master(s) which one is more up to date. If the recovering master has a log entry of having excluded all other masters in its quorum from the cluster, it can come back online without asking anybody. If there is no such entry, it must ask the other master(s). If all had failed at the exact same instant, none has an entry of the other(s) being excluded and all will know that they are in the same state since any update to one would also have been sent to the other(s).
When a server instance fails, its permanent storage may or may not survive. Especially with mirrored disks, storage most often survives a failure. However, the survival of the database does not depend on any single server instance retaining any permanent storage over failure. If storage is left in place, as in the case of an OS reboot or replacing a faulty memory chip, rejoining the cluster is done based on the existing copy of the database on the server instance. If there is no existing copy, a copy can be taken from any surviving member of the same quorum. This consists of the following steps: First, a log checkpoint is forced on the surviving instance. Normally, log checkpoints are done at regular intervals, independently on each server instance. The log checkpoint writes a consistent state of the database to permanent storage. The disk pages forming this consistent image will not be written to until the next log checkpoint. Therefore, copying the database file is safe and consistent as long as a log checkpoint does not take place between the start and end of the copy; thus checkpoints are disabled right after the initial checkpoint. The copy can take a relatively long time; consider 20s per gigabyte on a 1GbE network a good day. At the end of the copy, checkpoints are re-enabled on the surviving cluster node. The recovering database starts without a log, sees the timestamp of the checkpoint in the database, and asks for transactions from just before this time up to the present. The recovery then proceeds as outlined above.
The CAP theorem states that Consistency, Availability, and Partition-tolerance cannot all be guaranteed at once. "Partition" here means a split of the network.
It is trivially true that if the network splits so that on both sides there is a copy of each partition of the data, both sides will think themselves the live copy left online after the other died, and each will thus continue to accumulate updates. Such an event is not very probable within one site where all machines are redundantly connected to two independent switches. Most servers have dual 1GbE on the motherboard, and both ports should be used for cluster interconnect for best performance, with each attached to an independent switch. Both switches would have to fail in such a way as to split their respective network for a single-site network split to happen. Of course, the likelihood of a network split in multi-site situations is higher.
One way of guarding against network splits is to require that at least one partition of the data have all copies online. Additionally, the master on duty can request each cluster node or server instance it expects to be online to connect to every other node or instance, and to report which they could reach. If the reports differ, there is a network problem. This procedure can be performed using both interfaces or only the first or second interface of each server to determine if one of the switches selectively blocks some paths. These simple sanity checks protect against arbitrary network errors. Using TCP for inter-cluster-node communication in principle protects against random message loss, but the Virtuoso cluster protocols do not rely on this. Instead, there are protocols for retry of any transaction messages and for using keep-alive messages on any long-running functions sent across the cluster. Failure to get a keep-alive message within a certain period will abort a query even if the network connections look OK.
For a constantly-operating distributed system, it is hard to define what exactly constitutes a consistent snapshot. The checkpointed state on each cluster node is consistent as far as this cluster node is concerned (i.e., it contains no uncommitted data), but the checkpointed states on all the cluster nodes are not from exactly the same moment in time. The complete state of a cluster is the checkpoint state of each cluster node plus the current transaction log of each. If the logs were shipped in real time to off-site storage, a consistent image could be reconstructed from them. Since such shipping cannot be synchronous due to latency considerations, some transactions could be received only in part in the event of a failure of the off-site link. Such partial transactions can however be detected at reconstruction time because each record contains the list of all participants of the transaction. If some piece is found missing, the whole can be discarded. In this way integrity is guaranteed but it is possible that a few milliseconds worth of transactions get lost. In these cases, the online client will almost certainly fail to get the final success message and will recheck the status after recovery.
For business continuity purposes, a live feed of transactions can be constantly streamed off-site, for example to a cloud infrastructure provider. One low-cost virtual machine on the cloud will typically be enough for receiving the feed. In the event of long-term loss of the whole site, replacement servers can be procured on the cloud; thus, capital is not tied up in an aging inventory of spare servers. The cloud-based substitute can be maintained for the time it takes to rebuild an owned infrastructure, which is still at present more economical than a cloud-only solution.
Switching a cluster from an owned site to the cloud could be accomplished in a few hours. The prerequisite of this is that there are reasonably recent snapshots of the database files, so that replay of logs does not take too long. The bulk of the time taken by such a switch would be in transferring the database snapshots from S3 or similar to the newly provisioned machines, formatting the newly provisioned virtual disks, etc.
Rehearsing such a maneuver beforehand is quite necessary for predictable execution. We do not presently have a productized set of tools for such a switch, but can advise any interested parties on implementing and testing such a disaster recovery scheme.
In conclusion, we have shown how we can have strong transactional guarantees in a database cluster without single points of failure or performance penalties when compared with a non fault-tolerant cluster. Operator intervention is not required for anything short of hardware failure. Recovery procedures are simple, at most consisting of installing software and copying database files from a surviving cluster node. Unless permanent storage is lost in the failure, not even this is required. Real-time off-site log shipment can easily be added to these procedures to protect against site-wide failures.
Future work may be directed toward concurrent operation of geographically-distributed data centers with eventual consistency. Such a setting would allow for migration between sites in the event of whole-site failures, and for reconciliation between inconsistent histories of different halves of a temporarily split network. Such schemes are likely to require application-level logic for reconciliation and cannot consist of an out-of-the-box DBMS alone. All techniques discussed here are application-agnostic and will work equally well for Graph Model (e.g., RDF) and Relational Model (e.g., SQL) workloads.
Based on some feedback from the field, we decided to make this feature more user friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developer nor operating personnel needs any knowledge of how things actually work.
So I will here make a few high level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.
Three types of individuals need to know about fault tolerance: those who decide whether to deploy it, those who configure and operate it, and those who develop applications against it.
I will explain the matter to each of these three groups:
The value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less since you get the whole money's worth of read throughput and half the money's worth of write throughput. Since most applications are about reading, this is a good deal. You do not end up paying for unused capacity.
Server instances are grouped in "quorums" of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.
The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please contact us for specifics.
Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.
To configure, divide your server instances into quorums of 2 or 3, according to which will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These things are declared in a configuration file. Table definitions do not have to be altered for fault tolerance. It is enough for tables and indices to specify partitioning. Use two switches, and two NICs per machine, and connect one of each server's network cables to each switch, to cover switch failures.
When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention; if there is a broken host, then remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start. If the disks are gone, then copy the database files from the live copy. Finally start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying.
Having mirrored disks in individual hosts is optional since data will anyhow be in two copies. Mirrored disks will shorten the vulnerability window of running a partition on a single server instance since this will for the most part eliminate the need to copy many (hundreds) of GB of database files when recovering a failed instance.
An application can connect to any server instance in the cluster and have access to the same data, with full ACID properties.
There are two types of errors that can occur in any database application: The database server instance may be offline or otherwise unreachable; and a transaction may be aborted due to a deadlock.
For the missing server instance, the application should try to reconnect. An ODBC/JDBC connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed.
For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (SQL State 40001) as best practices dictate, there is no change needed.
In summary...
All the above applies to both the Graph Model (RDF) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please contact OpenLink Software Sales for details of availability or for getting advance evaluation copies.
I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech and leave management to those who have time for it.
When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.
In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples; for example, my wife's grandmother decided to transform a regular willow into a weeping one by tying down its branches. Such "magic," needless to say, takes constant maintenance; otherwise the spell breaks.
To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow.
Now, in more specific terms, what can we realistically expect to teach about computer science?
Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.
Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.
Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.
I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.
Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.
LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.
Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.
I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:
The edge is created in the "Wild West" — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."
So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.
But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.
Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.
Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.
The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.
Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.
Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.
So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?
Data integration. Given heterogenous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.
Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.
Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.
Lightweight knowledge engineering. The experience of building expert systems in the early days of AI is not the best possible, but with semantics attached to data, some sort of rule system seems all but inevitable. Rule systems will merge into the DBMS in time. Some ability to work with these, short of building expert systems, will be desirable.
Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.
Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.
Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.
The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.
We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.
If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.
The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered.
Performance, I have said before, is a matter of locality and parallelism. So we applied both to the otherwise quite boring exercise of loading RDF. The recipe is this: Take a large set of triples; resolve the IRIs and literals into their IDs; then insert each index of the triple table on its own thread. All the lookups and inserts are first sorted in key order to get the locality. Running the indices in parallel gets the parallelism. Then run the parser on its own thread, fetching chunks of consecutive triples and queueing them for a pool of loader threads. Then run several parsers concurrently on different files so as to make sure there is work enough at all times. Do not make many more process threads than available CPU threads, since they would just get in each other's way.
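To make the shape of this pipeline concrete, here is a minimal, purely illustrative Python sketch; the per-index threads, queues, and ID dictionary below are toy stand-ins, not Virtuoso internals, and the five key orders anticipate the default index scheme described below.

import bisect, queue, threading

# Toy stand-ins for the real structures: one sorted list per key order.
INDEX_ORDERS = ["PSOG", "POGS", "SP", "OP", "GS"]
indices = {order: [] for order in INDEX_ORDERS}
ids = {}                      # IRI/literal -> integer ID
id_lock = threading.Lock()

def resolve(term):
    # Resolve an IRI or literal to its numeric ID.
    with id_lock:
        return ids.setdefault(term, len(ids))

def permute(quad, order):
    s, p, o, g = quad
    by_pos = {"S": s, "P": p, "O": o, "G": g}
    return tuple(by_pos[c] for c in order)

def index_worker(order, q):
    # One insert thread per index; batches arrive pre-sorted in key order,
    # so consecutive inserts land on neighboring pages (locality).
    while True:
        batch = q.get()
        if batch is None:
            return
        for key in batch:
            bisect.insort(indices[order], key)

def load_chunk(triples, queues, graph="http://example.org/graph"):
    # Parser side: resolve IDs, then queue one sorted batch per index.
    quads = [tuple(resolve(t) for t in (s, p, o)) + (resolve(graph),)
             for s, p, o in triples]
    for order, q in queues.items():
        q.put(sorted(permute(quad, order) for quad in quads))

queues = {order: queue.Queue() for order in INDEX_ORDERS}
workers = [threading.Thread(target=index_worker, args=(o, q)) for o, q in queues.items()]
for w in workers: w.start()
load_chunk([("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:name", '"B"')], queues)
for q in queues.values(): q.put(None)
for w in workers: w.join()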
The whole process is non-transactional, starting from a checkpoint and ending with a checkpoint.
The test system was a dual-Xeon 5520 with 72G RAM. The Virtuoso was a single server; no cluster capability was used.
We loaded the English DBpedia, 179M triples, in 15 minutes, for a rate of 198 Kt/s. Uniprot, with 1.33G triples, loaded in 79 minutes, for 279 Kt/s.
The source files were the DBpedia 3.4 English files and the Bio2RDF copy of Uniprot, both in Turtle syntax. The uniref, uniparc, and uniprot files from the Bio2RDF set were sliced into smaller chunks so as to have more files to load in parallel; the taxonomy file was loaded as is; no other Bio2RDF files were loaded. Both experiments ran with 8 load streams, 1 per core. CPU utilization was mostly between 1400% and 1500%, i.e., 14-15 of 16 CPU threads busy. Top load speed over a 2-minute measurement window was 383 Kt/s.
The index scheme for RDF quads was the default Virtuoso 6 configuration of 5 indices — GS, SP, OP, PSOG, and POGS. (We call this "3+2" indexing, because there are 3 partial and 2 full indices, delivering massive performance benefits over most other index schemes.) IRIs and literals reside in their own tables, each indexed from string to ID and vice versa. A full-text index on literals was not used.
Compared to previous performance, we have more than tripled our best single server multi-stream load speed, and multiplied our single stream load speed by a factor of 8. Some further gains may be reached by adjusting thread counts and matching vector sizes to CPU cache.
This will be available in a forthcoming release; it is not yet available for download. Now that you know this, you may guess what we are doing with queries. More on this another time.
Lots of smart people together. The meeting was hosted by Ontotext and chaired by Dieter Fensel. On the database side we had Ontotext, SYSTAP (Bigdata), CWI (MonetDB), Karlsruhe Institute of Technology (YARS2/SWSE). LarKC was well represented, being our hosts, with STI, Ontotext, CYC, and VU Amsterdam. Notable absences were Oracle, Garlik, Franz, and Talis.
Now, on to semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, or whatever you call them?
I had last fall a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?
Michael Stonebraker and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and RDF storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)
OpenLink Software and Virtuoso are in agreement with both sides, contradictory as this might sound. We took our RDBMS and added RDF-specific data types, structures, and cost-model alterations to the existing platform. Oracle did the same. MonetDB is considering doing this, and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that the changes will be small and not in the kernel.
I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF's run time typing needs.
So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.
After the initial meeting at CWI, I tried to figure out what the difference was between the databaser and semanticist minds. Really, the two are close, but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.
So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what's been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.
Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is: nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification lie in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would otherwise need.
Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.
I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.
Bryan Thompson of SYSTAP/Bigdata asks whether it is not a contradiction in terms to want both pluggability and inference merged into the data, as LarKC would have it. I say that this is difficult but not impossible: just as, when you run joins in a cluster database, you decide based on the data where the next join step will go, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you will have, data being bigger and bigger. And if you have reuse of intermediates and demand-driven indexing à la MonetDB, this too integrates and applies to inference results.
So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.
Aside from inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.
What do we need for this? We need close-to-parity with relational — doing your warehouse in RDF with the attendant agility thereof can't cost 10x more to deploy than the equivalent relational solution.
We also want to tell the key-value, anti-SQL people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of ACID (at least consistent read), availability, complex query, and large scale.
And to do this, we need a benchmark. It needs to differentiate online queries and browsing from analytics, graph algorithms, and the like. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a TPC-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with the mainstream and we will cross another qualitative threshold.
The ability to use distributed queries -- i.e., to issue SQL queries against any OLE-DB-accessible back end -- via Linked Servers.
The promise fails to materialize, primarily because while there are several ways of issuing such distributed queries, none of them work with all data access providers, and even for those that do, results received via different methods may differ.
Compounding the issue, there are specific configuration options which must be set correctly, often differing from defaults, to permit such things as "ad-hoc distributed queries".
Common tools that are typically used with such Linked Servers include SSIS and DTS. Such generic tools typically rely on four-part naming for their queries, expecting SQL Server to properly rewrite remotely executed queries for the DBMS engine which ultimately executes them.
The most common cause of failure is that when SQL Server rewrites a query, it typically does so using SQL-92 syntax, regardless of the back-end's abilities, and using the Transact-SQL dialect for implementation-specific query syntaxes, regardless of the back-end's dialect. This leads to problems especially when the Linked Server is an older variant which doesn't support SQL-92 (e.g., Progress 8.x or earlier, Informix 7 or earlier), or whose SQL dialect differs substantially from Transact-SQL (e.g., Informix, Progress, MySQL, etc.).
SELECT *
FROM linked_server.[catalog].[schema].object
Four-part naming presumes that you have pre-defined a Linked Server, and executes the query on SQL Server. SQL Server decides what if any sub- or partial-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.
SELECT *
FROM OPENQUERY ( linked_server , 'query' )
OpenQuery also presumes that you have pre-defined a Linked Server, but executes the query as a "pass-through", handing it directly to the remote provider. Features of the remote server and the data access provider may be taken advantage of, but only if the query author knows about them.
SQL Server's Linked Server extension executes the specified pass-through query on the specified linked server. This server is an OLE DB data source. OPENQUERY can be referenced in the FROM clause of a query as if it were a table name. OPENQUERY can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement. This is subject to the capabilities of the OLE DB provider. Although the query may return multiple result sets, OPENQUERY returns only the first one....

OPENQUERY does not accept variables for its arguments. OPENQUERY cannot be used to execute extended stored procedures on a linked server. However, an extended stored procedure can be executed on a linked server by using a four-part name.
SELECT *
FROM OPENROWSET
( 'provider_name' ,
'datasource' ; 'user_id' ; 'password',
{ [ catalog. ] [ schema. ] object | 'query' }
)
OpenRowset does not require a pre-defined Linked Server, but does require the user to know what data access providers are available on the SQL Server host, and how to manually construct a valid connection string for the chosen provider. It does permit both "pass-through" and "local execution" queries, which can lead to confusion when the results differ (as they regularly will).
Includes all connection information that is required to access remote data from an OLE DB data source. This method is an alternative to accessing tables in a linked server and is a one-time, ad hoc method of connecting and accessing remote data by using OLE DB. For more frequent references to OLE DB data sources, use linked servers instead. For more information, see Linking Servers. The OPENROWSET function can be referenced in the FROM clause of a query as if it were a table name. The OPENROWSET function can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement, subject to the capabilities of the OLE DB provider. Although the query might return multiple result sets, OPENROWSET returns only the first one.

OPENROWSET also supports bulk operations through a built-in BULK provider that enables data from a file to be read and returned as a rowset....

OPENROWSET can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access. When accessing remote OLE DB data sources, the login identity of trusted connections is not automatically delegated from the server on which the client is connected to the server that is being queried. Authentication delegation must be configured. For more information, see Configuring Linked Servers for Delegation.

Catalog and schema names are required if the OLE DB provider supports multiple catalogs and schemas in the specified data source. Values for catalog and schema can be omitted when the OLE DB provider does not support them. If the provider supports only schema names, a two-part name of the form schema.object must be specified. If the provider supports only catalog names, a three-part name of the form catalog.schema.object must be specified. Three-part names must be specified for pass-through queries that use the SQL Server Native Client OLE DB provider. For more information, see Transact-SQL Syntax Conventions (Transact-SQL).

OPENROWSET does not accept variables for its arguments.
SELECT *
FROM OPENDATASOURCE
( 'provider_name',
'provider_specific_datasource_specification'
).[catalog].[schema].object
As with basic four-part naming, OpenDataSource executes the query on SQL Server. SQL Server decides what if any sub-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.
Provides ad hoc connection information as part of a four-part object name without using a linked server name.

...

OPENDATASOURCE can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access.

The OPENDATASOURCE function can be used in the same Transact-SQL syntax locations as a linked-server name. Therefore, OPENDATASOURCE can be used as the first part of a four-part name that refers to a table or view name in a SELECT, INSERT, UPDATE, or DELETE statement, or to a remote stored procedure in an EXECUTE statement. When executing remote stored procedures, OPENDATASOURCE should refer to another instance of SQL Server. OPENDATASOURCE does not accept variables for its arguments.

Like the OPENROWSET function, OPENDATASOURCE should only reference OLE DB data sources that are accessed infrequently. Define a linked server for any data sources accessed more than several times. Neither OPENDATASOURCE nor OPENROWSET provide all the functionality of linked-server definitions, such as security management and the ability to query catalog information. All connection information, including passwords, must be provided every time that OPENDATASOURCE is called.
The ability to link objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.
There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.
All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.
Ability to use distributed queries over a generic connectivity gateway (HSODBC, DG4ODBC) -- i.e., to issue SQL queries against any ODBC- or OLE-DB-accessible linked back end.
Promise fails to materialize for several reasons. Immediate limitations include:
Tables in a query with a FOR UPDATE clause and all tables with LONG columns selected by the query must be located in the same external database.

REF datatypes on remote tables are not supported.

In addition to the above, which apply to database-specific heterogeneous environments, the database-agnostic generic connectivity components have the following limitations:

A table including a BLOB column must have a separate column that serves as a primary key.

BLOB and CLOB data cannot be read by passthrough queries.

Updates or deletes that include unsupported functions within a WHERE clause are not allowed.

Updating LONG columns with bind variables is not supported.

There is no support for ROWIDs.

Compounding the issue, the HSODBC and DG4ODBC generic connectivity agents perform many of their functions by brute-force methods. Rather than interrogating the data access provider (whether ODBC or OLE DB) or DBMS to which they are connected to learn their capabilities, many things are done by using the lowest possible function.
For instance, when a SELECT COUNT (*) FROM table@link is issued through Oracle SQL, the target DBMS is not simply asked to perform a SELECT COUNT (*) FROM table. Rather, Oracle issues a SELECT * FROM table, which is used to inventory all columns in the table, and then issues and fully retrieves a SELECT field FROM table into an internal temporary table, where it performs the COUNT (*) itself, locally. Testing has confirmed this to be the case despite Oracle documentation stating that target data sources must support COUNT (*) (among other functions).
The Virtuoso Universal Server will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.
There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.
All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.
The geometry support is for both SQL and SPARQL. On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with RDF, a geometry can occur as the object of a quad. If the object is a typed literal of the virtrdf:Geometry type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed. After this, SQL MM predicates and functions can be used with SPARQL, like this:
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?class COUNT (*)
WHERE
  {
    ?m geo:geometry ?geo .
    ?m a ?class .
    FILTER ( <bif:st_intersects> ( ?geo, <bif:st_point> (0, 52), 100 ) )
  }
GROUP BY ?class
ORDER BY DESC 2
This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London.
For any data set with WGS 84 geo:long and geo:lat values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the geo:geometry property of the subject carrying the long/lat. This then enables fast spatial access to arbitrary location data in RDF.
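As a rough, hypothetical illustration of the same idea outside the server (this is not the actual Virtuoso SQL function), one could materialize such point geometries with rdflib; the virtrdf namespace URI used for the datatype below is an assumption.

from rdflib import Graph, Literal, Namespace

GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
VIRTRDF = Namespace("http://www.openlinksw.com/schemas/virtrdf#")  # assumed namespace for virtrdf:Geometry

g = Graph()
g.parse("input.ttl", format="turtle")

for s in set(g.subjects(GEO.lat, None)):
    lat, lng = g.value(s, GEO.lat), g.value(s, GEO.long)
    if lat is None or lng is None:
        continue
    # A WKT point literal typed as virtrdf:Geometry, so that it lands in the geometry index.
    wkt = Literal(f"POINT({float(lng)} {float(lat)})", datatype=VIRTRDF.Geometry)
    g.add((s, GEO.geometry, wkt))

g.serialize(destination="with_points.ttl", format="turtle")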
Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities. As these get adopted we will support them.
For scalability, we tried the implementation with OpenStreetMap's 350 million or so points. The geometry implementation partitions well over a cluster, similarly to a full text index, i.e., every server has its slice of the geometries, partitioned by the geometry object's key, thus not by range of coordinates or such. Like this, the items are evenly spread even though the coordinate distribution is highly uneven.
We can do spatial joins like —
SELECT ?s ( <sql:num_or_null> (?p) ) COUNT (*)
WHERE
  {
    ?s <http://dbpedia.org/ontology/populationTotal> ?p .
    FILTER ( <sql:num_or_null> (?p) > 1000000 ) .
    ?s geo:geometry ?geo .
    FILTER ( <bif:st_intersects> ( ?pt, ?geo, 5 ) ) .
    ?xx geo:geometry ?pt
  }
GROUP BY ?s ( <sql:num_or_null> (?p) )
ORDER BY DESC 3
LIMIT 20
This takes the DBpedia subjects that have a population over 1 million and a geometry. We then count all the geometries within 5 km of the point location of the first geometry. With DBpedia (about 5 million points), GeoNames (7 million points), and OpenStreetMap (350 million points), we get the result:
http://dbpedia.org/resource/Munich                        1356594   117280
http://dbpedia.org/resource/London                        7355400    81486
http://dbpedia.org/resource/Davao_City                    1363337    58640
http://dbpedia.org/resource/Belo_Horizonte                2412937    58640
http://dbpedia.org/resource/Chengde                       3610000    58640
http://dbpedia.org/resource/Hamburg                       1769117    51664
http://dbpedia.org/resource/San_Diego%2C_California       1266731    47685
http://dbpedia.org/resource/Bursa                         1562828    47685
http://dbpedia.org/resource/Port-au-Prince                1082800    47685
http://dbpedia.org/resource/Oakland_County%2C_Michigan    1194156    45636
http://dbpedia.org/resource/Sana%27a                      1747627    40923
http://dbpedia.org/resource/Milan                         1303437    40923
http://dbpedia.org/resource/Campinas                      1059420    40923
http://dbpedia.org/resource/Hohhot                        2580000    40923
http://dbpedia.org/resource/Brussels                      1031215    40923
http://dbpedia.org/resource/Bogra_District                2988567    40923
http://dbpedia.org/resource/Cort%C3%A9s_Department        1202510    40923
http://dbpedia.org/resource/Berlin                        3416300    35668
http://dbpedia.org/resource/New_York_City                 8274527    30810
http://dbpedia.org/resource/Los_Angeles%2C_California     3849378    25614
20 Rows. -- 1733 msec.
Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s 664% cpu 2% read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs
This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm cache. Fair enough for a first crack; this can obviously be optimized further. Still, the geo part of the processing is already as good as instantaneous.
We will shortly have the geography features installed on DBpedia and the other data sets we host. As these come online we will show more demo queries.
For more about SQL/MM, you can look to a couple of PDFs:
Since the questionnaire is public, I am publishing my answers below.
Data and data types
What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?
Private data warehouses of corporations have more than doubled yearly in recent years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.
This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.
Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.
The volume of the well-known linked data sets is around 10 billion statements. With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is especially so if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.
Relevant sections of this mass of data are a potential addition to any present or future analytics application.
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.
Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.
By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.
What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?
All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.
Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
Interleaving of all database functions and types becomes increasingly important.
Industries, communities
Who is producing these data and why? Could they do it better? How?
Right now, projects such as Bio2RDF, Neurocommons, and DBpedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBpedia URIs).
Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
Who is consuming these data and why? Could they do it better? How?
Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird's eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.
Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.
What industrial sectors in Europe could become more competitive if they became much better at managing data?
Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.
Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?
The regulation landscape drives database demand through data retention requirements and the like.
With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.
For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.
What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.
We have come across the following:
Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.
Services, software stacks, protocols, standards, benchmarks
What combinations of components are needed to deal with these problems?
Recent times have seen a proliferation of special-purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is a need to gather the currently separate functionality into an integrated system with sufficient flexibility. We see some of this in the integration of map-reduce and scale-out databases; the former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are examples of DBMSs featuring work in this direction.
Interoperability and at least de facto standards in ways of doing this will emerge.
What data exchange and processing mechanisms will be needed to work across platforms and programming languages?
HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.
There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.
For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.
What data environments are today so wastefully messy that they would benefit from the development of standards?
RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.
Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.
Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS-* standards, for example.
What kind of performance is expected or required of these systems? Who will measure it reliably? How?
Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.
The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.
These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.
We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.
The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.
Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H, but the TPC-H rules do not allow official reporting of such results.
Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.
If benchmarks exist and are neither too easy, nor inaccessibly difficult, nor too expensive to run — think of the high-end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: everybody runs the TPC workloads internally, but few publish.
Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.
Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.
Usability and training
How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?
In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.
Beyond these, programming with MPI, for example, with good utilization of a cluster platform for an arbitrary algorithm is quite difficult. The casual amateur is hereby warned.
There is no single solution. Since explicit, programmatic parallelization with, say, MPI scales very poorly in terms of the skill required, we should favor declarative and/or functional approaches that lend themselves to automatic parallelization.
Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.
For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.
For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.
This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.
One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.
This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.
How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?
For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it and are even reluctant to engage in cross-camp interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, the web of linked data, and similar factors.
RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, where possible, be deployed in the usual technology stack, LAMP or Java. Of course they do not have to be LAMP or Java or HTML or HTTP themselves, but they must manifest through these.
A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.
For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.
The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.
As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research-project use cases to gain an extended life past the end of the projects, possibly being maintained, multiplied, and spun off.
If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.
The problem is creating the pull.
Challenges
What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?
The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.
Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.
The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.
The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.
What should one do to set up such a challenge, administer, and monitor it?
The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.
The party with the use case would benefit by having, one hopes, the best teams work on it. The contestants would benefit from having real-world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams, possibly involving many organizations; for example, development and first-line services and support could come from different companies, along a systems-integrator model such as is widely used in the US.
There may be a good benchmark by that time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases, or in TREC for example, has taken some years to reach a point of slowdown.
Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.
"The universe of cycles is not exactly one of literal cycles, but rather one of spirals," mused Joe Hellerstein of UC Berkeley.
"Come on, let's all drop some ACID," interjected another.
"It is not that we end up repeating the exact same things, rather even if some patterns seem to repeat, they do so at a higher level, enhanced by the experience gained," continued Joe.
Thus did the Web Scale Data Management panel conclude.
Whether successive generations are made wiser by the ones that have gone before may be argued either way.
The cycle in question was that of developers discovering ACID in the 1960s, i.e., Atomicity, Consistency, Isolation, Durability. Thus did the DBMS come into being. Then DBMSs kept becoming more complex until, as there will be a counter-force to each force, came the meme of key-value stores and BASE: no multiple-row transactions, eventual consistency, no query language, but scaling to thousands of computers. So now, the DBMS community asks itself what went wrong.
In the words of one panelist, another demonstrated a "shocking familiarity with the subject matter of substance abuse" when he called for the DBMS community to get on a 12-step program and to look at where addiction to certain ideas, ACID among them, had brought it. Look at yourself: The influential papers in what ought to be your space by rights are coming from the OS community: Google Bigtable, Amazon Dynamo, want more? When you ought to drive, you give excuses and play catch-up! Stop denial, drop SQL, drop ACID!
The web developers have revolted against the time-honored principles of the DBMS. This is true. Sharded MySQL is not the ticket — or is it? Must they rediscover the virtues of ACID, just like the previous generation did?
Nothing under the sun is new. As in music and fashion, trends keep cycling also in science and engineering.
But seriously, does the full-featured DBMS scale to web scale? Microsoft says the Azure version of SQL server does. Yahoo says they want no SQL but Hadoop and PNUTS.
Twitter, Facebook, and other web names got their own discussion. Why do they not go to serious DBMS vendors for their data but make their own, like Facebook with Hive?
Who can divine the mind of the web developer? What makes them go to memcached, manually sharded MySQL, and MapReduce, walking away from the 40 years of technology invested in declarative query and ACID? What is this highly visible but hard to grasp entity? My guess is that they want something they can understand, at least at the beginning. A DBMS, especially on a cluster, is complicated, and it is not so easy to say how it works and how its performance is determined. The big brands, if deployed on a thousand PCs, would also be prohibitively expensive. But if all you do with the DBMS is single row selects and updates, it is no longer so scary, but you end up doing all the distributed things in a middle layer, and abandoning expressive queries, transactions, and database-supported transparency of location. But at least now you know how it works and what it is good/not good for.
This would be the case for those who make a conscious choice. But by and large the choice is not deliberate; it is something one drifts into: The application gains popularity; the single LAMP box can no longer keep everything in memory; you add a second MySQL to the stack and decide that users A–M go left and N–Z go right (horizontal partitioning). This siren of sharding beckons you and all is good until you hit the reef of re-architecting. Memcached and duct tape help, like aspirin helps with a hangover, but the root cause of the headache lies unaddressed.
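A hypothetical sketch of that split, just to make the mechanism concrete; the server names and routing rule are invented, not taken from any real deployment.

SHARDS = {"left": "mysql://db-left/app", "right": "mysql://db-right/app"}

def shard_for(user_name: str) -> str:
    # Users A-M on one MySQL instance, N-Z (and everything else) on the other.
    first = user_name[:1].upper()
    return SHARDS["left"] if "A" <= first <= "M" else SHARDS["right"]

print(shard_for("alice"), shard_for("nadia"))  # mysql://db-left/app mysql://db-right/app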
The conclusion was that there ought to be something incrementally scalable from the get-go. Low cost of entry and built-in scale-out. No, the web developers do not hate SQL; they just have gotten the idea that it does not scale. But they would really wish it to. So, DBMS people, show there is life in you yet.
Joe Hellerstein was the philosopher and paradigmatician of the panel. His team had developed a protocol-compatible Hadoop in a few months using a declarative logic programming style approach. His claim was that developers made the market. Thus, for writing applications against web scale data, there would have to be data centric languages. Why not? These are discussed in Berkeley Orders Of Magnitude (BOOM).
I come from Lisp myself, way back. I have since abandoned any desire to tell anybody what they ought to program in. This is a bit like religion: Attempting to impose or legislate or force it on somebody just results in anything from lip service to rejection to war. The appeal exerted by the diverse language/paradigm-isms on their followers seems to be based on hitting a simplification of reality that coincides with a problem in the air. MapReduce is an example of this. PHP is another. A quick fix for a present need: Scripting web servers (PHP) or processing tons of files (MapReduce). The full database is not as quick a fix, even though it has many desirable features. It is also not as easy to tell what happens inside one, so MapReduce may give a greater feeling of control.
Totally self-managing, dynamically-scalable RDF would be a fix for not having to design or administer databases: Since it would be indexed on everything, complex queries would be possible; no full database scans would stop everything. For the mid-size segment of web sites this might be a fit. For the extreme ends of the spectrum, the choice is likely something custom built and much less expressive.
The BOOM rule language for data-centric programming would be something very easy for us to implement, in fact we will get something of the sort essentially for free when we do the rule support already planned.
The question is, can one induce web developers to do logic? The history is one of procedures, both in LAMP and MapReduce. On the other hand, the query languages that were ever universally adopted were declarative, i.e., keyword search and SQL. There certainly is a quest for an application model for the cloud space beyond just migrating apps. We'll see. More on this another time.
Dynamic scale, wide area replication, and high availability are the issues. Transactions on multiple records, complex queries, and absolute consistency at all times are traded off. Also, the programming interfaces are lower level than with SQL. Replication and consistency rules are choices for the application developer; the platform offers some basic alternatives. Implementation-wise, there is a MySQL back-end and all the partitioning, query routing, replication, and balancing take place in a layer of front-ends.
Now what do we say to this?
In the Yahoo! case, even if complex queries were possible, which they are not, one would probably keep them off the online system, since latency and availability are everything there. A latency of some tens of milliseconds is however acceptable, which is not so hard a target for single-record operations: there is time for a couple of messages on the data center network and maybe even for a disk read.
PNUTS is probably the fastest way of getting to the desired beachhead of simple access to data at infinite scale in multiple geographies. In the identical situation, I might have done something similar.
But we are in a different situation, concerned with complex queries, a highly-normalized schema-last situation, i.e., index on everything, large objects normalized away, as is done in RDF. Then we are also in the relational situation. Infinite scale, fault tolerance, and wide-area replication do come up regularly in user needs. The applications for which people would like RDF are not only complex reasoning things but very big metadata stores for user generated content, social networks, and the like.
Which of the PNUTS principles could we apply?
Division in tablets: When a partition of the data grows too big, it should split.
Migration of partitions: as capacity/demand change, partitions should migrate so as to equalize load.
High availability: This is divided in two — on one hand inside the data center; on the other between data centers. Inside the data center, storing partitions in duplicate and running them synchronously is possible. This is manifestly impossible in wide area settings, though. For this, we need a log-shipping style of asynchronous replication. But how does one deal with split networks and transfer of replication mastery?
PNUTS determines the master copy record by record. This makes sense when the record, for example, corresponds to a user. For RDF, doing this by the triple would be prohibitive. Doing this by the graph, or by the subject of a set of triples across all graphs, would be better. We would agree with PNUTS that transferring mastery by the storage chunk is not desired, as the chunk will contain arbitrary unrelated data.
The eventual consistency mechanisms can be generalized to RDF readily enough. In a social RDF application, the graph is the most likely unit of data ownership and update authorization, so the graph would also be the unit of eventual consistency. Keeping a separate data structure listing recent inserts/deletes to a graph with timestamps would serve for establishing consistency. The size of this would be a small fraction of the size of the graph.
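A minimal sketch of what such a per-graph change log might look like; this is purely illustrative, not Virtuoso or PNUTS code.

import time
from collections import defaultdict

class GraphChangeLog:
    # Per-graph log of recent, timestamped inserts/deletes; a replica that was
    # last synced at some timestamp replays everything newer to catch up.
    def __init__(self):
        self.entries = defaultdict(list)          # graph IRI -> [(ts, op, triple)]

    def record(self, graph, op, triple):
        assert op in ("insert", "delete")
        self.entries[graph].append((time.time(), op, triple))

    def changes_since(self, graph, ts):
        return [e for e in self.entries[graph] if e[0] > ts]

    def apply_to(self, graph, ts, triple_set):
        # Replay newer changes onto a replica's triple set in timestamp order.
        for _, op, triple in sorted(self.changes_since(graph, ts)):
            (triple_set.add if op == "insert" else triple_set.discard)(triple)
        return triple_set

log = GraphChangeLog()
log.record("urn:g1", "insert", ("ex:s", "ex:p", "ex:o"))
replica = log.apply_to("urn:g1", ts=0, triple_set=set())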
RDF cannot do anything without joining between partitions, whereas for PNUTS the join between partitions is an application matter. But then PNUTS does have an extra step of RPC between the PNUTS infrastructure and the back-end. Doing query routing in the back-end gets rid of this. RDF does remain more dependent on even performance and short interconnect latencies, though. It also likely takes more space. But the essential consistency and availability features can be generalized to it, providing the merge of semi-structured data at infinite scale and availability with complex query.
At any rate, repartitioning-on-demand and partition-migration remain the key agenda items for us, confirmed over and over at VLDB.
Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding optimization, as has pretty much everybody else.
It is true that the rules encourage unrealistic configurations. The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of data, just so there are enough disk arms in parallel. Stonebraker also pointed out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long. Benchmarks should therefore include replication.
Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites. Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases. They avoid them when they can. They want arrays for physics, and graphs for biology and chemistry. MapReduce is eating the database's lunch; what will you do about this?
I later suggested incorporating an RDF metadata benchmark into the TPC suite. We'll see about this; we'll first have to come up with a suitable one. There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover.
TPC's own talk was about the life cycle of benchmarks. A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon. When the solution to this problem becomes commonplace, the benchmark's relevance gradually drops.
There was a talk on robustness of query plans which was well to the point. Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, when switching from indexed access to table scans, or from loop joins to hash joins. Quite so. The analysis I really would have liked to see would have been one of what happens when passing from a single server to a cluster, and from local joins to cross-partition ones, as well as a contrast of cache fusion and partitioning. We have our own data and experience, but we find we don't have time to measure all the other systems.
Anyway it is good to raise the question of smooth and predictable performance.
Intel and Oracle had measured hash and sort merge joins on Intel Core i7. The result was that hash join with both tables partitioned to match CPU cache was still the best but that sort/merge would catch up with more SIMD instructions in the future.
We should probably experiment with this but the most important partitioning of hash joins is still between cluster nodes. Within the process, we will see. The tradeoff of doing all in cache-sized partitions is larger intermediate results which in turn will impact the working set of disk pages in RAM. For one-off queries this is OK; for online use this has an effect.
SAP presented a paper about federating relational databases. Queries would be expressed against VIEWs defined over remote TABLEs, UNIONed together and so forth. Traditional methods of optimization would run out of memory; a single 1000 TABLE plan is already a big thing. Enumerating multiple variations of such is not possible in practice. So the solution was to plan in two stages — first arrange the subqueries and derived TABLEs, and then do the JOIN orders locally. Further, local JOIN orders could even be adjusted at run time based on the actual data. Nice.
Oracle presented some new SQL optimizations, combining and inlining subqueries and derived TABLEs. We do fairly similar things and might extend the repertoire of tricks in the direction outlined by Oracle as and when the need presents itself. This further confirms that SQL and other query optimization is really an incremental collection of specially recognized patterns. We still have not found any other way of doing it.
Another interesting piece by Oracle was about their re-implementation of large object support, where they compared LOB loading to file system and raw device speeds.
There was a paper about a memory-resident database that could give steady time for any kind of single-table scan query. The innovation was to not use indices, but to have one partition of the table per processor core, all in memory. Then each core would have exactly two cursors — one reading, the other writing. The write cursor should keep ahead of the read cursor. Like this, there would be no read/write contention on pages, no locking, no multiple threads splitting a tree at different points, none of the complexity of a multithreaded database engine. Then, when the cursor would hit a row, it would look at the set of queries or updates and add the result to the output if there was a result. The data indexes the queries, not the other way around. We have done something similar for detecting changes in a full text corpus but never thought of doing queries this way.
Well, we are all about JOINs so this is not for us, but it deserves a mention for being original and clever. And indeed, anything one can ask about a table will likely be served with great predictability.
Google's chief economist said that the winning career choice would be to pick a scarce skill that made value from something that was plentiful. For the 2010s this career is that of the statistician/data analyst. We've said it before — the next web is analytics for all. The Greenplum talk was divided between the Fox use case (200TB of data about ads, web site traffic, and other things, growing by 5TB a day) and the technical side of the product. The message of the use case was that cubes and drill-down are passé, that it is about complex statistical methods that have to run in the database, and that the new kind of geek is the data geek, whose vocation it is to consume and spit out data, discover things in it, and so forth.
The technical part was about Greenplum, a SQL database running on a cluster with a PostgreSQL back-end. The interesting points were embedding MapReduce into SQL, and using relational tables for arrays and complex data types — pretty much what we also do. Greenplum emphasized scale-out and found column orientation more like a nice-to-have.
The MonetDB people from CWI in Amsterdam gave a 10-year best paper award talk about optimizing databases for CPU cache. The key point was that if data is stored as columns, it ought also to be transferred as columns inside the execution engine. Materialize big chunks of state to cut down on interpretation overhead and use cache to best effect. They vector for CPU cache; we vector for scale-out, since the only way to ship operations is to ship many at a time. So we might as well vector also in single servers. This could be worth an experiment. We also regularly visit the topic of column storage, but we are not yet convinced that it would be better than row-style covering indices for RDF quads. Something could certainly be tried, given time.
Firstly, RDF was as good as absent from the presentations and discussions we saw. There were a few mentions in the panel on structured data on the web; however, RDF was not in any way seen to be essential for this. There were also a couple of RDF mentions in questions at other sessions, but that was about it.
It is a common perception that RDF and database people do not talk with each other. Evidence seems to bear this out.
As a database developer I did get a lot of readily applicable ideas from the VLDB talks. These run across the whole range of DBMS topics, from key compression and SQL optimization, to column storage, CPU cache optimization, and the like. In this sense, VLDB is directly relevant to all we do. In a conversation, someone was mildly confused that I should on one hand mention I was doing RDF, and on the other hand also be concerned about database performance. These things are not seen to belong together, even though making RDF do something useful certainly depends on a great deal of database optimization.
The question of all questions — that of infinite scale-out with complex queries, resilience, replication, and full database semantics — was strongly in the air.
But it was in the air more as a question than as an answer. Not very much at all was said about the performance of distributed query plans, of 2pc (two-phase commit), of the impact of interconnect latency, and such things. On the other hand, people were talking quite liberally about optimizing CPU cache and local multi-core execution, not to mention SQL plans and rewrites. Also, almost nothing was said about transactions.
Still, there is bound to be a great deal of work in scale-out of complex workloads by any number of players. Either these things are all figured out and considered self-evidently trivial, or they are so hot that people will go there only by way of allusion and vague reference. I think it is the latter.
By and large, we were confirmed in our understanding that infinite scale-out on the go, with redundancy, is the ticket, especially if one can offer complex queries and transactional semantics coupled with instant data loading and schema-last.
Column storage and cache optimizations seem to come right after these.
Certainly the database space is diversifying.
MapReduce was discussed quite a bit, as an intruder into what would be the database turf. We have no great problem with MapReduce; we do that in SQL procedures if one likes to program in this way. Greenplum also seems to have come by the same idea.
As said before, RDF and RDF reasoning were ignored. Do these actually offer something to the database side? Certainly for search, discovery, integration, and resource discovery, linked data has evident advantages.
Two points of the design space — the warehouse, and the web-scale key-value store — got a lot of attention. Would I do either in RDF? RDF is a slightly different design space point, like key-value with complex queries — on the surface, a fusion of the two. As opposed to RDF, the relational warehouse gains from fixed data-types and task-specific layout, whether row or column. The key-value store gains from having a concept of a semi-structured record, a bit like the RDF subject of a triple, but now with ad-hoc (if any) secondary indices, and inline blobs. The latter is much simpler and more compact than the generic RDF subject with graphs and all, and can be easily treated as a unit of version control and replication mastering. RDF, being more generic and more normalized, is representationally neither as ad-hoc nor as compact.
But RDF will be the natural choice when complex queries and ad-hoc schema meet, for example in web-wide integrations of application data.
There seems to be a huge divide in understanding between database-developing people and those who would be using databases. On one side, this has led to a back-to-basics movement with no SQL, no ACID, key-value pairs instead of schema, MapReduce instead of fancy but hard-to-follow parallel execution plans. On the other side, the database space specializes more and more; it is no longer simply transactions vs. analytics, but many more points of specialization.
Some frustration can be sensed in the ivory towers of science when it is seen that the ones most in need of database understanding in fact have the least. Google, Yahoo!, and Microsoft know what they are doing, with or without SQL, but the medium-size or fast-growing web sites seem to be in confusion when LAMP or Ruby or the scripting-du-jour can no longer cut it.
Can somebody using a database be expected to understand how it works? I would say no, not in general. Can a database be expected to unerringly self-configure based on workload? Sure, a database can suggest layouts, but it ought not restructure itself on the spur of the moment under full load.
It is safe to say that the community at large no longer believes in "one size fits all". Since there is no general solution, there is a fragmented space of specific solutions. We will be looking at some of these issues in the following posts.
RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.
For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level.
If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph.
On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.
One could write queries like —
SELECT ?pub
WHERE
  {
    GRAPH ?g
      {
        ?person foaf:knows ?contact
      }
    ?contact foaf:name "Alice" .
    ?g xx:has_publisher ?pub
  }
This would return the publishers of graphs that assert that somebody knows Alice.
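For concreteness, here is a sketch of how such a graph-level annotation might be written using Virtuoso's SPARUL extension through the SQL interface; the graph IRIs and the xx: namespace below are placeholders, not anything shipped with the product:

SPARQL
PREFIX xx: <http://example.org/schema#>
INSERT INTO GRAPH <http://example.org/annotations>
  {
    <http://example.org/graphs/catalogue-42> xx:has_publisher <http://example.org/publishers/acme> .
  };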
Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.
If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.
Virtuoso's RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G, S, P, O.
Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G, S, P, O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.
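As a rough sketch of the dependent-column variant, the alteration and a later annotation could look something like the following; the added column names and the example IRIs are invented for illustration, and the exact DDL may differ by version:

-- Add dependent annotation columns to the quad table (names are invented).
ALTER TABLE DB.DBA.RDF_QUAD ADD INS_USER IRI_ID;
ALTER TABLE DB.DBA.RDF_QUAD ADD INS_TIME DATETIME;

-- Annotate one quad, identified by its G, S, P, O;
-- iri_to_id () maps the text form of an IRI to its internal ID.
UPDATE DB.DBA.RDF_QUAD
   SET INS_USER = iri_to_id ('http://example.org/users/alice'),
       INS_TIME = now ()
 WHERE G = iri_to_id ('http://example.org/graphs/catalogue-42')
   AND S = iri_to_id ('http://example.org/people/bob')
   AND P = iri_to_id ('http://xmlns.com/foaf/0.1/knows')
   AND O = iri_to_id ('http://example.org/people/carol');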
In SPARQL, one could use an extension syntax like —
SELECT *
WHERE
  {
    ?person foaf:knows ?connection
      OPTION ( time ?ts ) .
    ?connection foaf:name "Alice" .
    FILTER ( ?ts > "2009-08-08"^^xsd:dateTime )
  }
This would return everybody whose foaf:knows link to Alice carries a timestamp more recent than 2009-08-08. This presupposes that the quad table has been extended with a datetime column.
The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause.
If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like —
SELECT *
WHERE
  {
    ?person foaf:knows ?contact
      OPTION ( reify ?r ) .
    ?r xx:assertion_time ?ts .
    ?contact foaf:name "Alice" .
    FILTER ( ?ts > "2008-08-08"^^xsd:dateTime )
  }
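For concreteness, a minimal sketch of such a side table; the table name is invented, and the column types simply mirror those of the quad table:

-- Hypothetical side table keyed on the quad, with the reification IRI
-- as the only dependent column.
CREATE TABLE DB.DBA.RDF_QUAD_REIF
  (
    G  IRI_ID,
    S  IRI_ID,
    P  IRI_ID,
    O  ANY,
    R  IRI_ID,
    PRIMARY KEY (G, S, P, O)
  );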
We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don't think there would be huge demand. Who knows? You tell us.
In answer, we will here explain how we do search engine style processing by writing SPARQL. There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce.
The point is that what used to require programming can often be done in a generic query language. The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit. But by combining these two things, we are a step closer to the web being the database.
I will here show how we do some joins combining full text, RDF conditions, aggregates, and ORDER BY. The sample task is finding the top 20 entities with New York in some attribute value. Then we specify the search further by only taking actors associated with New York. The results are returned in the order of a composite of entity rank and text match score.
The basic query is:
SELECT
  ( <sql:s_sum_page>
      ( <sql:vector_agg>
          ( <bif:vector> ( ?c1 , ?sm , ?g1 ) ),
        <bif:vector> ( 'NEW', 'YORK' )
      )
  ) AS ?res
WHERE
  {
    {
      SELECT
        ( <SHORT_OR_LONG::> ( ?s1 ) ) AS ?c1 ,
        ( <sql:S_SUM>
            ( <SHORT_OR_LONG::IRI_RANK> ( ?s1 ) ,
              <SHORT_OR_LONG::> ( ?s1textp ) ,
              <SHORT_OR_LONG::> ( ?o1 ) ,
              ?sc
            )
        ) AS ?sm ,
        <SHORT_OR_LONG::> ( ?g ) AS ?g1
      WHERE
        {
          QUAD MAP virtrdf:DefaultQuadMap
            {
              graph ?g
                {
                  ?s1 ?s1textp ?o1
                  . ?o1 <bif:contains> '( NEW AND YORK )'
                      OPTION ( SCORE ?sc )
                }
            }
        }
      ORDER BY DESC
        ( <sql:sum_rank>
            (( <sql:S_SUM>
                 ( <SHORT_OR_LONG::IRI_RANK> ( ?s1 )
                 , <SHORT_OR_LONG::> ( ?s1textp )
                 , <SHORT_OR_LONG::> ( ?o1 )
                 , ?sc
                 )
            ))
        )
      LIMIT 20
      OFFSET 0
    }
  }
This takes some explaining. The basic part is
{
  ?s1 ?s1textp ?o1
  . ?o1 <bif:contains> '( NEW AND YORK )'
      OPTION ( SCORE ?sc )
}
This just makes tuples where ?s1 is the object (entity), ?s1textp the property (attribute), and ?o1 the literal (value) which contains the strings NEW and YORK. For a single ?s1, there can of course be many properties which all contain NEW and YORK.
The rest of the query gathers all the "New York" containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities.
After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between "New York" and the strings containing "New York". The text hit score is higher if the words repeat often and in close proximity.
The S_SUM function is a user-defined aggregate (borrowed from the FCT application; subject to change over time) which takes 4 arguments: the rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score. These are grouped by the subject of the triple. After this, these are sorted by sum_rank of the aggregate constructed with S_SUM. The sum_rank (also borrowed from the FCT application; subject to change over time) is a SQL function combining the entity rank with the text scores of the different literals.
This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple. The text index entries of an object are stored in the same partition as the object. But the entity rank is a property of the subject and is partitioned by the subject. Also the GROUP BY is by the subject. Thus the data is produced from all partitions, then streamed into the receiving partitions, determined by the subject. This partition can then get the score and group the matches by the subject. Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top k sort can be done for each partition separately. Finally, the top 20 of each partition are merged into the global top 20. This is then passed to a final function s_sum_page (also borrowed from the FCT application; subject to change over time) that turns this all into an XML fragment that can be processed with XSLT for inclusion on a web page.
This differs from the text search engine in that the query pipeline can contain arbitrary cross-partition joins. Also, the string "New York" is a common label that occurs in many distinct entities. Thus one text match, to a single document containing only the string "New York", will yield many entities, likely all from different partitions.
So, if we only want actors with a mention of "New York", we need to get the inner part of the query as:
{
  ?s1 ?s1textp ?o1
  . ?o1 <bif:contains> ' ( NEW AND YORK ) '
      OPTION ( SCORE ?sc )
  . ?s1 a <http://umbel.org/umbel/sc/Actor>
}
Whether an entity is an actor can be checked in the same partition as the rank of the entity. Thus the query plan gets this check right before getting the rank. This is natural since there is no point in getting the rank of something that is not an actor.
The <SHORT_OR_LONG::sql:func> notation means that we call func, which is a SQL stored procedure with the arguments in their internal form. Thus, if a variable bound to an IRI is passed, the SHORT_OR_LONG specifies that it is passed as its internal ID and is not converted into its text form. This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end.
Now, when we run this on a collection of 4.5 billion triples of linked data, once we have the working set, we can get the top 20 "New York" occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy. (The hardware is two boxes with two quad-core Xeon 5345 each.)
If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy. This gets about 200K "New York" strings, which becomes about 400K entities with New York somewhere, for which a rank then has to be retrieved. After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups. These are then sorted. This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort.
By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions. There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously. In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous.
Doing JOINs between partitions and partitioned GROUP BY/ORDER BY is pretty regular database stuff. Applying this to RDF is a most natural thing.
If we do not parallelize the user-defined aggregate for grouping all the "New York" occurrences, the query takes 8s instead of 1.1s. If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we'd have to either bring all the data to a central point before the top k, which would destroy performance, or write procedures with explicit parallel procedure calls, which is hard to write and surely too hard for ad hoc queries.
Results of live execution of this query may not be complete on initial load, as this link includes a "Virtuoso Anytime" timeout of 10 seconds. Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above.

Engineering matters. If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints. If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.
The load rate is now 160,739 triples-per-second.
| | Virtuoso 6 (previous run) | Virtuoso 6 (new run) | Virtuoso 6 (newest run) |
|---|---|---|---|
| blades | 1 | 1 | 2 |
| processors | 2 x Xeon 5410 | 2 x Xeon 5520 | 2 x Xeon 5520 + 2 x Xeon 5410, with 1x1GigE interconnect |
| memory | 16G 667 MHz | 72G 1333 MHz | 72G 1333 MHz + 16G 667 MHz, respectively |
| reported load rate (triples-per-second) | 110,532 | 160,739 | 214,188 |
Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.
The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.
The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.
The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.
We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.
Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.
The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache.
All times reported are real times, starting from the start of the first client and ending with the completion of the last client.
Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.
| Test | Platform | Load (seconds) | Run (seconds) | GHz / cores / threads |
|---|---|---|---|---|
| 1 | Amazon EC2 Extra Large (4 virtual cores) | 340 | 42 | 1.2 GHz? / 4 / 1 |
| 1 | Amazon EC2 Extra Large (4 virtual cores) | 305 | 43.3 | 1.2 GHz? / 4 / 1 |
| 2 | 1 x dual-core AMD 5900 | 263 | 58.2 | 2.9 GHz / 2 / 1 |
| 3 | 2 x dual-core Xeon 5130 ("Woodcrest") | 245 | 35.7 | 2.0 GHz / 4 / 1 |
| 4 | 2 x quad-core Xeon 5410 ("Harpertown") | 237 | 18.0 | 2.33 GHz / 8 / 1 |
| 5 | 2 x quad-core Xeon 5520 ("Nehalem") | 162 | 18.3 | 2.26 GHz / 8 / 2 |
We tried two different EC2 instances to see if there would be variation; the variation was quite small. The tested EC2 instance costs 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8G. The 3 Xeon configurations are Supermicro boards, with 667MHz memory for the Xeon 5130 ("Woodcrest") and Xeon 5410 ("Harpertown"), and 800MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 ("Nehalem"), 72 GB RAM, and 8 x 500 GB SATA disks.
Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., memory subsystem. We'll see another time with faster memories.
The operating systems were various 64 bit Linux distributions.
We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.
We tried some RDF operations on the two last systems:
| operation | Harpertown | Nehalem |
|---|---|---|
| Build text index for DBpedia | 1080s | 770s |
| Entity Rank iteration | 263s | 251s |
Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.
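To show what we mean by the Fibonacci test, here is a sketch in Virtuoso/PL; the procedure is just a throwaway CPU burner written for illustration, not part of any shipped code:

-- Naive recursive Fibonacci, used only to keep a thread busy in cache.
CREATE PROCEDURE FIB (IN n INTEGER)
{
  IF (n < 2)
    RETURN n;
  RETURN FIB (n - 1) + FIB (n - 2);
};

-- Run 8 or 16 of these concurrently from separate sessions, e.g.:
-- SELECT FIB (30);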
For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.
Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.
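A hedged sketch of the kind of check we mean, assuming the GSPO/OGPS index layout of the time; the index names and the TABLE OPTION hint are written from memory and may differ in detail:

-- For each quad read in GSPO order, check that the same quad is found
-- via the OGPS index, which lives in a different partition.
SELECT COUNT (*)
  FROM DB.DBA.RDF_QUAD a TABLE OPTION (INDEX RDF_QUAD)
 WHERE EXISTS
   ( SELECT 1
       FROM DB.DBA.RDF_QUAD b TABLE OPTION (INDEX RDF_QUAD_OGPS)
      WHERE b.O = a.O AND b.G = a.G AND b.P = a.P AND b.S = a.S );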
| n | cpu% | rows per second |
|---|---|---|
| 1 query | 503 | 906,413 |
| 2 queries | 1263 | 1,578,585 |
| 3 queries | 1204 | 1,566,849 |

| n | cpu% | rows per second |
|---|---|---|
| 1 query | 652 | 799,293 |
| 2 queries | 1266 | 1,486,710 |
| 3 queries | 1222 | 1,484,093 |

| n | cpu% | rows per second |
|---|---|---|
| 1 query | 648 | 1,041,448 |
| 2 queries | 708 | 1,124,866 |
The CPU percentages are as reported by the OS: user + system CPU divided by real time.
So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache — 12 MB vs 8 MB.
We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, also tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.
The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform memory architecture. We would expect this to increase performance since we have many equal-size processes with even load.
The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got was at 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory, else the best case will not be over 50% over a 667 MHz Core 2-based Xeon.
Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.
If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.
The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.
By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks — one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant information overload.
Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus, and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts tweeting and re-tweeting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.
There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don't get lost in it.
There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.
I participated in discussions on security and privacy and on mobile social networks and context.
For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.
There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can't be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic's Data Patrol should be a part of the social web infrastructure of the future.
People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.
In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.
For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one's location at the granularity of the city; for some other purposes, one would say which conference room one is in.
Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.
There is a thin line between convenience and having IT infrastructure rule one's life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.
There was quite a bit of talk about what web science could or ought to be. I will here comment a bit on the panels and keynotes, in no special order.
In the web science panel, Tim Berners-Lee said that the deliverable of the web science initiative could be a way of making sense of all the world's data once the web had transformed into a database capable of answering arbitrary queries.
Michael Brodie of Verizon said that one deliverable would be a well considered understanding of the issue of counter-terrorism and civil liberties: Everything, including terrorism, operates on the platform of the web. How do we understand an issue that is not one of privacy, intelligence, jurisprudence, or sociology, but of all these and more?
I would add to this that it is not only a matter of governments keeping and analyzing vast amounts of private data, but of basically anybody who wants to do this being able to do so, even if at a smaller scale. In a way, the data web brings formerly government-only capabilities to the public, and is thus a democratization of intelligence and analytics. The citizen blogger increased the accountability of the press; the citizen analyst may have a similar effect. This is trickier though. We remember Jefferson's words about vigilance and the price of freedom. But vigilance is harder today, not because information is not there but because there is so much of it, with diverse spins put on it.
Tim B-L said at another panel that it seemed as if the new capabilities, especially the web as a database, were coming just in time to help us cope with the problems confronting the planet. With this, plus having everybody online, we would have more information, more creativity, more of everything at our disposal.
I'd have to say that the web is dual use: The bulk of traffic may contribute to distraction more than to awareness, but then the same infrastructure and the social behaviors it supports may also create unprecedented value and in the best of cases also transparency. I have to think of "For whosoever hath, to him shall be given." [Matthew 13:12] This can mean many things; here I am talking about whoever hath a drive for knowledge.
The web is both equalizing and polarizing: The equality is in the access; the polarity in the use made thereof. For a huge amount of noise there will be some crystallization of value that could not have arisen otherwise. Developments have unexpected effects. I would not have anticipated that gaming should advance supercomputing, for example.
Wendy Hall gave a dinner speech about communities and conferences; how the original hypertext conferences, with lots of representation of the humanities, became the techie WWW conference series; and how now we have the pendulum swinging back to more diversity with the web science conferences. So it is with life. Aside from the facts that there are trends and pendulum effects, and that paths that cross usually cross again, it is very hard to say exactly how these things play out.
At the "20 years of web" panel, there was a round of questions on how different people had been surprised by the web. Surprises ranged from the web's actual scalability to its rapid adoption and the culture of "if I do my part, others will do theirs." On the minus side, the emergence of spam and phishing were mentioned as unexpected developments.
Questions of simplicity and complexity got a lot of attention, along with network effects. When things hit the right simplicity at the right place (e.g., HTML and HTTP, which hypertext-wise were nothing special), there is a tipping point.
No barrier of entry, not too much modeling, was repeated quite a bit, also in relation to semantic web and ontology design. There is a magic of emergent effects when the pieces are simple enough: Organic chemistry out of a couple of dozen elements; all the world's information online with a few tags of markup and a couple of protocol verbs. But then this is where the real complexity starts — one half of it in the transport, the other in the applications, yet a narrow interface between the two.
This then begs the question of content- and application-aware networks. The preponderance of opinion was for separation of powers — keep carriers and content apart.
Michael Brodie commented in the questions to the first panel that simplicity was greatly overrated, that the world was in fact very complex. It seems to me that any field of human endeavor develops enough complexity to fully occupy the cleverest minds who undertake said activity. The life-cycle between simplicity and complexity seems to be a universal feature. It is a bit like the Zen idea that "for the beginner, rivers are rivers and mountains are mountains; for the student these are imponderable mysteries of bewildering complexity and transcendent dimension; but for the master these are again rivers and mountains." One way of seeing this is that the master, in spite of the actual complexity and interrelatedness of all things, sees where these complexities are significant and where not, and knows to communicate concerning these as fits the situation.
There is no fixed formula for saying where complexities and simplicities fit; relevance of detail is forever contextual. For technological systems, we find that there emerge relatively simple interfaces on either side of which there is huge complexity: the x86 instruction set, TCP/IP, SQL, to name a few. These are lucky breaks; it is very hard to say beforehand where they will emerge. Object-oriented people would like to see such interfaces everywhere, which just leads to problems of modeling.
There was a keynote from Telefonica about infrastructure. We heard that the power and cooling cost more than the equipment, that data centers ought to be scaled down from the football stadium and 20 megawatt scale, that systems must be designed for partitioning, to name a few topics. This is all well accepted. The new question is whether storage should go into the network infrastructure. We have blogged that the network will be the database, and it is no surprise that a telco should have the same idea, just with slightly different emphasis and wording. For Telefonica, this is about efficiency of bulk delivery, for us this is more about virtualized query-able dataspaces. Both will be distributed but issues of separation of powers may keep the two roles of network with storage separate.
In conclusion, the network being the database was much more visible and accepted this year than last. The linked data web was in Tim B-L's keynote as it was in the opening speech by the Prince of Asturias.
There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.
Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.
If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application-specific data structures as RDF applies equally to relational data in non-Virtuoso databases, because Virtuoso SQL can federate tables from heterogeneous DBMS.
On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like.
Installation is simple, just one exe and one config file. There is a huge amount of code in installers — application code and test suites and such — but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.
Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL.
To condense further:
There was a workshop on semantic search plus a number of papers and of course keynotes from Google and Yahoo.
A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft) or should they be made more generally available? This is a privacy question. Use of query logs and click through of search results for improved ranking was mentioned many times throughout the conference.
The semantic search workshop was largely about benchmarks for keyword search in information retrieval. For linked data, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. But there is a divide in approach.
Giovanni Tummarello presented Sig.ma, a service using Sindice's RDF index for collecting all RDF statements about entities matching some set of keywords. One could then choose which sources and which entities were the right ones. One could further store such a query and embed it on a page. The point was that the filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Further speculating, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources in our browser too. Fresnel lenses are another thing to look at.
There was a paper by Josep M. Pujol and Pablo Rodriguez, of Telefonica Research, about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they were retrieved. For users with similar interests, this can give a community based ranking model but has issues of privacy. Another point was that with local processing and personal scale data volumes various kinds of brute force processing were feasible that would cost a lot for the web scale. Much can be done web scale but it must be done cleverly, not with a shell script and not so ad hoc.
As a counterpoint to this, there was a talk about Hadoop and Hive, a map-reduce-based SQL-like framework. One could do an SQL GROUP BY on text files with record parsing at run time, all spread over a Hadoop cluster. The issue is, if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing becomes important. This cannot be done without indices and complex query optimization, and needs a DBMS. Stonebraker and company are fully justified in their critique of map reduce. It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.
Some of our future plans were confirmed by what we saw, for example as concerns:
These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts, we will partner with others, details will be worked out while we work on the items we implement by ourselves.
We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.
The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF.
To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.
People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.
Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.
We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code with special parallel programming models, can now be done in SQL and SPARQL.
To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider.
Entry costs into relatively high-end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today's retail prices and fits under a desk. For intermittent use, the rent for 1TB RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. As Giovanni Tummarello put it for airplanes, the physics are as they were for da Vinci, but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.
A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.
I answered as follows, which apparently cannot be repeated too much:
First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo, for databases, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogeneous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent.
Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.
There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated "on the fly" as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0's built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: Use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entity LOD Cloud. Virtuoso 6.0 (Open Source Edition) "TP1" is now publicly available as a Technology Preview (beta).
We heard that there is an effort for porting Freebase's Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.
As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is "virtualized" into the database cloud or the local secure server, as the use case may require.
For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG.
Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.
For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.
It has been said many times — when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers.
Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of write speed.
With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.
There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.
In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.
This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application's requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.
For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.
The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.
Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.
When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice go via two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled — but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.
For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts, but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the surviving copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the surviving copy, but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.
If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.
As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.
The only way to make failures totally invisible is to have everything in duplicate, provisioned so that the system never runs at more than half of total capacity. This is often neither economical nor necessary, which is why we do better by using the spare capacity for more than standby.
Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.
The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage.
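As an illustration only (this is a hypothetical helper, not a Virtuoso API), the routing rule amounts to a tiny function from ID to preferred copy; the 64K partition size, the 32K split, and the two-copy assumption are taken from the example above.

create procedure PREFERRED_COPY (in id integer)
{
  -- Each 64K-ID partition is assumed to be stored in two full copies.
  -- IDs in the lower half of the range prefer copy 1, the upper half copy 2,
  -- so the two copies develop disjoint working sets.
  if (mod (id, 65536) < 32768)
    return 1;
  return 2;
};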
Of course, if there is a failure, the working set will degrade, but if this does not happen often or for long, it can be quite tolerable. The alternative expense is buying twice as much RAM, which likely means twice as many servers. This workload is memory intensive, so servers should have as much memory as they can hold without moving to parts so expensive that doubling the memory costs as much as a new server.
When loading data, the system is online in principle, but query response can be quite bad. A large RDF load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server's partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference.
But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.
Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.
This technique applies to all data intensive background tasks — calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.
Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all the others, so the members of the group are interchangeable. Thus, if there is four servers' worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, since it would have to fall on the one server whose pair is already down. If groups consist of three servers, for a total of 12, the first two failures are guaranteed not to interrupt operations; for the third, there is a 1/10 chance that it will.
We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers' RAM when in normal operation.
There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon's Dynamo does something similar at the database level. The analogies are not exact, though.
If data is partitioned in this manner, for example into 1K slices, each stored in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations, but the second probably will: the slices formerly hosted by the failed server have their second copies spread randomly over the remaining servers, so a second failure is likely to take out the last copy of some slice. This scheme equalizes load better but is less resilient.
Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.
The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.
For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.
Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.
Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one's own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes.
The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user's department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.
I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer's capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.
Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.
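As a hedged sketch of what this means for the query processor (the graph IRIs and the calendar vocabulary here are invented for illustration), a query would simply be evaluated against only the graphs the caller is entitled to see:

sparql
PREFIX cal: <http://example.com/ns/calendar#>
SELECT ?meeting ?start
FROM <http://example.com/data/alice/public-calendar>
FROM <http://example.com/data/team/shared-calendar>
WHERE
  { ?meeting a cal:Meeting ;
             cal:start ?start . };

The policy layer only has to decide which FROM clauses (graphs) to admit; it never has to reason about individual triples.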
It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner's social network. Defining such policies is typically a task for the application developer.
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.
The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one's disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.
On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this.
A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.
The cost for hosting a trillion triples will vary radically in function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies in function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data.
Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world's spectrum of query needs.
The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.
We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.
While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.
Historically, our industry has been driven by two phenomena:
To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.
Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It's in the air.
Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.
We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.
We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.
Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere.
This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G of RAM, a dual-core AMD x86-64, and 4 disks may cost about $700; 2 x 4-core Xeons with 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low-end boards (128 of the 8G boards comes to roughly $90K).
The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSD's get a bit cheaper, one could have SSD for the database and disk for backup.
With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.
For operating such systems, scaling-on-demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self administrating since these things are quite complex and easily intractable if one does not have in depth understanding of this rather complex field.
The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. Much cheaper than SAN's and Infiniband fabrics.
The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.
Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.
The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed in Semantic Web conferences. They have to do with the complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess, but not impossible. The details that remain to be worked out are nothing semantic; they are hardcore database matters, concerning automatic provisioning and the like.
It is as if the Semantic Web people look with envy at the Web 2.0 side, where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements, as we near production with our next generation of software.
The term DataSphere comes from Dan Simmons' Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.
This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.
I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person's or organization's contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it.
There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.
These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.
The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.
In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.
The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network — horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.
The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.
For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can't just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google's web page cache can't be relied on as a permanent repository of web content.
Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.
For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user's web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user's Data Space is the collection of all these things, now presented in a queriable form. The user's Data Space is the user's statement of presence, referencing the diverse contributions of the user on diverse sites.
The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user's interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes.
Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of "meta-applications," i.e., unified interfaces for managing many related applications/tasks.
Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.
Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of "mashing up" web service interfaces with procedural code, but of "meshing" data through declarative queries across application spaces.
The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.
Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.
For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, given that pretty much anything will in practice have many names (e.g., DBpedia vs. Freebase identifiers)? The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.
The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.
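For instance, finding where an identifier is used is a plain SPARQL lookup over graphs; a minimal sketch, with the DBpedia URI standing in for whatever term is being investigated:

sparql
SELECT DISTINCT ?g
WHERE
  { GRAPH ?g { ?s ?p <http://dbpedia.org/resource/Berlin> } };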
The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.
For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.
Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.
One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.
Next we will look at the service provider and infrastructure side of this.
This update adds more explanation and some comments on how we rank entities.
We continue enhancing our hosting of the Linked Open Data (LOD) cloud at http://lod.openlinksw.com.
We have now added result ranking for both text and URIs. Text hit scores are based on word frequency and proximity; URI scores are based on link density.
We calculate each URI's rank by adding up references to it, weighing each reference by the score of the referrer, much as in web search. Each iteration of the ranking joins every referenced URI to each of its referrers. We do about 1.2 million such joins per second, across partitions, over 2.2 billion triples and 400M distinct subjects, without any great optimization, just using SQL stored procedures and partitioned function calls; a sort of SQL map-reduce. It would run over twice as fast if it were all in C, but this is adequate for now. The more interesting part will be tuning the scoring based on the type of the link. This is something web search engines cannot do as well, since document links are untyped.
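A hedged sketch of the unweighted first step, written as a plain SPARQL aggregate (the graph IRI is illustrative; the production ranking runs as partitioned SQL procedures, but its input is essentially this reference count per URI):

sparql
SELECT ?referred (COUNT (*) AS ?refs)
FROM <http://lod.openlinksw.com/data>
WHERE
  { ?referrer ?p ?referred .
    FILTER ( isIRI (?referred) ) }
GROUP BY ?referred
ORDER BY DESC (?refs)
LIMIT 10;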
We are moving toward a decent user interface for the LOD hosting, including offering ready-made domain-specific queries, e.g., biomedical.
Things like "URI finding with autocomplete" are done and just have to be put online.
With linked data, there is the whole question of identifier choice. We will have a special page just for this. There we show reference statistics, synonyms declared by owl:sameAs, synonyms determined by shared property values, etc. In this way we become a terminology lookup service.
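A minimal sketch of the owl:sameAs part of such a lookup (the DBpedia URI is just an example term):

sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?synonym
WHERE
  { { <http://dbpedia.org/resource/Berlin> owl:sameAs ?synonym }
    UNION
    { ?synonym owl:sameAs <http://dbpedia.org/resource/Berlin> } };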
Copies of the LOD cluster system are available for evaluators, on a case by case basis. We will make this publicly available on EC2 also in not too long.
Otherwise, we continue working on productization, primarily things like reliability and recovery. One exercise is running TPC-C with intentionally stupid partitioning, so that almost all joins and deadlocks are distributed. Then we simulate a cluster interconnect that drops messages now and then, sometimes kill server processes, and still keep full ACID properties. Cloud capable, also in bad weather.
The open source release of Virtuoso 6 (no cluster) is basically ready to go, mostly this is a question of logistics.
I will talk about these things in greater individual detail next week.
The system is intermittently live, with DBpedia on one instance and a LOD Cloud data collection of about 2 billion triples on another. We will give out the links once we have tested a bit more.
The present activity is all about testing Virtuoso 6 for release, cluster and otherwise.
The old problem has been that it is not really practical to pre-compute counts of everything for all possible combinations of search conditions and counting/grouping/sorting, and computing the actual matches at query time takes time.
Well, neither is in fact necessary. When there are large numbers of items matching the conditions, counting them can take time but then this is the beginning of the search, and the user is not even likely to look very closely at the counts. It is enough to see that there are many of one and few of another. If the user already knows the precise predicate or class to look for, then the top-level faceted view is not even needed. The faceted view for guiding search and precise analytics are two different problems.
There are client-side faceted views like Exhibit or our own ODE. The problem with these is that there are a few orders of magnitude difference between the actual database size and what fits on the user agent. This is compounded by the fact that one does not know what to cache on the user agent because of the open nature of the data web. If this were about a fixed workflow, then a good guess would be possible — but we are talking about the data web, the very soul of serendipity and unexpected discovery.
So we made a web service that will do faceted search on arbitrary RDF. If it does not get complete results within a timeout, it will return what it has counted so far, using Virtuoso's Anytime feature. Looking for subjects with some specific combination of properties is however a bit limited, so this will also do JOINs. Many features are one or two JOINs away; take geographical locations or social networks, for example.
Yet a faceted search should be point-and-click, and should not involve a full query construction. We put the compromise at starting with full text or property or class, then navigating down properties or classes, to arbitrary depth, tree-wise. At each step, one can see the matching instances or their classes or properties, all with counts, faceted-style.
This is good enough for queries like 'what do Harry Potter fans also like' or 'who are the authors of articles tagged semantic web and machine learning and published in 2008'. For complex grouping, sub-queries, arithmetic or such, one must write the actual query.
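To make the second example concrete, the following is roughly the SPARQL such a facet path could expand to; the vocabulary (Dublin Core properties and rdfs:label for tags) is an assumption for illustration, not necessarily what the service emits:

sparql
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?author
WHERE
  { ?article dct:subject ?t1 ;
             dct:subject ?t2 ;
             dc:date     ?date ;
             dc:creator  ?author .
    ?t1 rdfs:label "semantic web" .
    ?t2 rdfs:label "machine learning" .
    FILTER ( ?date >= "2008-01-01"^^xsd:date
          && ?date <  "2009-01-01"^^xsd:date ) };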
But one can begin with facets, and then continue refining the query by hand since the service also returns SPARQL text. We made a small web interface on top of the service with all logic server side. This proves that the web service is usable and that an interface with no AJAX, and no problems with browser interoperability or such, is possible and easy. Also, the problem of syncing between a user-agent-based store and a database is entirely gone.
If we are working with a known data structure, the user interface should choose the display by the data type and offer links to related reports. This is all easy to build as web pages or AJAX. We show how the generic interface is done in Virtuoso PL, and you can adapt that or rewrite it in PHP, Java, JavaScript, or anything else, to accommodate use-case specific navigation needs such as data format.
The web service takes an XML representation of the search, which is more restricted and easier to process by machine than the SPARQL syntax. The web service returns the results, the SPARQL query it generated, whether the results are complete or not, and some resource use statistics.
The source of the PL functions, Web Service and Virtuoso Server Page (HTML UI) will be available as part of Virtuoso 6.0 and higher. A Programmer's Guide will be available as part of the standard Virtuoso Documentation collection, including the Virtuoso Open Source Edition Website.
Sir Tim said it at WWW08 in Beijing: linked data and the linked data web is the semantic web and the Web done right.
The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.
The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft's Entity Frameworks technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service based business model and should interoperate and integrate with everything else — business needs first; schema last.
Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information is no less important an asset for that; on the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.
It is against this backdrop that this year will play out.
As concerns research, I will again quote Harry Halpin at ESWC 2008: "Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality" [or words to this effect].
Something like the data web, and even the semantic web, will happen. Harry's question was whether this would be the descendant of what is today called semantic web research.
I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though.
Why should such a thing be RDF-based? If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?
The justification of RDF is in reusing and linking to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put it all inside the RDBMS, at the lowest level, which you can't do unless you own the RDBMS. Source access is not enough; you also need the knowledge.
Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.
Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction.
In this light, we welcome voiD (the Vocabulary of Interlinked Datasets), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.
For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords.
For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect.
For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.
For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI's Líon project. These will provide platforms for further demonstrating the "web" in data web, as in web-scale smart databasing.
2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one's own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time.
Exciting 2009 to all.
Q: What is the storage cost per triple?
Q: What is the cost to insert a triple?
Q: What is the cost to delete a triple? (For the deletion itself, as well as for updating any indices)
Q: What is the cost to search on a given property?
Q: What data types are supported?
Q: What inferencing is supported?
Q: Is the inferencing dynamic or is an extra step required before inferencing can be used?
Q: Do you support full text search?
Q: What programming interfaces are supported? Do you support the standard SPARQL protocol?
Q: How can data be partitioned across multiple servers?
Q: How many triples can a single server handle?
Q: What is the performance impact of going from a billion to a trillion triples?
Q: Do you support additional metadata for triples, such as timestamps, security tags, etc.?
Q: Should we use RDF for our large metadata store? What are the alternatives?
Q: How multithreaded is Virtuoso?
Q: Can multiple servers run off a single shared disk database?
Q: Can Virtuoso run on a SAN?
Q: How does Virtuoso join across partitions?
Q: Does Virtuoso support federated triple stores? If there are multiple SPARQL end points, can Virtuoso be used to do queries joining between these?
Q: How many servers can a cluster contain?
Q: How do I reconfigure a cluster, adding and removing machines, etc.?
Q: How will Virtuoso handle regional clusters?
Q: Is there a mechanism for terminating long running queries?
Q: Can the user be asynchronously notified when a long running query terminates?
Q: How many concurrent queries can Virtuoso handle?
Q: What is the relative performance of SPARQL queries vs. native relational queries?
Q: Does Virtuoso support property tables?
Q: What performance metrics does Virtuoso offer?
Q: What support do you provide for concurrency/multithreading operation? Is your interface thread-safe?
Q: What level of ACID properties are supported?
Q: Do you provide the ability to atomically add a set of triples, where either all are added or none are added?
Q: Do you provide the ability to add a set of triples, respecting the isolation property (so concurrent accessors either see none of the triple values, or all of them)?
Q: What is the time to start a database, create/open a graph?
Q: What sort of security features are built into Virtuoso?
The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.
You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best, but the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files.
To get started, cd into binsrc/tests/lubm.
To verify that this works, you can do
./test_server.sh virtuoso-t
This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.
The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.
The data files themselves are in lubm_8000, 15 files in RDF/XML.
There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.
So now let's go through this procedure.
Start the server:
$ virtuoso-t -f &
This starts the server in foreground mode, and puts it in the background of the shell.
Now we connect to it with the isql utility.
$ isql 1111 dba dba
This gives a SQL> prompt. The default username and password are both dba.
When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.
Next it calls this function for the lubm_8000 directory under the server's working directory.
sparql CLEAR GRAPH <lubm>;
sparql CLEAR GRAPH <inf>;

load_lubm ( server_root() || '/lubm_8000/' );
Then it verifies that the right number of triples is found in the <lubm> graph.
sparql SELECT COUNT(*) FROM <lubm> WHERE { ?x ?y ?z } ;
The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.
Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.
sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
INSERT INTO GRAPH <lubm>
  { ?x ub:subOrganizationOf ?z }
FROM <lubm>
WHERE
  { ?x ub:subOrganizationOf ?y .
    ?y ub:subOrganizationOf ?z . };
Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.
DB.DBA.TTLP ( file_to_string ( 'inf.nt' ),
  'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl', 'inf' );

sparql SELECT COUNT(*) FROM <inf> WHERE { ?x ?y ?z } ;
Then we declare that the triples in the <inf> graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the 'inft' rule set. Otherwise this has no effect.
rdfs_rule_set ('inft', 'inf');
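A query can then turn the rule set on with the input:inference pragma. As a sketch (not the literal text of lubm-inf.sql), the following counts professors, including instances of the subclasses entailed by the ontology:

sparql DEFINE input:inference "inft"
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT COUNT(*) FROM <lubm>
WHERE { ?x a ub:Professor };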
This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.
checkpoint;
Now we are ready for querying.
The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.
If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.
For example:
SQL> sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT * FROM <lubm>
WHERE
  { ?x a ub:Publication .
    ?x ub:publicationAuthor
       <http://www.Department0.University0.edu/AssistantProfessor0> };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
To stop the server, simply type shutdown; at the SQL> prompt.
If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like the following to the end of the virtuoso.ini file in the lubm directory:
[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
Then shut down and restart the server: type shutdown; at the SQL> prompt, and then virtuoso-t -f & at the shell prompt.
Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.
To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF.
To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.
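A minimal sketch of the sequence, reusing the load_lubm procedure defined by lubm-load.sql:

-- switch to row-autocommit mode so the load is not one huge transaction
log_enable (2);
load_lubm ( server_root() || '/lubm_8000/' );
-- make the loaded state durable and truncate the transaction log
checkpoint;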
In Lite mode, with the minimum buffer configuration (NumberOfBuffers = 256), the process size stays under 30MB on 32-bit Linux.
The value of this is that one can now have RDF and full text indexing on the desktop without running a Java VM or any other memory-intensive software. And of course, all of SQL (transactions, stored procedures, etc.) is in the same embeddably-sized container.
The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file. The executable size is about 10MB for 32-bit Linux. A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting "off"; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc.
Lite mode offers full SQL and SPARQL/SPARUL (via SPASQL), but disables all HTTP-based services (WebDAV, application hosting, etc.). Clients can still use all typical database access mechanisms (i.e., ODBC, JDBC, OLE-DB, ADO.NET, and XMLA) to connect, including the Jena and Sesame frameworks for RDF. ODBC now offers full support of RDF data types for C-based clients. A Redland-compatible API also exists, for use with Redland v1.0.8 and later.
Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost.
Shipping an embedded Virtuoso is easy. It just takes one executable and one configuration file. Performance is generally comparable to "normal" mode, except that Lite will be somewhat less scalable on multicore systems.
The Lite mode will be included in the next Virtuoso 5 Open Source release.
This is complex, so I will begin with the point; the interested may read on for the details and implications. Starting with the soon-to-be-released version 6, Virtuoso allows you to say that two things are the same if they share a uniquely identifying property. Examples of uniquely identifying properties would be a book's ISBN, or a person's social security number plus full name. In relational language this is a unique key; in RDF parlance, an inverse functional property.
In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut.
In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set.
On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems.
To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time, and the property does not really have to be inversely functional in the pure sense; foaf:name will do for an example. This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as "same enough," but a query about criminal prosecution would obviously need to be more precise about sameness.
Our ontology is defined like this:
-- Populate a named graph with the triples you want to use in query time inferencing
ttlp ( '
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .

  foaf:mbox_sha1sum a owl:InverseFunctionalProperty .
  foaf:name a owl:InverseFunctionalProperty .
  ', 'xx', 'b3sifp' );
-- Declare that the graph contains an ontology for use in query time inferencing
rdfs_rule_set ( 'http://example.com/rules/b3sifp#', 'b3sifp' );
Then use it:
sparql
DEFINE input:inference "http://example.com/rules/b3sifp#"
SELECT DISTINCT ?k ?f1 ?f2
WHERE
  { ?k foaf:name ?n .
    ?n bif:contains "'Kjetil Kjernsmo'" .
    ?k foaf:knows ?f1 .
    ?f1 foaf:knows ?f2 };
k                                        f1                                                f2
VARCHAR                                  VARCHAR                                           VARCHAR
________________________________________ _________________________________________________ ______________________________

http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....
Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person's FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same.
Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both.

If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT.
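To make the DISTINCT point concrete, here is a sketch of the kind of check one can run, assuming the b3sifp context declared above and a name that occurs in several FOAF files (the literal is only an example):

-- counts one subject per FOAF file
sparql SELECT COUNT (DISTINCT ?p) WHERE { ?p foaf:name "Kjetil Kjernsmo" };

-- with the run-time IFP declaration, subjects sharing the name collapse into one
sparql DEFINE input:inference "http://example.com/rules/b3sifp#"
SELECT COUNT (DISTINCT ?p) WHERE { ?p foaf:name "Kjetil Kjernsmo" };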
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable; nothing like the blow-up of duplicating properties across instances.
The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others.
The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at.
The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty.
Enough talk. Time for an experiment. We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set and do two things with the FOAF data in them: first, we make a graph where each person is reduced to a single canonical IRI carrying the properties of all its aliases; second, we make the full materialization where every alias gets all the properties of every other alias.
For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant.
The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone; we use SPARQL where possible, for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID, and id_to_iri does the reverse.
We count the triples that enter into the smoosh:
-- The name is checked with an existence test because else we'd get several times more rows,
-- due to the names occurring in many graphs
sparql SELECT COUNT(*)
WHERE
  {
    { SELECT DISTINCT ?person WHERE { ?person a foaf:Person } } .
    FILTER ( bif:exists ( SELECT (1) WHERE { ?person foaf:name ?nn } ) ) .
    ?person ?p ?o
  };
-- We get 3284674
We make a few tables for intermediate results.
-- For each distinct name, gather the properties and objects from
-- all subjects with this name
CREATE TABLE name_prop ( np_name ANY, np_p IRI_ID_8, np_o ANY, PRIMARY KEY ( np_name, np_p, np_o ) ); ALTER INDEX name_prop ON name_prop PARTITION ( np_name VARCHAR (-1, 0hexffff) );
-- Map from name to canonical IRI used for the name
CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY, ni_s IRI_ID_8 ); ALTER INDEX name_iri ON name_iri PARTITION ( ni_name VARCHAR (-1, 0hexffff) );
-- Map from person IRI to canonical person IRI
CREATE TABLE pref_iri ( i IRI_ID_8, pref IRI_ID_8, PRIMARY KEY ( i ) ); ALTER INDEX pref_iri ON pref_iri PARTITION ( i INT (0hexffff00) );
-- a table for the materialization where all aliases get all properties of every other
CREATE TABLE smoosh_ct ( s IRI_ID_8, p IRI_ID_8, o ANY, PRIMARY KEY ( s, p, o ) ); ALTER INDEX smoosh_ct ON smoosh_ct PARTITION ( s INT (0hexffff00) );
-- disable transaction log and enable row auto-commit. This is necessary, otherwise
-- bulk operations are done transactionally and they will run out of rollback space.
LOG_ENABLE (2);
-- Gather all the properties of all persons with a name under that name.
-- INSERT SOFT means that duplicates are ignored
INSERT SOFT name_prop
SELECT "n", "p", "o"
  FROM ( sparql DEFINE output:valmode "LONG"
         SELECT ?n ?p ?o WHERE { ?x a foaf:Person . ?x foaf:name ?n . ?x ?p ?o } ) xx ;
-- Now choose for each name the canonical IRI
INSERT INTO name_iri
SELECT np_name,
       ( SELECT MIN (s) FROM rdf_quad
          WHERE o = np_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ) AS mini
  FROM name_prop
 WHERE np_p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- For each person IRI, map to the canonical IRI of that person
INSERT SOFT pref_iri (i, pref)
SELECT s, ni_s
  FROM name_iri, rdf_quad
 WHERE o = ni_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
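As a quick sanity check, one can compare how many person IRIs got folded into how many canonical IRIs; this query was not part of the original run, so no figure is quoted here:

-- total mapped person IRIs vs. distinct canonical IRIs
SELECT COUNT (*), COUNT (DISTINCT pref) FROM pref_iri;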
-- Make a graph where all persons have one iri with all the properties of all aliases
-- and where person-to-person refs are canonicalized
INSERT SOFT rdf_quad (g, s, p, o)
SELECT IRI_TO_ID ('psmoosh'), ni_s, np_p,
       COALESCE ( ( SELECT pref FROM pref_iri WHERE i = np_o ), np_o )
  FROM name_prop, name_iri
 WHERE ni_name = np_name
OPTION ( loop, quietcast ) ;
-- A little explanation: The properties of names are copied into rdf_quad with the name
-- replaced with its canonical IRI. If the object has a canonical IRI, this is used as
-- the object, else the object is unmodified. This is the COALESCE with the sub-query.
-- This takes a little time. To check on the progress, take another connection to the
-- server and do
STATUS ('cluster');
-- It will return something like
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i
-- buffers 549481 253929 d 8 w 0 pfs
-- Now finalize the state; this makes it permanent. Else the work will be lost on server
-- failure, since there was no transaction log
CL_EXEC ('checkpoint');
-- See what we got
sparql SELECT COUNT (*) FROM <psmoosh> WHERE {?s ?p ?o};
-- This is 2253102
-- Now make the copy where all have the properties of all synonyms. This takes so much
-- space we do not insert it as RDF quads, but make a special table for it so that we can
-- run some statistics. This saves time.
INSERT SOFT smoosh_ct (s, p, o)
SELECT s, np_p, np_o
  FROM name_prop, rdf_quad
 WHERE o = np_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- as above, INSERT SOFT so as to ignore duplicates
SELECT COUNT (*) FROM smoosh_ct;
-- This is 167360324
-- Find out where the bloat comes from
SELECT TOP 20 COUNT (*), ID_TO_IRI (p) FROM smoosh_ct GROUP BY p ORDER BY 1 DESC;
The results are:
54728777   http://www.w3.org/2002/07/owl#sameAs
48543153   http://xmlns.com/foaf/0.1/knows
13930234   http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512   http://xmlns.com/foaf/0.1/interest
11415867   http://xmlns.com/foaf/0.1/nick
6683963    http://xmlns.com/foaf/0.1/weblog
6650093    http://xmlns.com/foaf/0.1/depiction
4231946    http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629    http://xmlns.com/foaf/0.1/homepage
1776555    http://xmlns.com/foaf/0.1/holdsAccount
1219525    http://xmlns.com/foaf/0.1/based_near
305522     http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965     http://xmlns.com/foaf/0.1/name
155131     http://xmlns.com/foaf/0.1/dateOfBirth
153001     http://xmlns.com/foaf/0.1/img
111130     http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930      http://xmlns.com/foaf/0.1/gender
48517      http://www.w3.org/2004/02/skos/core#subject
45697      http://www.w3.org/2000/01/rdf-schema#label
44860      http://purl.org/vocab/bio/0.1/olb
Now compare this with the predicate distribution of the smoosh with identities canonicalized:
sparql SELECT COUNT (*) ?p FROM <psmoosh> WHERE { ?s ?p ?o } GROUP BY ?p ORDER BY 1 DESC LIMIT 20;
Results are:
748311   http://xmlns.com/foaf/0.1/knows
548391   http://xmlns.com/foaf/0.1/interest
140531   http://www.w3.org/2000/01/rdf-schema#seeAlso
105273   http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497    http://xmlns.com/foaf/0.1/name
48099    http://www.w3.org/2004/02/skos/core#subject
45179    http://xmlns.com/foaf/0.1/depiction
40229    http://www.w3.org/2000/01/rdf-schema#comment
38272    http://www.w3.org/2000/01/rdf-schema#label
37378    http://xmlns.com/foaf/0.1/nick
37186    http://dbpedia.org/property/abstract
34003    http://xmlns.com/foaf/0.1/img
26182    http://xmlns.com/foaf/0.1/homepage
23795    http://www.w3.org/2002/07/owl#sameAs
17651    http://xmlns.com/foaf/0.1/mbox_sha1sum
17430    http://xmlns.com/foaf/0.1/dateOfBirth
15586    http://xmlns.com/foaf/0.1/page
12869    http://dbpedia.org/property/reference
12497    http://xmlns.com/foaf/0.1/weblog
12329    http://blogs.yandex.ru/schema/foaf/school
We can drop the owl:sameAs triples from the count, so the bloat is a bit less by that, but it still is tens of times larger than the canonicalized copy or the initial state.
Now, when we try using the psmoosh graph, we still get different results from the results with the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. Such relations do exist:
sparql SELECT COUNT (*)
WHERE
  {
    ?s foaf:knows ?thing .
    FILTER ( !bif:exists ( SELECT (1) WHERE { ?thing foaf:name ?nn } ) )
  };
-- 1393940
So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here.
Finally, we calculate the smoosh blow-up factors. We do not include owl:sameAs triples in the counts; for the canonicalized graph this means 2253102 - 23795 = 2229307 triples.
select (167360324 - 54728777) / 3284674.0;
-- 34.290022997716059

select 2229307 / 3284674.0;
-- 0.678699621332284
So, even for a smoosh that is not really the equivalent of the original, the original triple count is multiplied by about 34 when every alias gets every property, or by about 0.68 when synonyms are collapsed into canonical IRIs.
Making the smooshes does not take very long: some minutes for the small one, and 33 minutes for filling the smoosh_ct table; inserting the big one as RDF quads would take longer, a couple of hours maybe. The runs were not optimally tuned, so the numbers serve only to show that smooshing takes time, probably more time than is allowable in an interactive situation, no matter how the process is optimized.
December 11, 2008
Author: Jay Krall
For public relations professionals, finding mentions about a particular brand or product is getting more challenging as the vast clutter of the Web continues to grow. While paid monitoring services like those offered by Cision and others can help, for those using free-text search engines like Google for media monitoring, combing through pages of irrelevant search results has become routine. For example, acronyms pose a problem: how many instances of the term “HP” referring to “horsepower” do you have to sift through to find articles about Hewlett-Packard products? Plenty.
Worse yet, the longer your queries get, the harder it is for search engines to find what you really want. It’s almost 2009. With all this technological innovation happening so fast, why does it seem like computers still can’t read very well? If they were more literate, the monitoring of media and social media for brand mentions would be a lot easier for everyone.
That’s just one practical argument for the importance of the Semantic Web. First described in 1999 by World Wide Web Consortium director Tim Berners-Lee, the Semantic Web, also referred to as Web 3.0, is often described as a vision for the next generation of the Web: pages that can search each other and pull from each other’s data intelligently, melding Web sites and news feeds into precisely honed, individual Web experiences. But actually, the technologies of the Semantic Web are already hard at work, thanks to a group of computer scientists from around the world who are making Berners-Lee’s vision a reality.
Kingsley Idehen, CEO of OpenLink Software, is one of those pioneers. He is one of the creators of DBpedia, a Semantic Web tool that culls data from Wikipedia in amazingly precise ways. The project is a collaboration of OpenLink Software, the University of Leipzig and Freie University Berlin. Simply put, it divides up the site’s information into tags, and uses those tags to develop searches in which the subject is clearly defined, using a computer language that could soon be applied all across the Web. Beginning in late 2006, a program assigned 274 million tags describing nearly 1 billion facts to catalog Wikipedia in this way using the Resource Description Framework (RDF), a commonly accepted format for Semantic Web applications.
( Full story ... )
As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we'll do if, for example, somebody decides to count all the triples in the set.
How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?
Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.
Here we are looking for the top 10 people whom people claim to know without being known in return, like this:
SQL> sparql
SELECT ?celeb, COUNT (*)
WHERE
  {
    ?claimant foaf:knows ?celeb .
    FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) )
  }
GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10;
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________
http://twitter.com/BarackObama             252
http://twitter.com/brianshaler             183
http://twitter.com/newmediajim             101
http://twitter.com/HenryRollins             95
http://twitter.com/wilw                     81
http://twitter.com/stevegarfield            78
http://twitter.com/cote                     66
mailto:adam.westerski@deri.org              66
mailto:michal.zaremba@deri.org              66
http://twitter.com/dsifry                   65
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages
SQL> sparql
SELECT ?celeb, COUNT (*)
WHERE
  {
    ?claimant foaf:knows ?celeb .
    FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) )
  }
GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10;
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________
http://twitter.com/JasonCalacanis           496
http://twitter.com/Twitterrific             466
http://twitter.com/ev                       442
http://twitter.com/BarackObama              356
http://twitter.com/laughingsquid            317
http://twitter.com/gruber                   294
http://twitter.com/chrispirillo             259
http://twitter.com/ambermacarthur           224
http://twitter.com/t                        219
http://twitter.com/johnedwards              188
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages
The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better. But the response time was the same.
If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation.
One can also use this as a basis for paid services. The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.
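For orientation, a sketch of how such a cutoff is typically configured; the parameter names below are from the virtuoso.ini [SPARQL] section as I recall them, so treat the exact spelling and units as assumptions to check against the documentation:

[SPARQL]
MaxQueryCostEstimationTime = 400   ; reject plans whose estimated cost is hopeless
MaxQueryExecutionTime      = 60    ; seconds of real time before anytime partial results are returned

A shorter per-request timeout can also be passed to the /sparql endpoint, in which case the S1TAT condition shown above marks the answer as partial.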
This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.
This month's DataSpaces contains material of interest to the Virtuoso developer and UDA user community alike —
The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task.
Below we show how we do 3.3 million random triple lookups per second on two 8-core commodity servers, producing complete results and joining across partitions. On a single 4-core server, the figure is about 1 million lookups per second. With a single thread, it is about 250K lookups per second. This is the good case, but even our worst case is quite decent.
We took a simple SPARQL query, counting how many people say they reciprocally know each other. In the Billion Triples Challenge data set, there are 25M foaf:knows quads, of which 92K are reciprocal. Reciprocal here means that when x knows y in some graph, y knows x in the same or any other graph.
SELECT COUNT (*) WHERE { ?p1 foaf:knows ?p2 . ?p2 foaf:knows ?p1 }
There is no guarantee that the triple of x knows y is in the same partition as the triple y knows x. Thus the join is randomly distributed, n partitions to n partitions.
We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking. Since then, we have corrected this.
If run on a single thread, this query would be a loop over all the quads with a predicate of foaf:knows, and an inner loop looking for a quad with 3 of 4 fields given (SPO). If we have a partitioned situation, we have a loop over all the foaf:knows quads in each partition, and an inner lookup looking for the reciprocal foaf:knows quad in whatever partition it may be found.
We have implemented this with two different message patterns:
Centralized: One process reads all the foaf:knows quads from all processes. Every 50K quads, it sends a batch of reciprocal-quad checks to each partition that could contain a reciprocal quad. Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end.
Symmetrical: Each process reads the foaf:knows quads in its partition, and every 50K quads sends a batch of checks to each process that could have the reciprocal foaf:knows quad. At the end, the counts are gathered from all partitions. There is some additional control traffic, but we do not go into its details here.
Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 Virtuoso instances. The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache.
Centralized: 35,543 msec, 728,634 sequential + random lookups per second
Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s 316% cpu ...
Symmetrical: 7706 msec, 3,360,740 sequential + random lookups per second
Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s 1137% cpu ...
The second line is the summary from the cluster status report for the duration of the query. The interesting numbers are the KB/s and the %CPU. The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%. The point to note is that the symmetrical approach takes 4x less real time with under half the data transfer rate. Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation.
These figures represent the best and worst cases of distributed JOINing. If we have a straight sequence of JOINs, with single-pattern optionals and existences, and the order in which results are produced is not significant (i.e., there is aggregation, an existence test, or ORDER BY), the symmetrical pattern is applicable. On the other hand, if there are multiple-triple-pattern optionals, complex sub-queries, DISTINCTs in the middle of the query, or results have to be produced in the order of an index, then the centralized approach must be used at least part of the time.
Also, if we must make transitive closures, which can be thought of as an extension of a DISTINCT in a subquery, we must pass the data through a single point before moving the bindings to the next JOIN in the sequence. This happens for example in resolving owl:sameAs at run time. However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, DISTINCTs, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens of thousands of variable bindings into a single message exchange. And if there are not that many intermediate results, then single-query execution time is not a problem anyhow.
For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the foaf:knows relations and then running the foaf:knows relations through the hash. If the hash size is right, a hash lookup is somewhat better than an index lookup. The problem is that when the hash join is not the right solution, it is an expensive mistake: the best case is good; the worst case is very bad. But if there is no index, then hash join is better than nothing. One problem of hash joins is that they make temporary data structures which, if large, will skew the working set. One must be quite sure of the cardinality before it is safe to try a hash join. So we do not do hash joins with RDF, but we do use them sometimes with relational data.
These same methods apply to relational data just as well. This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.
So we decided to do it ourselves.
The score is as follows (updated with a revised innodb_buffer_pool_size setting, based on the advice noted below):
n-clients | Virtuoso | MySQL (with increased buffer pool size) | MySQL (with default buffer pool size)
---|---|---|---
1 | 41,161.33 | 27,023.11 | 12,171.41
4 | 127,918.30 | (pending) | 37,566.82
8 | 218,162.29 | 105,524.23 | 51,104.39
16 | 214,763.58 | 98,852.42 | 47,589.18
The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here.
The benchmark is pure SQL, nothing to do with SPARQL or RDF.
The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux.
The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache. The test effectively runs from memory. We ran an analyze table on each MySQL table but noticed that this had no effect. Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases. This was not apparent in the tests.
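For the record, 8 G of cache on each side corresponds roughly to settings like the following; these are indicative values, not a copy of the configuration used in the run (Virtuoso pages are 8 KB, hence about a million buffers):

; virtuoso.ini, [Parameters] section
NumberOfBuffers = 1000000

# my.cnf, [mysqld] section
innodb_buffer_pool_size = 8G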
The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine --
MySQL ought to do better. We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this. The public is invited to tell us about such.
Update:
Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size setting in the MySQL config. We did, and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran these things a few times so as to have warm cache.
On the first run, we noticed that the InnoDB warm-up time was well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run the larger scale for half an hour first, then measure, and then measure again. If the second measurement is the same as the first, then it is good.
As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.
We got a number of questions about Virtuoso's inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.
Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.
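To make the run-time side concrete, here is a sketch of the two mechanisms mentioned; the instance and class IRIs are made up for the example, and the exact extension syntax should be checked against the SPARQL extension documentation:

-- owl:sameAs resolved at run time, nothing materialized
sparql DEFINE input:same-as "yes"
SELECT ?p ?o WHERE { <http://example.com/thing/1> ?p ?o };

-- transitive closure of rdfs:subClassOf computed at run time
sparql SELECT ?super
WHERE
  {
    { SELECT ?s ?super WHERE { ?s rdfs:subClassOf ?super } }
    OPTION ( TRANSITIVE, t_distinct, t_in (?s), t_out (?super) ) .
    FILTER ( ?s = <http://example.com/schema/Camera> )
  };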
For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.
We had a word with Ian Horrocks on the question. He noted that the community is often naive in tending to equate a description of semantics with a description of an algorithm. The data need not always be blown up.
The advantage of not always materializing is that the working set stays smaller. Once the working set no longer fits in memory, response times jump disproportionately. Also, if the data changes, is retracted, or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement: it can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if subclass and sub-property entailment is materialized rather than done at run time.
Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al, for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of our "conventional wisdom" as we are, this might never have occurred to us.
It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge.
There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware.
It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading
The live demo is at http://b3s.openlinksw.com/. This site is under development and may not be on all the time. We are taking it in the direction of hosting the whole LOD cloud. This is an evolving operation where we will continue showcasing how one can ask increasingly interesting questions from a growing online database, in the spirit of the billion triples charter.
In the words of Jim Hendler, we were not selected for the finale because this would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this since if the event becomes like the TPC benchmarks, this will limit the entrance to full time database players. Anyway, we got a special mention in the intro of the challenge track.
The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the DARQ Matter of Federation post, we have a notion of how to do this production-strength with our cluster engine, now also over wide area networks. We shall see.
The entries from DERI and LARKC (MaRVIN, "Massive RDF Versatile Inference Network") were doing materialization of inference results in a cluster environment. The thing they were not doing was joining across partitions. Thus, the data was partitioned on whatever criterion, and then the data in each partition was further refined according to rules known to all partitions. DERI did not address joining further.
"Nature shall be the guide of the alchemist," goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for information is expensive and subject to misunderstanding; asking another research group might never produce an answer.
Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat latency point to point connectivity — some joins are fast; others are not even tried, for all we know.
A database running on a cluster is a sort of counter-example. A database with RDF workload will end up joining across partitions pretty much all of the time.
MaRVIN's approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things.
The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes.
Another practical consideration is the fact that RDF reasoners are not usually built for distributed memory multiprocessors. If the reasoner must be a plug-in component, then it cannot be expected to be written for grids.
We can think of a product safety use case: Find cosmetics that have ingredients that are considered toxic in the amounts they are present in each product. This can be done as a database query with some transitive operations, like running through a cosmetics taxonomy and a poisons database. If the business logic deciding whether the presence of an ingredient in the product is a health hazard is very complex, we can get a lot of joins.
The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then see if matches are made. The matching logic could be arbitrarily complex since it would run locally. Of course here, some domain knowledge is needed in order to set up the processing so that each product and poison carry all the associated information with them. Dancing with half a partner can bias one's perceptions: Again, it is like nature, sometimes not all cards are on the table.
It would seem that there is some setup involved before answering a question: Composition of partitions, frequency of result exchange, etc. How critical the domain knowledge implicit in the setup is for the quality of results is an interesting question.
The question is, at what point will a cluster using distributed database operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput. The increase is more or less linear, depending on whether all intermediate results pass via one coordinating node (worst case) or whether each node can decide which other node will do the next join step for each result (best case). For example, a DISTINCT operation requires that data passes through a single place, but JOINing and aggregation in general do not.
We will still publish numbers during this November.
The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from SPARQL to SQL. There was a lot of agreement and the meeting even finished ahead of the allotted time.
There was discussion concerning using the Entity Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM's reference cannot depend on any registration. Also, even when there is a stable URI for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.
Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with "same as" assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.
We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as OWL ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has XPath in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the information schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.
Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of data, such a description of the exposed data is essential.
We agreed with Sören Auer that we could make Virtuoso's mapping language compatible with Triplify. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.
There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.
Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go.
I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible.
Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files.
Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that such a structure, especially with a memory-mapped file, probably has a better best case than a B-tree, but that it will also be less predictable under fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With a disk-based, memory-mapped, linked-list storage, we would see some improvement in this, while probably getting hit worse than now in the case of fragmentation. Plus, compaction on the fly would not be nearly as easy, and surely far less local, if there were pointers between pages. So my intuition is that trees are a safer bet with varying workloads, while linked lists can be faster in a query-dominated, in-memory situation.
Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.
Regardless of steady state, BSBM has two generally valid conclusions:
Mike Dean asked whether BSBM was a case of a setup to have triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged — take the Billion Triples Challenge data set for one.
Also, with BSBM, one should note that the query optimization time plays a fairly large role, since most queries touch relatively little data. Also, even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL, since the difference there is in compiling the query, especially since parameters are not used. So, Chris, since we both like to map, let's make a benchmark that shows mapping closer to native SQL.
When we run Virtuoso relational against the Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra columns. Also, since the number of total joins is lower, the overhead of moving from one join to the next is likewise lower. This is just a matter of the count of executed instructions.
A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local: the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of the control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row. With RDF, this is not a given. Also, the column store's row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.
There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. For the ideal row store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables.
Concerning the future direction of the workshop, there were a few directions suggested. One of the more interesting ones was Mike Dean's suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.
For us, this is divided into
I will talk about each in turn.
I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times.
I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer).
Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.
This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store.
We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there.
The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.
There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e, the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).
Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering.
The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.
This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story — for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?
The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?
We should answer in multiple parts.
For general collateral, like web sites and documentation:
The web site gives a confused product image. For the Virtuoso product, we should divide at the top into
For each point, one simple statement. We all know what the above things mean?
Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this much more requests per second. This too is clear.
As far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web-server-only situation. The fact that we have a web listener is important for some things, but our claim to fame does not rest on this.
Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.
Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.
I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.
One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.
This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.
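To give a sense of what those few parameters look like, a typical starting point follows; the values below are the commonly recommended ones for a machine with about 16 GB of RAM and are indicative rather than prescriptive:

[Parameters]
NumberOfBuffers = 1360000    ; 8 KB pages of database cache, roughly two thirds of RAM
MaxDirtyBuffers = 1000000    ; threshold at which background flushing kicks in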
Now for the difficult side:
Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to — PHP or Java or anything else. For web services, simple is best. Stick to basics. "The engineer is one who can invent a simple thing." Use SQL statements rather than admin UI.
Know that you can start a server with no database file and get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered.
We should put this into a couple of use case oriented how-tos.
Also, we should create a network of "friendly local virtuoso geeks" for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there — please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.
Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.
If one really wants to do one's server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.
We are talking about such things with different parties at present.
"Webby means distributed, heterogeneous, open; not monolithic consolidation of everything."
We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice — the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.
Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.
Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.
The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.
Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.
Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.
With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.
For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.
This makes some things possible that were hard thus far.
On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.
Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.
Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.
Can we pull this off, or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it most recently in the Berlin Benchmark. We will also show some TPC-H results once things quiet down after the ISWC event. Then there is ongoing maintenance, but there we have shown a steady turnaround and a quick time-to-fix for pretty much anything.
There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern.
The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources.
I'd say one will develop them much the same way as thus far. In PHP, for example. Whether one's query language is SPARQL or SQL does not make a large difference in how basic web UI is made.
A SPARQL end-point is no more an end-user item than a SQL command-line is.
A common mistake among techies is thinking that the data and the user experience can, or ought to, have the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.
The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly. Even I, at OpenLink, after supposedly having been around the block a couple of times, made some experiments around the topic. What does make sense is putting a thin wrapper of HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good.
The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? "Don't mode me in" was the slogan, as I recall.
Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.
For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.
For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have real edge must be more capable of ad hoc viewing and navigation than regular real-time or line of business user interfaces.
The OpenLink Data Explorer idea of a "data behind the web page" view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this.
When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure. Just the right amount of structure for the application, less rigid and easier to change than relational.
Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure — HTTP, URIs, and XML where desired.
It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired.
Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an e-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students, and anyway they do not hang around online browsing the course catalogue.
On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS.
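To make the cost difference concrete, here is a minimal illustrative sketch with a hypothetical account table (not part of any product schema): in a relational store the change is one in-place row update, whereas a triple store would typically delete and re-insert the corresponding triples in each of its indices.

-- Hypothetical schema, for illustration only.
UPDATE account
   SET login_count = login_count + 1,
       last_login = now ()
 WHERE user_id = 42;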
If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently.
In the near term we will see the birth of an application paradigm for the data web. The data will be open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.
Sören Auer asked me to say a few things about relational to RDF mapping. I will cite some highlights from this, as they pertain to the general scene. There was an "open hacking" session Wednesday night featuring lightning talks. I will use some of these too as a starting point.
The SWEO (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became "Data Unleashed." Why not? Let's give this some context.
So, if we are holding a Semantic Web 101 session, where should we begin? I would hazard a guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world.
Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with myopenlink.net and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our Billion Triples Challenge demo. We will make a webcast about this to make it all clearer.
Behold: The Data Web is about the world becoming a database; writing SPARQL queries or triples is incidental. You will write FOAF files by hand just as little as you now write SQL insert statements for filling in your account information on Myspace.
Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the RDBMS was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more ad hoc. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced.
Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity?
A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain.
However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic.
The most reliable of human drives is the drive to make oneself known. This drives everything, from any social scene to business communications to politics. Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent medium because businesses loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web.
Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data.
Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with linked open data. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question.
We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times.
But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global, information-driven economy?
The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information.
At the lightning talks in Vienna, one participant said, "Man's enemy is not the lion that eats men, it's his own brother. Semantic Web's enemy is the XML Web services stack that ate its lunch." There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze.
SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the ad hoc component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit.
"I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it's too complex, the developer who says it is not what he's used to, and so on?"
This is an education question. One of the findings of SWEO's enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF-based technologies offer to make these more open-standards based, and open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human-resources assets are: experts can't be found even when they are in the next department, and reports and analysis get buried in wikis, spreadsheets, and emails.
Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise knowledge bases from HR, CRM, office automation, wikis, etc. The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.
I will say a few things about what we have been doing and where we can go.
Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post.
There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.
Right now, we are polishing some things in Virtuoso 6: some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.
We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. Along the way, we also ran BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future; it will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for real-time lookups or for ETL.
RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol.
In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM's primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web.
I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.
Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.
Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can perfectly well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples of it later.
The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and "same as" enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.
We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.
We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.
We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.
Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.
The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.
Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable.
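As a sketch of what is meant, such a query can be written directly as a SPARQL aggregate in the style of the other queries here (count (distinct ...) and order by desc 2 are the Virtuoso conventions used throughout this post); the predicate position is simply a variable, which has no counterpart over fixed SQL columns.

sparql select ?p count (distinct ?o)
 where { ?s a foaf:Person . ?s ?p ?o }
 group by ?p order by desc 2 limit 10 ;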
Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.
The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.
Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.
We use Virtuoso 6 Cluster Edition to demonstrate the following:
The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension.
The demo queries can also be run directly against the SPARQL end point.
The demo is being worked on at the time of submission and may be shown online by appointment.
Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008 the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and Open CYC ontologies.
Also examples involving owl:sameAs are being added, likewise with similarity metrics and search hit scores.
The database consists of the billion triples data sets and some additions like Umbel. Also the Freebase extract is newer than the challenge original.
The triple count is 1115 million.
In the case of web harvested resources, the data is loaded in one graph per resource.
In the case of larger data sets like DBpedia or the US Census, all triples of that provenance share a data-set-specific graph.
All string literals are additionally indexed in a full text index. No stop words are used.
Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices.
The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist, on one hand, of well-known SQL features like aggregation with grouping, and existence and value subqueries; and on the other, of RDF-specific features. The latter include run-time RDFS and OWL inferencing support, with backward-chaining subclasses and transitivity.
sparql select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o)) where { ?s ?p ?o . filter (bif:contains (?o, "'semantic web'")) } limit 10 ;
This looks up triples with semantic web in the object and makes a search hit summary of the literal, highlighting the search terms.
sparql select ?tp count(*) where { ?s ?p2 ?o2 . ?o2 a ?tp . ?s foaf:nick ?o . filter (bif:contains (?o, "plaid_skirt")) } group by ?tp order by desc 2 limit 40 ;
This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.
What are these things called?
sparql select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 rdfs:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, "plaid_skirt")) } group by ?lbl order by desc 2 ;
Many of these things do not have an rdfs:label. Let us use a more general concept of label which groups dc:title, foaf:name, and other name-like properties together. The subproperties are resolved at run time; there is no materialization.
sparql define input:inference 'b3s' select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 b3s:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, "plaid_skirt")) } group by ?lbl order by desc 2 ;
We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing.
sparql select ?g count(*) where { graph ?g { ?s ?p ?o . filter (bif:contains (?o, "'terrorist bombing'")) } } group by ?g order by desc 2 ;
Now some web 2.0 tagging of search results. The tag cloud of "computer"
sparql select ?lbl count (*) where { ?s ?p ?o . ?o bif:contains "computer" . ?s sioc:topic ?tg . optional { ?tg rdfs:label ?lbl } } group by ?lbl order by desc 2 limit 40 ;
This query will find the posters who talk the most about sex.
sparql select ?auth count (*) where { ?d dc:creator ?auth . ?d ?p ?o filter (bif:contains (?o, "sex")) } group by ?auth order by desc 2 ;
We look for people who are joined by having relatively uncommon interests but do not know each other.
sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2
 where {
   { select ?i count (*) as ?cnt where { ?p foaf:interest ?i } group by ?i }
   filter ( ?cnt > 1 && ?cnt < 10) .
   ?p1 foaf:interest ?i .
   ?p2 foaf:interest ?i .
   filter (?p1 != ?p2
           && !bif:exists ((select (1) where {?p1 foaf:knows ?p2 }))
           && !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
   ?p1 foaf:nick ?n1 .
   ?p2 foaf:nick ?n2 .
 }
 order by ?cnt limit 50 ;
The query takes a fairly long time, mostly spent counting the interests in 25M interest triples. It then takes people that share an interest and checks that neither claims to know the other. It then sorts the results rarest interest first. The query can be written more efficiently, but is here just to show that database-wide scans of the population are possible ad hoc.
Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF.
create table tag_count (tcn_tag iri_id_8, tcn_count int, primary key (tcn_tag));
alter index tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table tag_coincidence (tc_t1 iri_id_8, tc_t2 iri_id_8, tc_count int,
  tc_t1_count int, tc_t2_count int, primary key (tc_t1, tc_t2));
alter index tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));
create index tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
How many times is each topic mentioned?
insert into tag_count select * from (sparql define output:valmode "LONG" select ?t count (*) as ?cnt where { ?s sioc:topic ?t } group by ?t) xx option (quietcast);
Take all t1, t2 where t1 and t2 are tags of the same subject, store only the permutation where the internal id of t1 < that of t2.
insert into tag_coincidence (tc_t1, tc_t2, tc_count)
  select "t1", "t2", cnt
    from (select "t1", "t2", count (*) as cnt
            from (sparql define output:valmode "LONG"
                  select ?t1 ?t2 where { ?s sioc:topic ?t1 . ?s sioc:topic ?t2 }) tags
           where "t1" < "t2"
           group by "t1", "t2") xx
   where isiri_id ("t1") and isiri_id ("t2")
  option (quietcast);
Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster.
update tag_coincidence set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1), tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing.
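As a sketch (the tag IRI below is made up, and the exact function signatures should be checked against the Virtuoso documentation), such a listing for one given tag can look in both stored permutations, since only the t1 < t2 ordering is kept:

-- Sketch only: related tags for one hypothetical tag IRI.
select top 10 id_to_iri (rel), score
  from (select tc_t2 as rel, tc_count as score from tag_coincidence
          where tc_t1 = iri_to_id ('http://example.org/tag/computers')
        union all
        select tc_t1 as rel, tc_count as score from tag_coincidence
          where tc_t2 = iri_to_id ('http://example.org/tag/computers')) t
 order by score desc;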
To show the URIs of the tags:
select top 10 id_to_iri (tc_T1), id_to_iri (tc_t2), tc_count from tag_coincidence order by tc_count desc;
We look at what interests people have
sparql select ?o ?cnt where { { select ?o count (*) as ?cnt where { ?s foaf:interest ?o } group by ?o } filter (?cnt > 100) } order by desc 2 limit 100 ;
Now the same for the Harry Potter fans
sparql select ?i2 count (*) where { ?p foaf:interest <http://www.livejournal.com/interests.bml?int=harry+potter> . ?p foaf:interest ?i2 } group by ?i2 order by desc 2 limit 20 ;
We see whether knows relations are symmetrical. We return the top n people that others claim to know without being reciprocally known.
sparql select ?celeb, count (*) where { ?claimant foaf:knows ?celeb . filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant }))) } group by ?celeb order by desc 2 limit 10 ;
We look for a well connected person to start from.
sparql select ?p count (*) where { ?p foaf:knows ?k } group by ?p order by desc 2 limit 50 ;
We look for the most connected of the many online identities of Stefan Decker.
sparql select ?sd count (distinct ?xx) where { ?sd a foaf:Person . ?sd ?name ?ns . filter (bif:contains (?ns, "'Stefan Decker'")) . ?sd foaf:knows ?xx } group by ?sd order by desc 2 ;
We count the transitive closure of Stefan Decker's connections
sparql select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = <mailto:stefan.decker@deri.org>) } ;
Now we do the same while following owl:sameAs links.
sparql define input:same-as "yes" select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = <mailto:stefan.decker@deri.org>) } ;
The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. The system demonstrated hosts these 12 servers on 2 machines, each with 2 x Xeon 5345, 16GB memory, and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation. The timing difference between first and second run is easily an order of magnitude.
Many of you will know about the W3C relational-to-RDF mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational-to-RDF mapping.
To this effect, I recently summarized the group discussions and some of our own experiences around the topic at <http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling>.
I will here discuss this less formally and more in the light of our own experience. A working group goal statement must be neutral vis-à-vis the following points, even if any working group will unavoidably encounter these issues along the way. A blog post, on the other hand, can be more specific.
I gave a talk to the RDB2RDF XG this spring, with these slides.
The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided.
At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. In practice, however, there is almost no comparison between the complexity of doing non-trivial mappings on-the-fly and that of mapping as ETL. Some of this complexity spills over into the requirements for a mapping language.
We expect to have a situation where one virtual triple can have many possible sources. The mapping is a union of mapped databases. Any integration scenario will have this feature. In such a situation, if we are JOINing using such triples, we end up with UNIONs of all databases that could produce the triples in question. This is generally not desired. Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario.
To make the point clearer, suppose a query like "list the organizations whose representatives have published about xx." Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliation to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the laboratories that have published with articles with tag XX. It is a matter of common sense in this scenario that a publication will have the author and the author's affiliation in the same database. However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another: A paper in database X will usually not have an author in database Y. The IDs in database Y, even if perchance equal to the IDs in X, do not mean the same thing, and there is no point joining across databases by them.
This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping. If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint.
This is critical. Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted SQL over the same data sources.
SPARQL queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are literals in the query.
Virtuoso has some SQL extensions for dealing with breaking a wide table into a row per column. This facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso's SQL virtualization/federation takes care of the matter. If a mapping system goes directly to third-party SQL, no such tricks can be used.
The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be literals. The alternative is prohibitive run-time cost and complexity.
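For illustration (the subject IRI below is hypothetical), a pattern with a variable predicate against a mapped source means that every mapped column of every candidate table is a possible binding, so the translation becomes a UNION over all of them:

sparql select ?p ?o where { <http://example.org/dbX/person/7> ?p ?o } ;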
But we must not lose the baby with the bath-water. Aside from offering global identifiers, RDF's attractions include subclasses and sub-predicates. In relational terms, these translate to UNIONs and do involve some added cost. A mapping system just has to have means of dealing with this cost, and of recognizing cases where this cost is prohibitive. Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings.
Whether to warehouse or not? If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Vipul Kashyap gave a position paper at last year's RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets. The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples.
Our take is that if something is a large or very large relational store, then map; else, ETL. With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations.
One more such issue is dealing with UNIONs when integrating sources that talk of similar things. This was a quick summary, by no means comprehensive, of what an eventual RDB2RDF working group would come across. This is a sort of addendum to the requirements I outlined on the ESW wiki.
I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language.
The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead.
It is now time to apply this principle to graph traversal.
The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.
The ad-hoc nature and very large volume of RDF data make this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step.
This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc.
Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand ("backward chaining") and mapping of relational data into RDF without copying.
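To make the contrast concrete, the materialization route for the part-of example would pre-compute the implied facts with something like the sketch below (the IRIs are hypothetical, and the insert has to be repeated until no new triples appear), whereas backward chaining leaves the closure to be computed at query time.

sparql
  # sketch only: hypothetical IRIs; repeat until no new triples are added
  insert into graph <http://example.org/closure>
    { ?a <http://example.org/partOf> ?c }
  where { ?a <http://example.org/partOf> ?b .
          ?b <http://example.org/partOf> ?c } ;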
The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct, which allows definition of recursive views. The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node's immediate children.
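For reference, the shape of that construct in standard SQL (a recursive common table expression; the part table here is hypothetical) is something like the following:

-- Hypothetical table part (part_id, parent_id): walk the subtree under part 1.
WITH RECURSIVE subparts (part_id) AS (
    SELECT part_id FROM part WHERE part_id = 1           -- the starting node
    UNION ALL
    SELECT p.part_id FROM part p, subparts s
     WHERE p.parent_id = s.part_id                        -- its children, recursively
)
SELECT * FROM subparts;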
The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL's non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.
Take a question like "list the parts of products of category C which have materials that are classified as toxic." Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where "toxic" has a multilevel substructure.
Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth — regular cost based optimization.
Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.
In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.
We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., a SELECT in another SELECT's FROM clause, with a TRANSITIVE clause.
Consider the data:
CREATE TABLE "knows" ("p1" INT, "p2" INT, PRIMARY KEY ("p1", "p2") );
ALTER INDEX "knows" ON "knows" PARTITION ("p1" INT);
CREATE INDEX "knows2" ON "knows" ("p2", "p1") PARTITION ("p2" INT);
We represent a social network with the many-to-many relation "knows". The persons are identified by integers.
INSERT INTO "knows" VALUES (1, 2); INSERT INTO "knows" VALUES (1, 3); INSERT INTO "knows" VALUES (2, 4);
SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT "p1", "p2" FROM "knows" ) "k" WHERE "k"."p1" = 1;
We obtain the result:
p1  p2
 1   3
 1   2
 1   4
The operation is reversible:
SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT "p1", "p2" FROM "knows" ) "k" WHERE "k"."p2" = 4;
p1  p2
 2   4
 1   4
Since now we give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 and 2 is known by 1.
To see what would happen if x knowing y also meant y knowing x, one could write:
SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT "p1", "p2" FROM (SELECT "p1", "p2" FROM "knows" UNION ALL SELECT "p2", "p1" FROM "knows" ) "k2" ) "k" WHERE "k"."p2" = 4;
p1  p2
 2   4
 1   4
 3   4
Now, since we know that 1 and 4 are related, we can ask how they are related.
SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT "p1", "p2", T_STEP (1) AS "via", T_STEP ('step_no') AS "step", T_STEP ('path_id') AS "path" FROM "knows" ) "k" WHERE "p1" = 1 AND "p2" = 4;
p1  p2  via  step  path
 1   4    1     0     0
 1   4    2     1     0
 1   4    4     2     0
The first two columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., p1, has number 0. Since there can be multiple solutions, the last column is a sequence number that allows distinguishing alternative paths from each other.
For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn search result views can be written as:
SELECT p2, dist,
       (SELECT COUNT (*) FROM "knows" "c" WHERE "c"."p1" = "k"."p2" )
  FROM (SELECT TRANSITIVE t_in (1) t_out (2) t_distinct
               "p1", "p2", t_step ('step_no') AS "dist"
          FROM "knows" ) "k"
 WHERE "p1" = 1
 ORDER BY "dist", 3 DESC;
p2  dist  aggregate
 2     1          1
 3     1          0
 4     2          0
The queries shown above work on Virtuoso v6. When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.
Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.
Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS.
In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.
Virtuoso has an extensive collection of RDF-izers called Sponger Cartridges. These take a web resource in one of 30+ formats (so far) and extract RDF from it. The Virtuoso Sponger is a device which evaluates a query and, along the way, finds dereferenceable links, dereferences them, and iteratively re-evaluates the query, until either nothing new is found or some limit is reached.
We could call this query-driven crawling. The idea is intuitive — what one looks for, determines what one finds.
This does however raise certain questions pertaining to the nature and ultimate possibility of knowledge, i.e., epistemology.
The process of querying could be said to go from the few to the many, just like the process of harvesting data from the web, the way any search engine does. One follows links or makes joins and thereby increases one's reach.
The difference is that a query has no a priori direction. If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all. Closed world, as it is said. Never mind that the friends would have had a "see also" link to a retrievable document that did have a phone number.
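For instance (the subject IRI below being hypothetical), a closed-world plan for the query that follows may return an empty result simply by noting that there are no foaf:phone triples at all, without ever touching foaf:knows, and therefore without ever seeing the friends' rdfs:seeAlso links:

sparql select ?friend ?phone
 where { <http://example.org/me> foaf:knows ?friend .
         ?friend foaf:phone ?phone } ;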
The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution. What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way. Where query and crawl appeared to be similar, they in fact have opposite goals.
The user generally has no idea of the execution plan. In the general case, the user cannot have an idea of this plan. There are valid reasons, over 40 years old, for leaving the query planning to the database. In exceptional situations the user can read or direct plans, but this is really quite tedious and requires understanding that is basically never present.
So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything? This is certainly a desirable goal, and all in the open world, distributed spirit of the web.
Let us limit ourselves to queries that have some literals in the object or subject positions. A SPARQL query is basically a graph. Its vertices are variables and literals, and its edges are triple patterns. An edge is labeled by a predicate. For now, we will consider the predicate to always be a literal. From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal. Each tree is not always a spanning tree of the graph, but all the trees collectively span the graph.
Consider the query
{ <john> knows ?x . <mary> knows ?x . ?x label ?l }.
The starting points are the literals john and mary. The john tree has one child, ?x, which has the children mary and ?l. One could notate it as
{ <john> knows ?x . {{ <mary> knows ?x} UNION {?x label ?l}}}
That is, the head first, and if it has more than one child, a union listing them, recursively.
If one composed such queries for each literal in the original pattern, evaluated each as a breadth-first walk of the tree (no query optimization tricks), and, for each binding of each variable, recorded whether there was something to dereference, one would in finite time have reached all the directly reachable data. Then one could evaluate the original query, using whatever plan was preferred.
The check for dereferenceable data applied to each IRI-valued binding formed in the above evaluation, would consist of looking for "see also", "same as", and other such properties of the IRI. It could also consult text based search engines. Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources. We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough.
We have here shown a way of transforming SPARQL queries in such a way as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans.
The present Sponger does not work exactly in this manner but it will be developed in this direction. Fortunately, the algorithms outlined above are nothing complicated.
I finally got around to running the SP2B SPARQL Performance Benchmark on the current Virtuoso Open Source Edition, v5.0.8.
I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.
I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.
I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.
The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit FROM clause added; the client was the command line Interactive SQL (iSQL).
If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the SPARQL protocol is not practical.
I will say something more about SP2B when I get to have a closer look.
I will here summarize what should be known about running benchmarks with Virtuoso.
For 8G RAM, in the [Parameters] stanza of virtuoso.ini, set —
[Parameters]
...
NumberOfBuffers = 550000
For 16G RAM, double this—
[Parameters]
...
NumberOfBuffers = 1100000
For most cases, certainly all RDF cases, Read Committed should be the default transaction isolation. In the [Parameters] stanza of virtuoso.ini, set —
[Parameters]
...
DefaultIsolation = 2
If ODBC, JDBC, or similarly connected client applications are used, there must be more ServerThreads available than there will be client connections. In the [Parameters] stanza of virtuoso.ini, set —
[Parameters]
...
ServerThreads = 100
With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer ServerThreads than there are concurrent clients. The MaxKeepAlives should be the maximum number of expected web clients. This can be more than the ServerThreads count. In the [HTTPServer] stanza of virtuoso.ini, set —
[HTTPServer]
...
ServerThreads = 100
MaxKeepAlives = 1000
KeepAliveTimeout = 10
Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, should not exceed the licensed thread count.
The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed).
For the above described example, in the [Database] stanza of virtuoso.ini, set —
[Database]
...
Striping = 1
MaxCheckpointRemap = 2000000
— and in the [Striping] stanza, on one line per SegmentName, set —
[Striping]
...
Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6
As can be seen here, each file gets a background IO thread (the = qxxx clause). It should be noted that all files on the same physical device should have the same qxxx value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.
If queries have lots of joins but access little data, as with the Berlin SPARQL Benchmark, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. Thus, in the [Parameters] stanza of virtuoso.ini, set —
[Parameters]
...
StopCompilerWhenXOverRunTime = 1
The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF.
Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts.
The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008.
We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case.
Warehousing all the world's publishable data as RDF is not our first preference, nor would it be the publisher's. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we'll do here.
First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit.
But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. And when there is a clear mapping, there is no reason the SPARQL could not be directly translated.
If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!
We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in "query mixes per hour". (An update or follow-on to this post will provide elapsed times for each test run.)
With the unmodified benchmark we got:
Physical Triples: 1297 qmph Mapped Triples: 3144 qmph
In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)
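For illustration, the change amounts to replacing Q6's regex filter over product labels with a text-index lookup through Virtuoso's bif:contains full-text extension. The sketch below is schematic, with made-up search words standing in for the benchmark's generated ones:
sparql PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?product ?label WHERE {
  ?product rdfs:label ?label .
  # text-index lookup replacing the regex filter over ?label
  ?label bif:contains '"gizmo" OR "widget" OR "gadget"' };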
The following were measured on the second run of a 100 query mix series, single test driver, warm cache.
Physical Triples: 5746 qmph Mapped Triples: 7525 qmph
We then ran the same with 4 concurrent instances of the test driver. The qmph here is computed as 400 query mixes divided by the longest of the four run times.
Physical Triples: 19459 qmph Mapped Triples: 24531 qmph
The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization.
The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough.
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.
We also have a few suggestions for future versions of the benchmark.
Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC.
Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.
Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let's use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like "customers who bought this also bought xxx."
For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC.
Add transactions. Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry, and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server.
The time of most queries grows less than linearly with the scale factor. Q6 is an exception if it is not implemented using a text index. Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.
We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.
]]>We had a look at Chris Bizer's initial results with the Berlin SPARQL Benchmark (BSBM) on Virtuoso. The first results were rather bad, as nearly all of the run time was spent optimizing the SPARQL statements and under 10% actually running them.
So I spent a couple of days on the SPARQL/SQL compiler, to the effect of making it do a better guess of initial execution plan and streamlining some operations. In fact, many of the queries in BSBM are not particularly sensitive to execution plan, as they access a very small portion of the database. So to close the matter, I put in a flag that makes the SQL compiler give up on devising new plans if the time of the best plan so far is less than the time spent compiling so far.
With these changes, available now as a diff on top of 5.0.7, we run quite well, several times better than initially. With the compiler time cut-off in place (ini parameter StopCompilerWhenXOverRunTime = 1), we get the following times, output from the BSBM test driver:
Starting test...
0: 1031.22 ms, total: 1151 ms
1: 982.89 ms, total: 1040 ms
2: 923.27 ms, total: 968 ms
3: 898.37 ms, total: 932 ms
4: 855.70 ms, total: 865 ms
Scale factor: 10000
Number of query mix runs: 5 times
min/max Query mix runtime: 0.8557 s / 1.0312 s
Total runtime: 4.691 seconds
QMpH: 3836.77 query mixes per hour
CQET: 0.93829 seconds average runtime of query mix
CQET (geom.): 0.93625 seconds geometric mean runtime of query mix
Metrics for Query 1:
Count: 5 times executed in whole run
AQET: 0.012212 seconds (arithmetic mean)
AQET(geom.): 0.009934 seconds (geometric mean)
QPS: 81.89 Queries per second
minQET/maxQET: 0.00684000s / 0.03115700s
Average result count: 7.0
min/max result count: 3 / 10
Metrics for Query 2:
Count: 35 times executed in whole run
AQET: 0.030490 seconds (arithmetic mean)
AQET(geom.): 0.029776 seconds (geometric mean)
QPS: 32.80 Queries per second
minQET/maxQET: 0.02467300s / 0.06753000s
Average result count: 22.5
min/max result count: 15 / 30
Metrics for Query 3:
Count: 5 times executed in whole run
AQET: 0.006947 seconds (arithmetic mean)
AQET(geom.): 0.006905 seconds (geometric mean)
QPS: 143.95 Queries per second
minQET/maxQET: 0.00580000s / 0.00795100s
Average result count: 4.0
min/max result count: 0 / 10
Metrics for Query 4:
Count: 5 times executed in whole run
AQET: 0.008858 seconds (arithmetic mean)
AQET(geom.): 0.008829 seconds (geometric mean)
QPS: 112.89 Queries per second
minQET/maxQET: 0.00804400s / 0.01019500s
Average result count: 3.4
min/max result count: 0 / 10
Metrics for Query 5:
Count: 5 times executed in whole run
AQET: 0.087542 seconds (arithmetic mean)
AQET(geom.): 0.087327 seconds (geometric mean)
QPS: 11.42 Queries per second
minQET/maxQET: 0.08165600s / 0.09889200s
Average result count: 5.0
min/max result count: 5 / 5
Metrics for Query 6:
Count: 5 times executed in whole run
AQET: 0.131222 seconds (arithmetic mean)
AQET(geom.): 0.131216 seconds (geometric mean)
QPS: 7.62 Queries per second
minQET/maxQET: 0.12924200s / 0.13298200s
Average result count: 3.6
min/max result count: 3 / 5
Metrics for Query 7:
Count: 20 times executed in whole run
AQET: 0.043601 seconds (arithmetic mean)
AQET(geom.): 0.040890 seconds (geometric mean)
QPS: 22.94 Queries per second
minQET/maxQET: 0.01984400s / 0.06012600s
Average result count: 26.4
min/max result count: 5 / 96
Metrics for Query 8:
Count: 10 times executed in whole run
AQET: 0.018168 seconds (arithmetic mean)
AQET(geom.): 0.016205 seconds (geometric mean)
QPS: 55.04 Queries per second
minQET/maxQET: 0.01097600s / 0.05066900s
Average result count: 12.8
min/max result count: 6 / 20
Metrics for Query 9:
Count: 20 times executed in whole run
AQET: 0.043813 seconds (arithmetic mean)
AQET(geom.): 0.043807 seconds (geometric mean)
QPS: 22.82 Queries per second
minQET/maxQET: 0.04274900s / 0.04504100s
Average result count: 0.0
min/max result count: 0 / 0
Metrics for Query 10:
Count: 15 times executed in whole run
AQET: 0.030697 seconds (arithmetic mean)
AQET(geom.): 0.029651 seconds (geometric mean)
QPS: 32.58 Queries per second
minQET/maxQET: 0.02072000s / 0.03975700s
Average result count: 1.1
min/max result count: 0 / 4
real    0m5.485s
user    0m2.233s
sys     0m0.170s
Of the approximately 5.5 seconds of running five query mixes, the test driver spends 2.2 s. The server side processing time is 3.1 s, of which SQL compilation is 1.35 s. The rest is miscellaneous system time. The measurement is on 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.
We note that this type of workload would be done with stored procedures or prepared, parameterized queries in the SQL world.
There will be some further tuning still but this addresses the bulk of the matter. There will be a separate message about the patch containing these improvements.
]]>I thought that we had talked ourselves to exhaustion and beyond over the issue of the semantic web layer cake. Apparently not. There was a paper called Functional Architecture for the Semantic Web by Aurona Gerber et al at ESWC2008.
The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case.
Any diagram containing the boxes of the layer cake (i.e., URI, XML, SPARQL, OWL, RIF, Crypto, etc., etc.) in whatever order or arrangement can at best be a sort of overview of how these standards reference each other.
Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixes with the tension and compression resistance properties of metals and composites to produce motion, and then adding links to Newton's laws of motion and to aerodynamics.
Not false. But it does not say that a car is good for economical commute or showing off at the strip or any number of niches that a mature industry has grown to serve.
Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface.
Such a thing cannot possibly be inferred from the standards' inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low data volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts.
We hear it over and over. Scale and speed are critical. Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure.
The even more foundational issue is the promotion of the general concept of a Web of Data.
The core idea that the Web would be a query-able collection of data with meaningful reference between data of different provenance cannot be inferred from the picture, even though this should be its primary message. Or it is better to say that the first picture shown should stress this idea and then one could leave the layer cake, in whatever version, for explaining the standards' order of evolution or inter-reference.
So, the value proposition:
Why? Explosion of data volume, increased need of keeping up-to-date, increasing opportunity cost of not keeping in real time.
What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale.
How? URI everything and everything is cool; in other words, give things global names. Use RDF. Reuse names or ontologies where one can. (An ontology is a set of classes and property names plus some more.) Map relational data on the fly or store as RDF, whichever works. Query with SPARQL, which is easier than SQL.
So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.
]]>At ESWC2008, we saw the Linked Open Data Cloud condense its first drops of precipitation.
voiD, Vocabulary of Interlinked Datasets, is an idea whose time has clearly come. By the end of the conference, many speakers had already adopted the meme.
The point is to describe what is inside the data sets. People may know this from having worked with the sets or from putting them together but to an outsider this is not evident.
The Semantic Sitemap says where there are files or end points for access. But it does not say what is inside these. Also for federation, it is important to be able to determine whether it makes sense to send a particular query to a particular end point.
If we play this right, this is what voiD will provide. I am reminded of Dan Simmons' flamboyant Hyperion sci-fi series, where the "void which binds" was a sort of hyperspace that contained the thoughts of entities past and present, and even provided teleportation.
So what does the voiD hold, aside from infinite potentialities?
The obvious part is DC-like provenance, version, authorship, license, and similar data-set-wide information. Also, the subject matter could be classified by reference to UMBEL or the Yago classification of DBpedia.
More is needed, though. The simple part is listing the ontologies, if any. Listing the namespaces used would also be an idea, but such a list could be very large.
So let us look at what we'd like to be able to answer with the voiD set.
The following could be a sample of voiD questions:
What subjects are in the LOD cloud?
Given this URI, what set in the LOD cloud can tell me more? This is divided into asking a text index like Sindice for the location, getting the namespace or data set and then querying voiD.
What need I federate/load in order to combine all that is reachable from a given vocabulary? There could be for example a graph showing the data sets and edges between them, edges being qualified by a set of same as assertions, itself a voiD described set, if translations were needed.
What sets are from the same or equally trusted publisher as this one?
These things are roughly divided into description of the set and then some details on how it is stored on a given end point.
Given this set, in which other sets will I find use of the same URIs? For example, if I have language version x, I wish to know that language version y will have the same URIs insofar the things meant are the same.
Given this set, which sets of same as assertions will I have for mapping to which other sets? For example, if I have Geonames, I wish to know that set x will map at least some of the URIs in Geonames to DBpedia URIs.
Let me further point out that it is increasingly clear to the community that universal sameAs is dubious, hence sameAs assertions ought to be kept separate and included or excluded depending on the usage context.
Given this set, what are the interesting queries I can do? This is a sort of advertisement for human consumption. This is not a list of queries for crashing the end point. Denial of service can be done in SPARQL without knowing the end point content anyhow, so this is not an added risk exposure.
Vocabularies used. This is a reference to the OWL or RDFS resources giving the applicable ontologies, if present. Also, a complete list of classes whose direct instances actually occur in the set is useful.
Ballpark cardinality. Something like a DARQ optimization profile would be a good idea. I would say that there should be a possibility of just including a DARQ description file as is. This is a sort of baseline, and since it already exists, we are spared the committee trouble of figuring out what it ought to contain and what not. If we start defining this from scratch, it will take a long time. Further, let this be optional. Quite independently of this, query processors may make optimization-related queries to remote end points insofar as the specific end point supports these. This will come in time. For now, just the basics.
Along with this, LOD SPARQL end points could adopt a couple of basic conventions. The simplest would be to agree that each would host a graph with a given URI that would contain the voiD descriptions of the data sets contained, along with the graph URI used for each set, if different from the publisher's URI for the graph. There is a point to this since an end point may load multiple data sets into one graph.
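As a sketch of what such a convention would enable, a client could ask an end-point what it hosts with something like the query below. The graph URI is a placeholder, and the void: property names follow the vocabulary as it is taking shape, so treat both as assumptions:
sparql PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?endpoint ?vocab WHERE {
  # the conventional, well-known graph holding the voiD descriptions
  GRAPH <http://example.org/void> {
    ?dataset a void:Dataset .
    ?dataset void:sparqlEndpoint ?endpoint .
    OPTIONAL { ?dataset void:vocabulary ?vocab } } };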
We hope to have a good grasp of the matter shortly, certainly a general statement of direction to be published at Linked Data Planet in a couple of weeks.
]]>Astronomers propose that the universe is held together, so to speak, by the gravity of invisible "dark matter" spread in interstellar and intergalactic space.
For the data web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in cyberspace.
To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.
DARQ is Bastian Quilitz's federated extension of the Jena ARQ SPARQL processor. It has existed for a while and was also presented at ESWC2008. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.
Bastian had split DBpedia among five Virtuoso servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.
Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.
Since we are convinced of the cause, let's talk about the means.
For DARQ as it now stands, there's probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one HTTP/1.1 message. So, if the query is "get me my friends living in cities of over a million people," there will be the fragment "get city where x lives" and later "ask if population of x greater than 1000000". If I have 100 friends, I send the 100 requests in a batch to each eligible server.
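To make the batching concrete, the per-friend probes against a single end-point can also be folded into one query that carries all the candidate bindings. The sketch below uses placeholder URIs and an ex: namespace standing in for the real vocabulary:
sparql PREFIX ex: <http://example.org/schema#>
SELECT ?friend ?city WHERE {
  ?friend ex:livesIn ?city .
  # one request covering the whole batch of candidate friends
  FILTER (?friend = <http://example.org/people/1> ||
          ?friend = <http://example.org/people/2> ||
          ?friend = <http://example.org/people/3>) };
Whether sent as one query or as pipelined requests in one HTTP/1.1 message, the effect is the same: one round trip amortized over the whole batch of bindings.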
Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over JDBC so Bastian can try this if interested.
These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.
When federating SQL, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:
If a foaf:Person is found on a given server, this does not mean that the Person's geek code or email hash will be on the same server. Thus {?p name "Johnny" . ?p geekCode ?g . ?p emailHash ?h } does not necessarily denote a colocated join if many servers serve items of the vocabulary.
However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there were a server that gave geek codes but not names.
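As an illustration of the trade-off, in ARQ-style SERVICE syntax (the end-point URL and the ex: vocabulary are placeholders), the colocation assumption lets the whole fragment travel as a single sub-query:
sparql PREFIX ex: <http://example.org/schema#>
SELECT ?g ?h WHERE {
  # assumed-colocated fragment sent to one end-point in one piece
  SERVICE <http://people.example.org/sparql> {
    ?p ex:name "Johnny" .
    ?p ex:geekCode ?g .
    ?p ex:emailHash ?h } };
The fallback of partitioning each join step would instead wrap each of the three patterns in its own SERVICE block per candidate end-point, at a far higher message cost.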
For Virtuoso, we find that a federated query's asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.
For description, we would take DARQ's description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.
We would like to talk to Bastian about large improvements to DARQ, specially when working with Virtuoso. We'll see.
Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso Sponger. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.
The emergence of voiD (Vocabulary of Interlinked Data) is a step in the direction of making federation a reality. There is a separate post about this.
]]>The W3C has recently launched an incubator group about mapping relational data to RDF.
From participating in the group for the few initial sessions, I get the following impressions.
There is a segment of users, for example from the biomedical community, who do heavy-duty data integration and look to RDF for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity are points of interest.
There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as Triplify and semantic web search such as Sindice. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all.
(Some cleanup is taking place with Yago and Zitgist, but this is a matter for a different post.)
Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web.
As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.
Where do we stand?
We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source RDBMS(s) on demand. Projects in this direction are ongoing.
We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.
These things are somewhat different.
At present, both the research warehousers and the web scalers predominantly go for ETL.
This is fine by us as we definitely are in the large RDF store race.
Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in SPARQL, once the obviously necessary extensions of sub-query, expressions and aggregation are in place.
To cite an example, the Ordnance Survey of the UK has a GIS system running on Oracle with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. "Such a big server farm that we'd have to put it on our map," as Jenny Harding put it at ESWC2008. I'd add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets.
So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query.
You can download the slide set I used for the Virtuoso presentation for the RDB to RDF mapping incubator group (PPT; other formats coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the information architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble and non-trivial question.
We will show the basics of this with the TPC-H mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings from keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (PDF; other formats coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing SQL that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an IN sub-query instead of the equivalent existence test.
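For illustration, the two formulations below are logically equivalent; the tables and the predicate are schematic, TPC-H style, and the claim is only that the first shape tends to suit Oracle better:
-- IN sub-query
SELECT o_orderkey FROM orders
 WHERE o_custkey IN (SELECT c_custkey FROM customer WHERE c_acctbal > 1000);
-- equivalent existence test
SELECT o_orderkey FROM orders o
 WHERE EXISTS (SELECT 1 FROM customer c
                WHERE c.c_custkey = o.o_custkey AND c.c_acctbal > 1000);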
Follow this blog for more on the topic; published papers are always a limited view on the matter.
]]>Yrjänä Rankka and I attended ESWC2008 on behalf of OpenLink.
We were invited at the last minute to give a Linked Open Data talk at Paolo Bouquet's Identity and Reference workshop. We also had a demo of SPARQL BI (PPT; other formats coming soon), our business intelligence extensions to SPARQL, as well as joining between relational data mapped to RDF and native RDF data. I was also speaking at the social networks panel chaired by Harry Halpin.
I have gathered a few impressions that I will share in the next few posts (1 - RDF Mapping, 2 - DARQ, 3 - voiD, 4 - Paradigmata). Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.
Linked Open Data was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for LOD this time around is voiD, the Vocabulary of Interlinked Datasets, a means to describe what in fact is inside the LOD cloud, how to join it with what and so forth. Big time important if there is to be a web of federatable data sources, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of Linked Data Planet in a couple of weeks.
Bastian Quilitz gave a talk about his DARQ, a federated version of Jena's ARQ.
Something like DARQ's optimization statistics should make their way into the SPARQL protocol as well as the voiD data set description.
We really need federation but more on this in a separate post.
Axel Polleres et al had a paper about XSPARQL, a merge of XQuery and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the SPARQL protocol for this to be practical. We could do this. We'll have to see when.
Politically, using XQuery to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.
The social web panel presented the question of whether the sem web was ready for prime time with data portability.
The main thrust was expressed in Harry Halpin's rousing closing words: "Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our semantic web, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now."
So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as OAuth and OpenID. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the "outdated pace" of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything.
I said that triples and vocabularies existed, that OpenLink had ODS (OpenLink Data Spaces, Community LinkedData) for managing one's data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today's prices, without replication. Count 16G RAM and a few cores per Gtriple so that one is not waiting for disk all the time.
The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.
So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!
As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, information overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.
For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by nominibus suis, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.
For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.
Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an open world. For an app-specific warehouse, sameAs can be resolved at load time.
There was naturally some apparent tension between the Occam camp of entity name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.
The semantic search workshop was interesting, especially CYC's presentation. CYC is, as it were, the grand old man of knowledge representation. Over the long term, I would like to see support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge knowledge base to help with search-type queries. If it is for transactions or financial reporting, then queries will be SQL and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We'll see.
I will also have a closer look at natural language one of these days, quite inevitably, since Zitgist (for example) is into entity disambiguation.
Garlik gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.
Garlik probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and Sindice/Zitgist as triples.
There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.
For the Virtuoso roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At Linked Data Planet, we will make more announcements.
]]>We ran the DBpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries.
Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box.
We ran the queries in two variants:
First, restricted to the DBpedia graph with a FROM clause, using the default indices; and second, with no graph specified, using the alternate index scheme described below.
The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times.
First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient.
So we added the index:
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
     | Virtuoso v5 with gspo, ogps, pogs | Virtuoso Cluster v6 with gspo, ogps | Virtuoso Cluster v6 with gspo, ogps, pogs
cold | 210 s | 136 s | 33.4 s
warm | 0.600 s | 4.01 s | 0.628 s
OK, so now let us do it without a graph being specified. For all platforms, we drop any existing indices, and --
create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g));
alter index R2 on R2 partition (s int (0hexffff00));
log_enable (2);
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;
drop table rdf_quad;
alter table r2 rename RDF_QUAD;
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient.
With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia.
If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.
     | Virtuoso v5 with gspo, ogps, pogs | Virtuoso Cluster v6 with spog, pogs, opgs, gpos
warm | 0.595 s | 0.617 s
The cold times were about the same as above, so not reproduced.
It is in the SPARQL spirit to specify a graph, and for pretty much any application there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.
On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant different setups of the database, simple as that.
The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should do it maybe with some configuring, as shown above.
Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some "super-graph" object, a graph of all graphs. I don't think this is directly supported and besides most apps would not need it.
Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given.
The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat.
Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content.
This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long.
The moral of the story is to have a lot of RAM and space-efficient data representation.
The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic.
Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.
For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM <http://dbpedia.org> from each query. The returned row counts are indicated below each query's text.
sparql SELECT ?p ?o FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };
-- 1337 rows
sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM <http://dbpedia.org> WHERE {
?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
?film1 p:starring ?actor1 .
?film2 p:starring ?actor1 .
?film2 p:starring ?actor2 . };
-- 23910 rows
sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director FROM <http://dbpedia.org>
WHERE {
?artwork p:artist ?artist .
?artwork p:museum ?museum .
?museum p:director ?director };
-- 303 rows
sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
<http://dbpedia.org/resource/Berlin> geo:long ?berlinLong .
?s geo:lat ?lat .
?s geo:long ?long .
?s foaf:homepage ?homepage .
FILTER (
?lat <= ?berlinLat + 0.03190235436 &&
?long >= ?berlinLong - 0.08679199218 &&
?lat >= ?berlinLat - 0.03190235436 &&
?long <= ?berlinLong + 0.08679199218) };
-- 56 rows
sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
<http://dbpedia.org/resource/New_York_City> geo:long ?nyLong .
?s geo:lat ?lat .
?s geo:long ?long .
?s p:architect ?a .
?a foaf:homepage ?homepage .
FILTER (
?lat <= ?nyLat + 0.3190235436 &&
?long >= ?nyLong - 0.8679199218 &&
?lat >= ?nyLat - 0.3190235436 &&
?long <= ?nyLong + 0.8679199218) };
-- 13 rows
]]>Andy Seaborne and Eric Prud'hommeaux, editors of the SPARQL recommendation, convened a SPARQL birds of a feather session at WWW 2008. The administrative outcome was that implementors could now experiment with extensions, hopefully keeping each other current about their efforts and that towards the end of 2008, a new W3C working group might begin formalizing the experiences into a new SPARQL spec.
The session drew a good crowd, including many users and developers. The wishes were largely as expected, with a few new ones added. Many of the wishes already had diverse implementations, however most often without interop. I will below give some comments on the main issues discussed.
SPARQL Update - This is likely the most universally agreed upon extension. Implementations exist, largely along the lines of Andy Seaborne's SPARUL spec, which is also likely material for a W3C member submission. The issue is without much controversy; transactions fall outside the scope, which is reasonable enough. With triple stores, we can define things as combinations of inserts and deletes, and isolation we just leave aside. If anything, operating on a transactional platform such as Virtuoso, one wishes to disable transactions for any operations such as bulk loads and long-running inserts and deletes. Transactionality has pretty much no overhead for a few hundred rows, but for a few hundred million rows the cost of locking and rollback is prohibitive. With Virtuoso, we have a row auto-commit mode which we recommend for use with RDF: It commits by itself now and then, optionally keeping a roll forward log, and is transactional enough not to leave half triples around, i.e., inserted in one index but not another.
As far as we are concerned, updating physical triples along the SPARUL lines is pretty much a done deal.
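For reference, the statements in question look roughly like the following; the graph and data are illustrative, and the exact keywords varied somewhat between SPARUL drafts and implementations:
sparql PREFIX foaf: <http://xmlns.com/foaf/0.1/>
INSERT INTO GRAPH <http://example.org/people> {
  <http://example.org/person/1> foaf:name "Alice" };
sparql PREFIX foaf: <http://xmlns.com/foaf/0.1/>
DELETE FROM GRAPH <http://example.org/people> {
  <http://example.org/person/1> foaf:name "Alice" };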
The matter of updating relational data mapped to RDF is a whole other kettle of fish. On this, I should say that RDF has no special virtues for expressing transactions but rather has a special genius for integration. Updating is best left to web service interfaces that use SQL on the inside. Anyway, updating union views, which most mappings will be, is complicated. Besides, for transactions, one usually knows exactly what one wishes to update.
Full Text - Many people expressed a desire for full text access. Here we run into a deplorable confusion with regexps. The closest SPARQL has to full text in its native form is regexps, but these are not really mappable to full text except in rare special cases and I would despair of explaining to an end user what exactly these cases are. So, in principle, some regexps are equivalent to full text but in practice I find it much preferable to keep these entirely separate.
It was noted that what the users want is a text box for search words. This is a front end to the CONTAINS predicate of most SQL implementations. Ours is MS SQL Server compatible and has a SPARQL version called bif:contains. One must still declare which triples one wants indexed for full text, though. This admin overhead seems inevitable, as text indexing is a large overhead and not needed by all applications.
Also, text hits are not boolean; usually they come with a hit score. Thus, a SPARQL extension for this could look like
select * where { ?thing has_description ?d . ?d ftcontains "gizmo" ftand "widget" score ?score . }
This would return all the subjects, descriptions, and scores, from subjects with a has_description property containing widget and gizmo. Extending the basic pattern is better than having the match in a filter, since the match binds a variable.
The XQuery/XPath groups have recently come up with a full-text spec, so I used their style of syntax above. We already have a full-text extension, as do some others, but for standardization, it is probably most appropriate to take the XQuery work as a basis. The XQuery full-text spec is quite complex, but I would expect most uses to get by with a small subset, and the structure seems better thought out, at first glance, than the more ad-hoc implementations in diverse SQLs.
Again, declaring any text index to support the search, as well as its timeliness or transactionality, are best left to implementations.
Federation - This is a tricky matter. ARQ has a SPARQL extension for sending a nested set of triple patterns to a specific end-point. The DARQ project has something more, including a selectivity model for SPARQL.
With federated SQL, life is simpler since after the views are expanded, we have a query where each table is at a known server and has more or less known statistics. Generally, execution plans where as much work as possible is pushed to the remote servers are preferred, and modeling the latencies is not overly hard. With SPARQL, each triple pattern could in principle come from any of the federated servers. Associating a specific end-point to a fragment of the query just passes the problem to the user. It is my guess that this is the best we can do without getting very elaborate, and possibly buggy, end-point content descriptions for routing federated queries.
Having said this, there remains the problem of join order. I suggested that we enhance the protocol by allowing asking an end-point for the query cost for a given SPARQL query. Since they all must have a cost model for optimization, this should not be an impossible request. A time cost and estimated cardinality would be enough. Making statistics available à la DARQ was also discussed. Being able to declare cardinalities expected of a remote end-point is probably necessary anyway, since not all will implement the cost model interface. For standardization, agreeing on what is a proper description of content and cardinality and how fine grained this must be will be so difficult that I would not wait for it. A cost model interface would nicely hide this within the end-point itself.
With Virtuoso, we do not have a federated SPARQL scheme but we could have the ARQ-like service construct. We'd use our own cost model with explicit declarations of cardinalities of the remote data for guessing a join order. Still, this is a bit of work. We'll see.
For practicality, the service construct coupled with join order hints is the best short term bet. Making this pretty enough for standardization is not self-evident, as it requires end-point description and/or cost model hooks for things to stay declarative.
End-point description - This question has been around for a while; I have blogged about it earlier, but we are not really at a point where there would be even rough consensus about an end-point ontology. We should probably do something on our own to demonstrate some application of this, as we host lots of linked open data sets.
SQL equivalence - There were many requests for aggregation, some for subqueries and nesting, expressions in select, negation, existence and so on. I would call these all SQL equivalence. One use case was taking all the teams in the database and for all with over 5 members, add the big_team class and a property for member count.
With Virtuoso, we could write this as --
construct { ?team a big_team . ?team member_count ?ct } from ... where { ?team a team . { select ?team2 count (*) as ?ct where { ?m member_of ?team2 } . filter (?team = ?team2 and ?ct > 5) } }
We have pretty much all the SQL equivalence features, as we have been working for some time at translating the TPC-H workload into SPARQL.
The usefulness of these things is uncontested but standardization could be hard as there are subtle questions about variable scope and the like.
Inference - The SPARQL spec does not deal with transitivity or such matters because it is assumed that these are handled by an underlying inference layer. This is however most often not so. There was interest in more fine grained control of inference, for example declaring that just one property in a query would be transitive or that subclasses should be taken into account in only one triple pattern. As far as I am concerned, this is very reasonable, and we even offer extensions for this sort of thing in Virtuoso's SPARQL. This however only makes sense if the inference is done at query time and pattern by pattern. For instance, if forward chaining is used, this no longer makes sense. Specifying that some forward chaining ought to be done at query time is impractical, as the operation can be very large and time consuming and it is the DBA's task to determine what should be stored and for how long, how changes should be propagated, and so on. All these are application dependent and standardizing will be difficult.
Support for RDF features like lists and bags would all fall into the functions an underlying inference layer should perform. These things are of special interest when querying OWL models, for example.
Path expressions - Path expressions were requested by a few people. We have implemented some, as in
?product+?has_supplier+>s_name = "Gizmos, Inc.".
This means that one supplier of the product has the name "Gizmos, Inc.". This is a nice shorthand, but we run into problems if we start supporting repetitive steps, optional steps, and the like.
In conclusion, update, full text, and basic counting and grouping would seem straightforward at this point. Nesting queries, value subqueries, views, and the like should not be too hard if an agreement is reached on scope rules. Inference and federation will probably need more experimentation, but a lot can be had already with very simple fine-grained control of backward chaining, if such applies, or with explicit end-point references and explicit join order. These are practical but not pretty enough for committee consensus, would be my guess. Anyway, it will be a few months before anything formal will happen.
]]>We had a workshop on Linked Open Data (LOD) last week in Beijing. You can see the papers in the program. The event was a success with plenty of good talks and animated conversation. I will not go into every paper here but will comment a little on the conversation and draw some technology requirements going forward.
Tim Berners-Lee showed a read-write version of Tabulator. This raises the question of updating on the Data Web. The consensus was that one could assert what one wanted in one's own space but that others' spaces would be read-only. What spaces one considered relevant would be the user's or developer's business, as in the document web.
It seems to me that a significant use case of LOD is an open-web situation where the user picks a broad read-only "data wallpaper" or backdrop of assertions, and then uses this combined with a much smaller, local, writable data set. This is certainly the case when editing data for publishing, as in Tim's demo. This will also be the case when developing mesh-ups combining multiple distinct data sets bound together by sets of SameAs assertions, for example. Questions like, "What is the minimum subset of n data sets needed for deriving the result?" will be common. This will also be the case in applications using proprietary data combined with open data.
This means that databases will have to deal with queries that specify large lists of included graphs, all graphs in the store or all graphs with an exclusion list. All this is quite possible but again should be considered when architecting systems for an open linked data web.
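In query terms, the backdrop-plus-local-edits case is just an enumeration of graphs. In the simplest form it looks like the following, with placeholder graph URIs; real lists may run to a great many graphs:
sparql SELECT ?p ?o
FROM <http://example.org/backdrop/dbpedia>
FROM <http://example.org/my/annotations>
WHERE { <http://dbpedia.org/resource/Berlin> ?p ?o };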
"There is data but what can we really do with it? How far can we trust it, and what can we confidently decide based on it?"
As an answer to this question, Zitgist has compiled the UMBEL taxonomy using SKOS. This draws on Wikipedia, Open CYC, Wordnet, and YAGO, hence the acronym WOWY. UMBEL is both a taxonomy and a set of instance data, containing a large set of named entities, including persons, organizations, geopolitical entities, and so forth. By extracting references to this set of named entities from documents and correlating this to the taxonomy, one gets a good idea of what a document (or part thereof) is about.
Kingsley presented this in the Zitgist demo. This is our answer to the criticism about DBpedia having errors in classification. DBpedia, as a bootstrap stage, is about giving names to all things. Subsequent efforts like UMBEL are about refining the relationships.
"Should there be a global URI dictionary?"
There was a talk by Paolo Bouquet about Entity Name System, a sort of data DNS, with the purpose of associating some description and rough classification to URIs. This would allow discovering URIs for reuse. I'd say that this is good if it can cut down on the SameAs proliferation and if this can be widely distributed and replicated for resilience, à la DNS. On the other hand, it was pointed out that this was not quite in the LOD spirit, where parties would mint their own dereferenceable URIs, in their own domains. We'll see.
"What to do when identity expires?"
Giovanni of Sindice said that a document should be removed from search if it was no longer available. Kingsley pointed out that resilience of reference requires some way to recover data. The data web cannot be less resilient than the document web, and there is a point to having access to history. He recommended hooking up with the Internet Archive, since they make long term persistence their business. In this way, if an application depends on data, and the URIs on which it depends are no longer dereferenceable or provide content from a new owner of the domain, those who need the old version can still get it and host it themselves.
It is increasingly clear that OWL SameAs is both the blessing and bane of linked data. We can easily have tens of URIs for the same thing, especially with people. Still, these should be considered the same.
Returning every synonym in a query answer hardly makes sense but accepting them as input seems almost necessary. This is what we do with Virtuoso's SameAs support. Even so, this can easily double query times even when there are no synonyms.
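For illustration, here is roughly how this looks with Virtuoso's SPARQL pragma for sameAs expansion; the graph and data are hypothetical, and the point is only that a subject given in the query is accepted even if the store holds the facts under a synonym URI:

  define input:same-as "yes"
  prefix gn: <http://www.geonames.org/ontology#>
  select ?population
  from <http://example.org/geo>
  where
    {
      <http://dbpedia.org/resource/Berlin> gn:population ?population .
    }

The engine has to look up the owl:sameAs synonyms of the subject and retry the pattern for each of them, which is where the extra cost comes from even when no synonyms exist.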
Be that as it may, SameAs is here to stay; just consider the mapping of DBpedia to Geonames, for example.
Also, making aberrant SameAs statements can completely poison a data set and lead to absurd query results. Hence choosing which SameAs assertions from which source will be considered seems necessary. In an open web scenario, this leads inevitably to multi-graph queries that can be complex to write with regular SPARQL. By extension, it seems that a good query answer would also include the graphs actually used for deriving each result row. This is of course possible but has some implications on how databases should be organized.
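A sketch of what such a query could look like, with made-up graph IRIs: the trusted graphs are named explicitly, and the graph that supplied each row is returned alongside the bindings:

  prefix foaf: <http://xmlns.com/foaf/0.1/>
  select ?g ?person ?name
  from named <http://example.org/curated-people>
  from named <http://example.org/partner-feed>
  where
    {
      graph ?g
        {
          ?person foaf:name ?name .
        }
    }

This much is plain SPARQL; the harder part is doing the same once SameAs assertions from a further set of graphs are folded in.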
Yves Raymond gave a talk about deriving identity between Musicbrainz and Jamendo. I see the issue as a core question of linked data in general. The algorithm Yves presented started with attribute value similarities and then followed related entities. Artists would be the same if they had similar names and similar names of albums with similar song titles, for example. We can find the same basic question in any analysis, for example, looking at how news reporting differs between media, supposing there is adequate entity extraction.
There is basic graph diffing in RDFSync, for example. But here we are expanding the context significantly. We will traverse references to some depth, allow similarity matches, SameAs, and so forth. Having presumed identity of two URIs, we can then look at the difference in their environment to produce a human readable summary. This could then be evaluated for purposes of analysis or of combining content.
At first sight, these algorithms seem well parallelizable, as long as all threads have access to all data. For scaling, this means a probably message-bound distributed algorithm. This is something to look into for the next stage of linked data.
Some inference is needed, but if everybody has their own choice of data sets to query, then everybody would also have their own entailed triples. This will make for an explosion of entailed graphs if forward chaining is used. Forward chaining is very nice because it keeps queries simple and easy to optimize. With Virtuoso, we still favor backward chaining since we expect a great diversity of graph combinations and near infinite volume in the open web scenario. With private repositories of slowly changing data put together for a special application, the situation is different.
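As an illustration of the backward chaining style, Virtuoso attaches the inference context to the query rather than to the data; the rule set name below is hypothetical and would have been registered beforehand (e.g., with rdfs_rule_set) against the ontology graph:

  define input:inference "univ-rules"
  prefix ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
  select ?x
  where
    {
      ?x a ub:Professor .    # sub-classes of ub:Professor match at run time; nothing is materialized
    }

Each user or application can thus pick its own combination of data graphs and rule sets without anyone having to store the entailed triples.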
In conclusion, we have a real LOD movement with actual momentum and a good idea of what to do next. The next step is promoting this to the broader community, starting with Linked Data Planet in New York in June.
]]>"I give the search keywords and you give me a SPARQL end-point and a query that will get the data."
Thus did one SPARQL user describe the task of a semantic/data web search engine.
In a previous post, I suggested that if the data web were the size of the document web, we'd be looking at two orders of magnitude more search complexity. It just might be so.
In the conversation, I pointed out that a search engine might have a copy of everything and even a capability to do SPARQL and full text on it all, yet still the users would be better off doing the queries against the SPARQL end-points of the data publishers. It is a bit like the fact that not all web browsing runs off Google's cache. With the data web, the point is even more pronounced, as serving a hit from Google's cache is a small operation but a complex query might be a very large one.
Yet, the data web is about ad-hoc joining between data sets of different origins. Thus a search engine of the data web ought to be capable of joining also, even if large queries ought to be run against individual publishers' end-points or the user's own data warehouse.
For ranking, the general consensus was that no single hit-ranking would be good for the data web. Word-frequency-based scores are fine for text hits, but beyond that nothing is obvious. I would think that some link analysis could apply, but this will take more experimentation.
For search summaries, if we have splitting of data sets into small fragments à la Sindice, search summaries are pretty much the same as with just text search. If we store triples, then we can give text style summaries of text hits in literals and Fresnel lens views of the structured data around the literal. For showing a page of hits, the lenses must abbreviate heavily but this is still feasible. The engine would know about the most common ontologies and summarize instance data accordingly.
Chris Bizer pointed out that trust and provenance are critical, especially if an answer is arrived at by joining multiple data sets. The trust of the conclusion is no greater than that of the weakest participating document. Different users will have different trusted sources.
A mature data web search engine would combine a provenance/trust specification, a search condition consisting of SPARQL or full text or both, and a specification for hit rank. Again, most searches would use defaults, but these three components should in principle be orthogonally specifiable.
Many places may host the same data set either for download or SPARQL access. The URI of the data set is not its URL. Different places may further host multiple data sets on one end-point. Thus the search engine ought to return all end-points where the set is to be found. The end-points themselves ought to be able to say what data sets they contain, under what graph IRIs. Since there is no consensus about end-point self description, this too would be left to the search engine. In practice, this could be accomplished by extending Sindice's semantic site map specification. A possible query would be to find an end-point containing a set of named data sets. If none were found, the search engine itself could run a query joining all the sets since it at least would hold them all.
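As a sketch of the kind of lookup this implies, assume the search engine keeps its own registry graph with a purely illustrative vocabulary (the ex: terms and dataset URIs below are made up):

  prefix ex: <http://example.org/registry#>
  select ?endpoint
  where
    {
      ?endpoint a ex:SparqlEndpoint ;
                ex:hostsDataset <http://example.org/dataset/uniprot> ;
                ex:hostsDataset <http://example.org/dataset/wordnet> .
    }

If the query comes back empty, the engine falls back to running the join over its own copies of the sets, as described above.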
Since many places will host sets like Wordnet or Uniprot, indexing these once for each copy hardly makes sense. Thus a site should identify its data by the data set's URI and not the copy's URL.
It came up in the discussion that search engines should share a ping format so that a single message format would be enough to notify any engine about data being updated. This is already partly the case with Sindice and PTSW (PingTheSemanticWeb) sharing a ping format.
Further, since it is no trouble to publish a copy of the 45G Uniprot file but a fair amount of work to index it, search engines should be smart about processing requests to index things, since these can amount to a denial of service attack.
Probably very large data sets should be indexed only in the form supplied by their publisher, and others hosting copies would just state that they hold a copy. If the claim to the copy proved false, users could complain and the search engine administrator would remove the listing. It seems that some manual curating cannot be avoided here.
It seems there can be an overlap between the data web search and the data web hosting businesses. For example, Talis rents space for hosting RDF data with SPARQL access. A search engine should offer basic indexing of everything for free, but could charge either data publishers or end users for running SPARQL queries across data sets. These do not have the nicely anticipatable and fairly uniform resource consumption of text lookups. In this manner, a search provider could cost-justify the capacity for allowing arbitrary queries.
The value of the data web consists of unexpected joining. Such joining takes place most efficiently if the sources are at least in some proximity, for example in the same data center. Thus the search provider could monetize functioning as the database provider for mesh-ups. In the document web, publishing pages is very simple and there is no great benefit from co-locating search and pages, rather the opposite. For the data web, the hosting with SPARQL and all is more complex and resembles providing search. Thus providing search can combine with providing SPARQL hosting, once we accept in principle that search should have arbitrary inter-document joining, even if it is at an extra premium.
The present search business model is advertising. If the data web is to be accessed by automated agents such as mesh-up code, display of ads is not self-evident. This is quite separate from the fact that semantics can lead to better ad targeting.
One model would be to do text lookups for free from a regular web page but show ads, just as with Google search ads. Using the service via web services for text or SPARQL would have a cost paid by the searching or publishing party and would not be financed by advertising.
In the case of data used in value-add data products (mesh-ups) that have financial value to their users, the original publisher of the data could even be paid for keeping the data up-to-date. This would hold for any time-sensitive feeds like news or financial feeds. Thus the hosting/search provider would be a broker of data-use fees and the data producer would be in the position of an AdSense inventory owner, i.e., a web site which shows AdSense ads. Organizing this under a hub providing back-office functions similar to an ad network could make sense even if the actual processing were divided among many sites.
Kingsley has repeatedly formulated the core value proposition of the semantic web in terms of dealing with information overload: There is the real-time enterprise and the real-time individual and both are beasts of perception. Their image is won and lost in the Internet online conversation space. We know that allegations, even if later proven false, will stick if left unchallenged. The function of semantics on the web is to allow one to track and manage where one stands. In fact, Garlik has made a business of just this, but now from a privacy and security angle. The Garlik DataPatrol harvests data from diverse sources and allows assessing vulnerability to identity theft, for example.
If one is in the business of collating all the structured data in the world, as a data web search engine is, then providing custom alerts for both security and public image management is quite natural. This can be a very valuable service if it works well.
At OpenLink, we will now experiment with the Sindice/Zitgist/PingTheSemanticWeb content. This is a regular part of the productization of Virtuoso's cluster edition. We expect to release some results in the next 4 weeks.
Following my return from WWW 2008 in Beijing, I will write a series of blog posts discussing diverse topics that were brought up in presentations and conversations during the week.
Linked data was our main interest in the conference and there was a one day workshop on this, unfortunately overlapping with a day of W3C Advisory Committee meetings. Hence Tim Berners-Lee, one of the chairs of the workshop, could not attend for most of the day. Still, he was present to say that "Linked open data is the semantic web and the web done as it ought to be done." For my part, I will draw some architecture conclusions from the different talks and extrapolate about the requirements on database platforms for linked data.
Chris Bizer predicted that 2008 would be the year of data web search, if 2007 was the year of SPARQL. This may be the case, as linked data is now pretty much a reality and the questions of discovery become prevalent. There was a birds-of-a-feather session on this and I will make some comments on what we intend to explore in bridging between the text index based semantic web search engines and SPARQL.
Andy Seaborne convened a birds-of-a-feather session on the future of SPARQL. Many of the already anticipated and implemented requirements were confirmed and a few were introduced. A separate blog post will discuss these further.
From the various discussions held throughout the conference, we conclude that plug-and-play operation with the major semantic web frameworks (Jena, Sesame, and Redland) is our major immediate-term deliverable. Our efforts in this direction thus far are insufficient, and we will next have these done with the right supervision and proper interop testing. The issues are fortunately simple, but doing things totally right requires some small server-side support and some JDBC/ODBC tweaks, so we advise interested parties to wait for an update to be published on this blog.
I further had a conversation with Andy Seaborne about using Jena reasoning capabilities with Virtuoso and generally the issues of "impedance mismatch" between reasoning and typical database workloads. More on this later.
Now that it is time to nail down the final database format and configuration steps for Virtuoso Cluster, we have to make a couple of small choices.
While fixed partitioning is good for testing, it is impractical even for measuring scalability, let alone deployment. So fixed partitioning is really a no-go. The best would be entirely dynamic partitioning where data files split after reaching a certain size and migrated between any number of heterogeneous boxes, so as to equalize pressure, like a liquid fills a container. It would have to be a sluggish cool liquid with a high surface tension, else we would get "Brownian motion" in trying to equalize the pressure too often. But a liquid still.
Sure. Now go implement this with generic RDBMS ACID semantics and the works. Feasible. But while I am a fair-to-good geek, I don't have time for this right now.
So something in between, then. The auto-migrating files are really a problem for keeping locks, keeping tractable roll forward, special cases in message routing etc. The papers by Google and Amazon allude to this and they stop far short of serializable transaction semantics.
So what is the design target? In practice, running a few hundred database nodes with fault tolerance. The nodes must be allowed to be of different sizes since a one-time replacement of hundreds of PCs fits ill with the economics.
Addition and removal of nodes must be without global downtime but putting some partitions in read only mode for restricted periods is OK. Assigning the percentages of the data mass to the nodes can be a DBA task with a utility making suggestions based on measured disk cache hit rates and the like. Partitions and the makeup of the cluster must be maintainable from a single place, with no copying of config files or such; nobody can get that right, and the screw-ups from having cluster nodes disagree about some part of the partition map are so intractable you don't even want to go there.
So, how is this done? Divide the space of partitioning hashes into say 1K slices. For each slice, give the node numbers that hold this slice of the partitioning space. If there are more nodes, just divide the space in more slices. When a node comes new to the cluster, it manages no slices. Slices from other nodes can be offloaded to the new one by copying their contents. The slice holders put the slice in read only mode and the new node just does an insert-select of the partitions it is meant to take over, no logs or locks needed and all comes in order so insert is 100% in processor cache. When the copy is complete, the slice comes back in read write mode in its new location and the old holders of the slices delete theirs in the background. This is not as quick as shifting database files around but takes far less programming.
To remove a node, reassign its slices and have the assignees read the data, as above. When the copy is done, the node can go off line.
A slight modification of this is for cases where a slice always has more than one holder. Now, if we allocate slices to nodes on the go, keeping would-be replicas on different machines but otherwise equalizing load, we run into a problem with redundancy when we perform an operation that has no specific recipient. Suppose we count a table's rows: we should send the count request once per slice. But the slice is not the recipient of the request; the cluster node hosting the slice is. We should either qualify the request with the slices it pertains to, which means extra reading and filtering, or arrange it so that no matter the choice of replicas for any operation, there is no overlap between their contents, yet the contents cover every slice. The only simple way to enforce the latter is to have cluster nodes pair-wise as each other's replicas. Nodes will then be added in pairs or threes or whatever number of replicas there will be. Google's GFS (the file system redundancy layer under Bigtable) and Dynamo do not have this problem since they do not deal with generic DBMS semantics. The downside is that if a pair has two different types of boxes, the sizing should go according to the smaller one. No big deal.
If replicas are assigned box by box instead of slice by slice, life is also simpler in terms of roll forward and reconstituting lost nodes.
The complete list of cluster nodes and their slice allocations are kept on a master node. Each node also knows which slices it holds. With the loss of the master the situation can be reconstructed from the others. For normal boot, the nodes get the cluster layout, slice allocation and some config parameters from the master, so that if network addresses change they do not have to be written to each node. These are remembered by each node though, for the event of master failure.
When a Virtuoso 6 database is made, the system objects are partitioned from the get-go. On a single process this has no effect. But since all the data structures exist, the transition from a single process to cluster and back is smooth.
At any time, a database, whether single or cluster, is a self-contained unit. One server process serves one database. It is one-to-many from database to server process. A server process will not mount multiple databases. This could be done, but it would be more changes than fit in the time available, so it will not be supported. One can always have multiple server processes and attached tables for genuinely inter-database operations. This said, a single database holds arbitrarily many catalogs and schemas and application objects of all sorts.
In terms of schedule, we do the single copy per partition right now. Duplicate and triplicate copies of partitions are needed as we do some of our web scale things in the pipeline. So a degree of this is supported even now, but without seamless recovery: when a replica is offline, the remaining copy is read-only. Making this read-write is a matter of a little programming. DDL operations will continue to require all nodes to be online.
As of this writing, we are making the regular test suite run on a cluster with the above described partitioning, a single copy per partition. After this, the database layout will stay constant. The first deployments will go out without replicated partitions. Replicated partitions will follow shortly, together with some of the optimizations mentioned in previous posts.
A while back, a friend suggested he and I go check out the Singularity Summit, a conference where they talk of strong AI. Well, I am not a singularist. But since singularists are so brazenly provocative, I looked a little to see if there is anything there, engineering-wise.
So, for a change, we'll be visionary. Read on, I also mention RDF at the end.
I will not even begin with arguments about indefinitely continuing a trend. I will just say that nature has continuous and discontinuous features. Computing is at a qualitative bend and things are not as simple as they are blithely cracked up to be. Read HEC, the high end crusader; he has good commentary on architecture. Having absorbed and understood this, you can talk again about a billion-fold increase in computing price/performance.
When I looked further, about uploading, i.e., the sci-fi process of scanning a brain and having it continue its function inside a computer simulation, I actually found some serious work. Dharmendra Modha at IBM had made a rat-scale cortical simulator running on a 32K node IBM BlueGene, with a 9x slowdown from real time. This is real stuff and in no way meant to be associated with singularism by being mentioned in the same paragraph. Anyway, I was intrigued.
So I asked myself if I could do better. I gave it a fair try, as fair as can be without an actual experiment. The end result is that latency rules. To have deterministic semantics, one must sync across tens of thousands of processors and one cannot entirely eliminate multi-hop messages on the interconnect fabric. If one optimistically precalculates and spreads optimistic results over the cluster, rolling them back will be expensive. Local optimistic computation may have little point since most data points will directly depend on non-local data. One needs two sets of wiring, a 3D torus for predominantly close range bulk and a tree for sync, just like the BlueGene has. Making the same from newer hardware makes it more economical but there's another 8 or so orders of magnitude to go before energy efficiency parity with biology. Anyway, a many-times-over qualitative gap. Human scale in real time might be just there somewhere in reach with the stuff now in pre-release if there were no bounds on cost or power consumption. We'd be talking billions at least and even then it is iffy. But this is of little relevance as long as the rat scale is at the level of a systems test for the platform and not actually simulating a plausible organism in some sort of virtual environment. Anyway, of biology I cannot judge, and for the computer science, Modha et al have it figured out about as well as can be.
Simulation workloads are not the same as database. Database is easier in a way since any global sync is very rare. 2PC seldom touches every partition and if it does, the time to update was anyway greater than the time to commit. Databases are generally multi-user, with a low level of sync between users and a lot of asynchrony. On the other hand, the general case of database does not have predictable cluster locality. Well, if one has an OLTP app with a controlled set of transactions and reports, then one can partition just for that and have almost 100% affinity between the host serving the connection and the data. With RDF for example, such is not generally possible or would require such acrobatics of data layout that a DBA would not even begin.
So, for database, there is a pretty much even probability for any connection between the node running the query and any other. True, function shipping can make the messages large and fairly async and latency tolerant.
On the simulation side, it would seem that the wiring can mimic locality of the simulated process. Messages to neighbors are more likely than messages to remote nodes. So a 3D torus works well there, complemented by a tree for control messages, like sending all nodes the count of messages to expect. Of course, the control wiring (reduce tree) must have far less latency than a long path through the torus wiring and steps with long message deliveries on the torus must be rare for this to bring any profit. Also, long distance messages can go through the tree wiring if the volume is not excessive, else the tree top gets congested.
So, looking past the multicore/multithread and a single level switched interconnect, what would the architecture of the total knowledge machine be? For the neural simulation, the above-described is the best I can come up with and IBM already has come up with it anyway. But here I am more concerned with a database/symbol processing workload than about physical simulation. For the record, I'll say that I expect no strong AI to emerge from these pursuits but that they will still be useful, basically as a support for a linked data "planetary datasphere." Like Google, except it supports arbitrary queries joining arbitrary data at web scale. This would be a couple of orders of magnitude above the text index of same. Now, in practice, most queries will be rather trivial, so it is not that the 2 orders of magnitude are always realized. This would also involve a bit more updating than the equivalent text index since reports would now and then include private materialized inference results. As a general rule, backward chaining would be nicer, since this is read only but some private workspaces for distilling data cannot be avoided.
So, given this general spec, where should architecture go? We can talk of processors, networks and software in turn.
For evaluating processors, the archetypal task is doing a single random lookup out of a big index. Whether it is a B tree or hash is not really essential since both will have to be built of fixed size pages and will have more than one level. Both need some serialization for checking if things are in memory and for pinning them in cache for the duration of processing. This is eternally true as long as updates are permitted, even if RAM were the bottom of the hierarchy. And if not, then there is the checking of whether a page is in memory and the logic for pinning it.
This means critical sections on shared data structures with small memory writes inside. This is anathema, yet true. So, what the processor needs is shared memory for threads and if possible an instruction for entry into a read-write lock. If the lock is busy, there should be a settable interval before the thread is removed from the core. With a multithread core, this should be just like a memory cache miss. Only if this were really prolonged would there be an interrupt to run the OS scheduler to vacate the thread and load something else, and not even this if the number of executable threads were less than or equal to the number of threads on the cores.
For thread sync structures, the transactional memory ideas in the SPARC Rock might be going in this direction. For on-core threading, I have not tried how well SPARC T2 does but with the right OS support, it might be in the ballpark. The X86_64 chips have nice thread speed but the OS, whether Solaris or Linux, is a disaster if a mutex is busy. Don't know why but so it is.
Things like IBM Cell, with multiple integrated distributed memory processors on a chip might be workable if they had hardware for global critical sections and if sub-microsecond tasks could be profitably dispatched to the specialized cores (synergistic unit in Cell terminology). If an index lookup, about 2-4 microseconds of real time, could be profitably carried out on a synergistic core without sinking it all in sync delays on the main core, there might be some gain to this. Still, operations like cache lookups tend to be pointer chasing and latency of small memory reads and sometimes writes is a big factor. I have not measured Cell for this but it is not advertised for this sort of workload. It might do nicely for the neuron simulator but not for generic database, I would guess.
The point at which distributed memory wins over shared determines the size of the single compute node. Problems of threads and critical sections fall off but are replaced by network and the troubles of scheduling and serializing and de-serializing messages. There are really huge shared memory systems (by Cray, for example) but even there, hitting the wrong address sends a high latency message over an internal switched fabric, a bit like a network page fault. Well, if one has millions of threads runnable for catching the slack of memory access over interconnect, this might not be so bad, but this is not really the architecture for a general purpose database. So, at some point, we have clearly demarcated distributed memory with affinity of data to processor, simple as that.
For now, the high end clustered database benchmarks run off 1 Gbit ethernet fabrics with some 30 compute nodes on a single switch. This is for shared nothing systems. For shared disk, cache fusion systems like Oracle RAC, we have more heavy duty networking like Infiniband, as one would expect. I have discussed the merits of cache fusion vs. shared nothing partitioning in a previous post.
As long as we are at a scale that fits on a single switch with even port to port latency, we are set. For an RDF workload, throughput is not really the issue but latency can be. With today's technology, nodes with 4-8 cores and 16G RAM are practical and their number is most often not really large. Adding two orders of magnitude, we get more questions. Let's say that 2 billion triples fit with relative comfort but not without disk on a 16G RAM node. This would make 500 nodes for a trillion triples.
This is an entirely relevant and reasonable scale, considering that loading all public biomedical sets plus a pharma company's in house data could approach this ballpark. Let alone anything on the scale of the social web, i.e., the online conversation space.
So how to manage clusters like this? The current cluster gear is oriented toward switch trees and is fairly expensive, about $1000 per node. To make this easy, we would need a wiring-free, modular architecture. Picture a shelf with SATA-drive-size bays; each would get a compute node of 8 cores and 16G RAM plus interconnect. To simplify the network, the module would have a laser on each side for a cube network topology; the enclosure could connect the edges with fiber optics for the 3D torus. Mass storage would be in the same form factor, as disks or flash, to be interspersed in the proper ratio with compute nodes. All would have the same communication chip, something like a Cray SeaStar with 6 or 7 external ports. A seventh port could be used for a reduce tree. The rack would provide cooling on the shelves by circulating a coolant fluid. Reconfiguring and scaling would be a matter of adding shelves to the cabinet and laying out Lego bricks, blue for compute and red for storage. The network could achieve a latency for an arbitrary point-to-point message of no more than 10-20 microseconds by the right mix of cube and tree. This in a box of thousands of nodes.
This type of layout would accommodate web scale without needing rows and rows of rack cabinets.
The external interfaces for web requests and replies are no big deal after this, since the intra-cluster traffic is orders of magnitude higher than the result set traffic. After all, the end user does not want dumps of the database but highly refined and ranked results.
We have previously talked about dynamic partitions a la Google Bigtable or Amazon Dynamo. These techniques are entirely fine and will serve for the universal knowledge store.
But what about query logic? OK, having a consistent map of partitions shared over tens of thousands of nodes is entirely possible. So is replication and logic for using a spatially closer partition when multiple copies are known to exist. These things take some programming but there is nothing really new about them. These are a direct, straightforward extension of what we do for clustering right now.
But look to windward. How to run complex queries and inference on the platform outlined above? There are some features of RDF querying like same-as that can be easily parallelized, backward-chaining style: Just proceed with the value at hand and initiate the lookup of synonyms and let them get processed when they become available. Same for subclass and sub-property. We already do this, but could do it with more parallelism.
No matter what advances in architecture take place, I do not see a world where every user materializes the entailment of their own same-as-es and rules over a web scale data set. So, backward chaining approaches to inference must develop. Luckily, what most queries need is simple. A rule-oriented language, like Prolog without cut will parallelize well enough. Some degree of memoization may be appropriate for cutting down on re-proving the same thing over and over. Memoization over a cluster is a problem though, since this involves messages. I should say that one should not go looking for pre-proven things beyond the node at hand and that computation should not spread too quickly or too promiscuously so as not to make long message paths. We must remember that a totally even round trip time on a large cluster just will not happen.
Query planning on any system critically depends on correct ballpark guesses on cardinality. If predicates are amplified with transitivity and same-as at run time, the cardinality guessing becomes harder. This can kill any plan no matter the hardware. Probably, an executing query must be annotated with the underlying cardinality assumptions. If these prove radically false, the execution may abort and a new plan be made to better match what was found out. This bears some looking into.
There are some network algorithms like shortest path, traveling salesman and similar that probably deserve a special operator in the query language. These can benefit from parallelization and a sequential implementation running on a cluster with latency will be a disaster. Expressing the message flow in a rule language is not really simple and pretty much no programmer will either appreciate the necessity or go to the trouble. Therefore such things should likely be offered by the platform and made by the few people who understand such matters.
For forward chaining, it seems that any results should generally go into their own graph so as not to pollute the base data. This graph, supposing it is small enough, can have different partitioning from the large base data set. If the data comes from far and wide but results are local, a RETE-like algorithm for triggering inference as data comes in can be put to better use. RETE will parallelize well enough, also on clusters; results just have to be broadcast to the nodes that may have a use for them.
The programming model will typically be using a set of local overlays on a large shared data set. Queries will most often not be against a single graph. Strict transactionality will be the exception rather than the rule. At the database node level, there must be real transactionality even if the whole system most often did not run with strict ACID semantics. This is due to ACID requirements of some internal ops, e.g., some bitmap index operations, log checkpoints, DDL changes, etc.
For procedural tasks, map-reduce is OK. We have it even in our SQL. Map-reduce is not the basis for making a DBMS but it is a nice feature for some parts of query evaluation and application logic.
We have not talked about linking the data itself, but there is a whole workshop on this next week in Beijing; I will write about it separately. Let this just serve to state that we are serious about the platform for this, present and future.
The web scale database, the Google with arbitrary joining and inference, is one generation away, talking of economical implementation. Today, a price tag of $100K will buy some 50-100 billion triples worth with reasonable query response. An unlimited budget will take this a bit further, maybe one order of magnitude, and then returns might be diminishing.
Of course, this is worth nothing if the software isn't there. Virtuoso with its cluster edition will allow one to use unlimited RAM and processors. The frontier is now at getting just the right parallelism for each task, including inference ones.
We now have Virtuoso 6 running most of its test suite in single process and cluster modes. It is now time to finalize how it is to be configured and deployed. A bit more on this later.
We would have been done in about half time if we had not also redone the database physical layout with key compression. Still, if we get 3x more data in the same memory, using 64 bit ids for everything, the effort is justified. For any size above 2 billion triples, this means 3x less cost.
A good amount of the time and effort goes into everything except the core. Of course, we first do the optimizations we find appropriate and measure them. After all, the rest has no point if these do not run in the desired ballpark.
For delivering something, the requirements are quite the opposite: For example, when defining a unique index, what to do when the billionth key turns out not to be unique? And then what if one of the processes is killed during the operation? Does it all come out right also when played from the roll forward log? See what I mean? There is no end of such.
So, well after we are done with the basic functionality, we have to deal with this sort of thing. Even if we limited ourselves to RDF workloads only in the first cut, we still would need to do this since maintaining this would simply not be possible without some generic DBMS functionality. So we get the full feature generic clustered RDBMS in the same cut, no splitting the deliverable.
The basic cluster execution model is described here.
There are some further optimizations that we will do at and around the time of first public cut.
These have to do mostly with execution scheduling. For example, a bitmap intersection join must be done differently from a single server when there is latency in getting the next chunk of bits. Value sub-queries, derived tables and existences must be started as batches, just like joined tables.
Having too many threads on an index is no good. But having a large batch of random lookups to work with, even when each of them does not have its own thread, gives some possibilities for IO optimization. When one would block for disk, start the disk asynchronously, like with a read ahead, and do the next index lookup from the batch. This is especially so in cluster situations where the index lookups naturally come in "pre-vectored" batches. You could say that the loop join is rolled out. This is done anyhow for message latency reasons.
Do we optimize for the right stuff? Well, looking into the future, it does not look like regular RAM will be the bottom of the storage hierarchy, no matter how you look at it. With solid state disks, locality may not be so important but latency is there to stay. With everything now growing sideways, as in number of cores and core multithreading, we are just looking at deepening our already warm and intimate relationship with the Moira who cuts the thread, Atropos, Lady Latency. The attention of the best minds of the industry is devoted to thee.
I will here summarize the developments since the last Virtuoso 5 Open Source release.
On the RDF side, the bitmap intersection join has been improved quite a bit so that it is now almost always more than 2x more efficient than the equivalent nested loop join.
XML trees in the object position in RDF quads were in some cases incorrectly indexed, leading to failure to retrieve quads. This is fixed and should problems occur in existing databases, they can be corrected by simply dropping and re-creating an index.
Also the cost model has been further tuned. We have run the TPC-H queries with larger databases and have profiled it extensively. There are improvements to locking, especially for concurrency of transactions with large shared lock sets, as is the case in the TPC-H queries. The rules stipulate that these have to be run with repeatable read. There are also optimizations for decimal floating point.
A sampling of TPC-H queries translated into SPARQL comes with the new demo database. These show a live sample of the TPC-H schema translated into linked data, complete with SPARQL translations of the original queries. Some work is still ongoing there but the relational to RDF mapping is mature enough for real business intelligence applications now.
On the closed source side, we have some adjustments to the virtual database. When using Virtuoso as a front end to Oracle, using the TPC-H queries as a metric, the virtual database overhead is minimal. Previously, we had some overhead because some queries were rewritten in a way that Oracle would not optimize as well as the original TPC-H text. Specifically, turning an IN sub-query predicate into an equivalent EXISTS did not sit well with Oracle.
We have a new demo online at http://demo.openlinksw.com/tpc-h. This takes the industry standard TPC-H benchmark data and presents it as linked data with a SPARQL end point and dereferenceable URIs.
This is an example of using Virtuoso's relational-to-RDF mapping for publishing business data, for browsing using the linked data principles and opening it to analytics queries in SPARQL.
As noted before, we have extended SPARQL with aggregation and nested queries, thus making it a viable SQL substitute for decision support queries.
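As a small taste of what the extensions allow, here is a sketch in the style of the demo data; it is not one of the official queries, the tpcd:supplier class is assumed to exist in the mapping, and the exact aggregate and grouping syntax may differ slightly from what finally ships:

  prefix tpcd: <http://www.openlinksw.com/schemas/tpcd#>
  select ?supp+>tpcd:has_nation+>tpcd:name as ?nation, avg(?supp+>tpcd:acctbal) as ?avg_acctbal
  from <http://example.com/tpcd>
  where
    {
      ?supp a tpcd:supplier .
    }
  order by desc (?avg_acctbal)

The intent is simply that the kind of grouping, aggregation, and ranking one takes for granted in SQL decision support can be written directly over the mapped data.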
The article at http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData gives details and the source code for the implementation.
We are still working on some aspects of the more complex TPC-H queries, thus the demo is not complete with all the 22 queries. This is however enough to see a representative sample of how analytics queries work with SPARQL and Virtuoso's SQL-to-RDF mapping. The demo will be part of the next Virtuoso Open Source download, probably out next week.
In the interest of participating in a community benchmark development process, I will here outline some desiderata and explain how we could improve on LUBM. I will also touch on the message such an effort ought to convey.
A blow-by-blow analysis of the performance of a complex system such as a DBMS is more than fits within the scope of human attention at one go. This is why this all must be abbreviated into a single metric. Only when thus abbreviated, can this information be used in context. The metric's practical value is relative to how well it predicts the performance of the system in some real task. This means a task not likely to be addressed by an alternative technology, unless the challenger clearly beats the incumbent.
A benchmark is promotional material, both for the system being benchmarked and for the technology as a whole. This is why the benchmark, whatever it does, should do something that the technology does well, surely better than any alternative technology. A case in point is that one ought not to take a pure relational workload and RDF-ize it, for then the relational variant is likely to come out on top.
In this regard LUBM is not so bad because its reliance on class and property hierarchies and the occasional transitivity or inference rule makes the workload typically RDF, a little ways apart from a purely relational implementation of the task.
RDF's claim to fame is linked data. This means giving things globally unique names and thereby making anything joinable with anything else, insofar there is agreement on the names. RDF is a key to a new class of problems, call it web scale database. Web scale here refers first to heterogeneity and multiplicity of independent sources and secondly to volume of data.
Now there are plenty of relational applications with very large volumes of data. On the non-relational side, there are even larger applications, such as web search engines. All these have a set schema and a specific workload they are meant to address. RDF versions of such are conceivable but hold no intrinsic advantage if considered in the specific niche alone.
The claim to fame of RDF is not to outperform these on their home turf but to open another turf altogether, allowing agile joining and composing of all these resources.
This is why a benchmark, i.e., an advertisement for the RDF value proposition, should not just take a relational workload and RDF-ize it. The benchmark should carry some of the web in it.
If we just intend to measure how well an RDF store joins triples to other triples, LUBM is almost good enough. If it defined a query mix with different frequencies for short and long queries and a concurrent query metric, it would be pretty much there. Our adaptation of it is adequate for counting joins per second. But joins per second is not a value proposition.
So we have two questions: first, if we just take the RDF model and SPARQL as they stand, how do we make a benchmark that fills in what LUBM does not cover? And second, how do we build something that shows RDF at its best, beyond what the current SPARQL recommendation can express?
The answers to the first are not very complex:
Add some optionals. Have different frequencies of occurrence for some properties.
Add different graphs. Make queries joining between graphs and drawing on different graphs. Querying against all graphs of the store is not a part of the language. Still this would be useful but leave it out for now.
Add some filters and arithmetic. Not much can be done there, though, because expressions cannot be returned and there is no aggregation or grouping.
Split the workload into short and long queries. The short should be typical for online use and the long ones for analysis. Different execution frequencies for different queries is a must. Analysis is limited by lack of grouping, expressions or aggregation. Still, something can be contrived by looking for a pattern that does not exist or occurs extremely rarely. Producing result sets of millions of rows is not realistic.
Many of the LUBM queries return thousands of rows, even when scoped to a single university. This is not very realistic. No user interface displays that sort of quantity. Of course, the intermediate results can be as large as you please, but the output must be somehow ranked. SPARQL has order by and limit, so these will have to be used. TPC-H, for example, almost always has a group by/order by combination and sometimes a limit on result rows.
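For example, a query in the intended style would return a ranked page rather than a dump; the following LUBM-flavored sketch (class, property, and instance names approximate) asks for the first screenful only:

  prefix ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
  select ?student ?name
  where
    {
      ?student a ub:GraduateStudent ;
               ub:memberOf <http://www.Department0.University0.edu> ;
               ub:name ?name .
    }
  order by ?name
  limit 20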
The degree of inference in LUBM is about right, mostly sub-classes and sub-properties, nothing complex. We certainly regard this as a database benchmark more than a knowledge representation or rule system one.
LUBM does an OK job of defining a scale factor. I think that a concurrent query metric can just be so many queries per time at a given scale. The number of clients, I would say, can be decided by the test sponsor, taking whatever works best. A load balancer or web server can always be tuned to enforce some limit on concurrency. I don't think that a scale rule like the one in TPC-C, where only so many transactions per minute are allowed per warehouse, is needed here. The effect of such a rule is that when reporting a higher throughput, one has to automatically have a bigger database.
There is nothing to prevent these improvements from being put into a subsequent version of LUBM.
Building something that shows RDF at its best is a slightly different proposition. For this, we cannot be limited to the SPARQL recommendation and must allow custom application code and language extensions. Examples would be scripting similar to SQL stored procedures and extensions such as we have made for sub-queries and aggregation, explained a couple of posts back.
Maybe the Billion Triples challenge produces some material that we can use for this. We need to go for spaces that are not easily reached with SQL, have distributed computing, federation, discovery, demand driven import of data and such like.
I'll write more about ways of making RDF shine in some future post.
There are two kinds of workloads: online and offline. Online is what must be performed in an interactive situation, without significant human perceptible delay, i.e. within 500 ms. Anything else is offline.
Because this is how any online system is designed, this should be reflected in the benchmark. Ideally we would make two benchmarks.
We have now run the LUBM benchmark on Virtuoso v6, with the same configuration as discussed last Friday.
We had a database of 8000 universities, and we ran 8 clients on slices of 100, 1000 and 8000 universities — same data but different sizes of working set.
100 universities: 35.3 qps
1000 universities: 26.3 qps
8000 universities: 13.1 qps
The 100 universities slice is about the same as with v5.0.5 (35.3 vs 33.1 qps).
The 8000 universities set is almost 3x better (13.1 vs. 4.8 qps).
This comes from the fact that the v6 database takes half of the space of the v5.0.5 one. Further, this is with 64-bit IDs for everything. If the 5.0.5 database were with 64-bit IDs, we'd have a difference of over 3x. This is worth something if it lets you get by with only 1 terabyte of RAM for the 100 billion triple application, instead of 3 TB.
In a few more days, we'll give the results for Virtuoso v6 Cluster.
We have now taken a close look at the query side of the LUBM benchmark, as promised a couple of blog posts ago.
We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database.
When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130.
We adapted some of the queries so that they do not run over the whole database. In terms of retrieving triples per second, this would be about 330000 for the rate of 33 qps, with 4 cores at 2GHz. This is a combination of random access and linear scans and bitmap merge intersections; lookups for non-found triples are not counted. The rate of random lookups alone based on known G, S, P, O, without any query logic, is about 250000 random lookups per core per second.
The article LUBM and Virtuoso gives the details.
In the process of going through the workload we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things.
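In SPARQL terms, the case in question is roughly the pattern below (LUBM-style names, approximate); each triple pattern on ?x yields a bitmap of candidate subjects, and the join reduces to intersecting the bitmaps:

  prefix ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
  select ?x
  where
    {
      ?x a ub:FullProfessor ;
         ub:degreeFrom <http://www.University0.edu> .
    }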
Now, what has been said about RDF benchmarking previously still holds. What does it mean to do so many LUBM queries per second? What does this say about the capacity to run an online site off RDF data? Or about information integration? Not very much. But then this was not the aim of the authors either.
So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there.
In the immediate future, we have the general availability of Virtuoso Open Source 5.0.5 early next week. This comes with a LUBM test driver and a test suite running against the LUBM qualification database.
After this we will give some numbers for the cluster edition with LUBM and TPC-H.
Last time I said we had extended SPARQL for sub-queries. As a preview of the new functionality, let us look at a query from TPC-H.
Below is the Virtuoso SPARQL version of Q2.
sparql
define sql:signal-void-variables 1
prefix tpcd: <http://www.openlinksw.com/schemas/tpcd#>
prefix oplsioc: <http://www.openlinksw.com/schemas/oplsioc#>
prefix sioc: <http://rdfs.org/sioc/ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
select ?supp+>tpcd:acctbal,
       ?supp+>tpcd:name,
       ?supp+>tpcd:has_nation+>tpcd:name as ?nation_name,
       ?part+>tpcd:partkey,
       ?part+>tpcd:mfgr,
       ?supp+>tpcd:address,
       ?supp+>tpcd:phone,
       ?supp+>tpcd:comment
from <http://example.com/tpcd>
where
  {
    ?ps a tpcd:partsupp ;
        tpcd:has_supplier ?supp ;
        tpcd:has_part ?part .
    ?supp+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' .
    ?part tpcd:size 15 .
    ?ps tpcd:supplycost ?minsc .
    {
      select ?p min(?ps+>tpcd:supplycost) as ?minsc
      where
        {
          ?ps a tpcd:partsupp ;
              tpcd:has_part ?p ;
              tpcd:has_supplier ?ms .
          ?ms+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' .
        }
    }
    filter (?part+>tpcd:type like '%BRASS')
  }
order by desc (?supp+>tpcd:acctbal)
         ?supp+>tpcd:has_nation+>tpcd:name
         ?supp+>tpcd:name
         ?part+>tpcd:partkey
;
Note the pattern
{ ?ms+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' }
which is a shorthand for
{ ?ms tpcd:has_nation ?t1 . ?t1 tpcd:has_region ?t2 . ?t2 tpcd:name "EUROPE" }
Also note a sub-query is used for determining the lowest supply cost for a part.
The SQL text of the query can be found in the TPC-H benchmark specification, reproduced below:
select
  s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from
  part, supplier, partsupp, nation, region
where
  p_partkey = ps_partkey
  and s_suppkey = ps_suppkey
  and p_size = 15
  and p_type like '%BRASS'
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'EUROPE'
  and ps_supplycost =
    ( select
        min(ps_supplycost)
      from
        partsupp, supplier, nation, region
      where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE' )
order by
  s_acctbal desc, n_name, s_name, p_partkey;
For brevity we have omitted the declarations for mapping the TPC-H schema to its RDF equivalent. The mapping is straightforward, with each column mapping to a predicate and each table to a class.
This is now part of the next Virtuoso Open Source cut, due around next week.
As of this writing, we are going through TPC-H query by query, testing with the mapping going to Virtuoso and Oracle databases.
Also, we have been busy measuring Virtuoso 6. Even after switching from 32-bit to 64-bit IDs for IRIs and objects, the new databases are about half the size of the same Virtuoso 5.0.2 databases. This does not include any stream compression like gzip for disk pages. The load and query speeds are higher because of the better working set. With everything in memory, they are about even with 5.0.2. So now on an 8G box, we load 1067 million LUBM triples at 39.7 Kt/s instead of 29 Kt/s with 5.0.2. Right now we are experimenting with clusters on Amazon EC2. We'll write about that in a bit.
At this close of the year, I'll give a little recap over the past year in terms of Virtuoso development and a look at where we are headed for 2008.
A year ago, I was in the middle of redoing the Virtuoso database engine for better SMP performance. We redid the way traversal of index structures and cache buffers was serialized for SMP and generally compared Virtuoso and Oracle engines function by function. We had just returned from the ISWC 2006 in Athens, Georgia and the Virtuoso database was becoming a usable triple store.
Soon thereafter, we confirmed that all this worked when we put out the first cut of DBpedia with Chris Bizer et al and were working with Alan Ruttenberg on what would become the Banff health care and life sciences demo.
The WWW 2007 conference in Banff, Canada, was a sort of kick-off for the Linking Open Data movement, which started as a community project under SWEO, the W3C interest group for Semantic Web Education and Outreach, and has gained a life of its own since.
Right after WWW 2007 the Virtuoso development effort split on two tracks, one for enhancing the then new 5.0 release and one for building a new generation of Virtuoso, notably featuring clustering and double storage density for RDF.
The first track produced constant improvements to the relational to RDF mapping functionality, SPARQL enhancements, and Redland-, Jena-, and Sesame-compatible client libraries with Virtuoso as a triple store. These things have been out with testers for a while and are all generally available as of this writing.
The second track started with adding key compression to the storage engine, specifically with regard to RDF, even though there are some gains in relational applications as well. With RDF, the space consumption drops to about half, all without recourse to any non-random-access-compatible compression like gzip. Since the start of August, we turned to clustering and are now code complete, pretty much with all the tricks one would expect: full-function SQL, taking advantage of colocated joins, and doing aggregation and generally all possible processing where the data is. I have covered details of this along the way in previous posts. The key point is that the thing is now written and works with test cases.
In late October, we were at the W3C workshop for mapping relational data to RDF. For us, this confirmed the importance of mapping and scalability in general. Ivan Herman proposed forming a W3C incubator group on benchmarking. Also, a W3C incubator group on relational to RDF mapping is being formed.
Now, scalability has two sides. One is dealing with volume and the other is dealing with complexity. Volume alone will not help if interesting queries cannot be formulated. Hence we recently extended SPARQL with sub-queries so that we can now express at least any SQL workload, which was previously not the case. It is sort of a contradiction in terms to say that SPARQL is the universal language for information integration while not being able to express, for example, the TPC-H queries. Well, we fixed this. A separate post will highlight how. The W3C process will eventually follow, as the necessity of these things is undeniable, on the unimpeachable authority of the whole SQL world. Anyway, for now, SPARQL as it is ought to become a recommendation, and extensions can be addressed later.
For now, the only RDF benchmark that seems to be out there is the loading part of the LUBM. We did a couple of enhancements of our own for that just recently but much bigger things are on the way. Also, the billion triples challenge is an interesting initiative in the area. We all recognize that loading any number of triples is a finite problem with known solutions. The challenge is running interesting queries on large volumes.
Our present emphasis is demonstrating both RDF data warehousing and RDF mapping with complex queries and large data. We start with the TPC-H benchmark, doing the queries both through mapping to SQL against any RDBMS (Oracle, DB2, Virtuoso, or other) and by querying the physical RDF rendition of the data in Virtuoso. From there, we move to querying a collection of RDBMSs hosting similar data.
Doing this with performance at the level of direct SQL in the case of mapping, and not very much slower with physical triples, is an important milestone on the way to a real-world enterprise data web. Real life has harder and more unexpected issues than a benchmark, but at any rate doing the benchmark without breaking a sweat is a step on the way. We sent a paper to ESWC 2008 about that but it was rather incomplete. By the time of the VLDB submission deadline in March we'll have more meat.
Another tack soon to start is a rearchitecting of Zitgist around clustered Virtuoso. Aside from matters of scale, this will make a number of qualitatively new things possible. Again, more will be released in the first quarter of '08.
Beyond these short- and mid-term goals we have the introduction of entirely dynamic and demand-driven partitioning, a la Google Bigtable or Amazon Dynamo. Regular partitioning will do for a while yet, but this is the future when we move to the vision of linked data everywhere.
In conclusion, this year we have built the basis and next year is about deployment. The bulk of the really new development is behind us and now we start applying it. Also, the community will find adoption easier due to our recent support of the common RDF APIs.
As part of the recent conversation on benchmarking RDF stores, we re-ran the LUBM 8000 load test (1067 million triples) with the current Virtuoso.
We did it on two different machines, one with 2 x Xeon 5130 2GHz and 8G RAM and one with 2 x Xeon 5330 2GHz and 16G RAM. Both had 6 x 7800 rpm SATA-2 drives. The load rate on the 16G configuration was 36.8 Ktriples per second. The load rate on the 8G configuration was 29.7 Ktriples per second. Both loads were made using 6 concurrent load streams. Small changes to the numbers may follow later as a result of further tuning.
The Virtuoso version was 5.0, in the update to be released in the week of Dec 10, 2007. This is an incremental release of Virtuoso 5.0 and has the same engine as the prior 5.0 releases, with some optimizations for RDF loading and diverse bug fixes, notably in RDF mapping of relational data. This release will be further described in a separate post.
The load does not include forward chaining, but then Virtuoso supports sub-class and sub-property inference without materializing the entailed triples.
Most of the LUBM entailed triples represent sub-classes and sub-properties. The LUBM query and forward chaining side deserves a separate treatment but this is for another time.
Most recent posts on this blog refer to Virtuoso 6, which is presently under development. We will publish results with the 6.0 engine later. Also, further enhancements to triple store performance will take place on the Virtuoso 6 platform.
This post complements the previous post on a social web oriented RDF store benchmark. This is a draft outline of a benchmark for mapping relational data to RDF. This is meant for mapping technologies and relies on external relational databases for all storage.
The scenario is meant to capture the case of an existing database infrastructure being translated into RDF for ad hoc query and integration. This may serve to publish some of the data, such as product catalog information and order status, as an outside-accessible SPARQL endpoint. The RDF modeling of the data may also facilitate in-house analytics. The tentative workload is intended to represent both aspects.
What follows is very tentative; all imaginable caveats apply.
To avoid reinventing the wheel, we start with the TPC-C and TPC-H databases. The C database is an online order processing system and the H database is a data warehouse of orders and shipments. Both databases have orders, products, and customers but are clearly not meant to be joined together. The keys for the basic entities consist of different numbers of parts, for example.
Basing this benchmark on the TPC work saves the design of the schemas and use cases.
Given this, what could we measure? From the RDF angle, we are not concerned with update transactions.
The RDF mapping benchmark must expose any information that could be had from the relational sources by themselves plus demonstrate a layer of unification that facilitates querying without significant performance overhead.
First, let us see about preserving existing functionality:
The TPC-H queries cannot be represented with SPARQL since they all involve GROUP BY. They can however be made into parameterized views of some sort, and these views could be mapped into RDF. For example, Q1 is a report on order lines grouped by shipped and returned status over a period of time. The period of time does not figure in the result set. One cannot parameterize the view by just specifying a range for the time, as in
SELECT * FROM q1_view WHERE dt BETWEEN ? AND ?
There simply is no time in the output. If the time were added as a min and max of times, then one could have a condition on this and the query processor could infer that only order lines from between these times would be considered. In practice, we expect a parametric view of some vendor dependent type to be used.
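For concreteness, here is one way such a parametric view might be packaged, as a table-returning procedure over the TPC-H lineitem table; the name q1_report and the procedure dialect are illustrative assumptions only, since each vendor has its own mechanism.

-- Illustrative sketch, not any vendor's actual syntax: a Q1-style report
-- parameterized by the shipping date range that Q1 itself does not output.
CREATE PROCEDURE q1_report (IN dt1 DATE, IN dt2 DATE)
BEGIN
  SELECT l_returnflag, l_linestatus,
         SUM (l_extendedprice) AS sum_price,
         COUNT (*)             AS line_count
    FROM lineitem
   WHERE l_shipdate BETWEEN dt1 AND dt2
   GROUP BY l_returnflag, l_linestatus;
END;

A mapping layer would then expose each row of q1_report as an instance of a q1_report class, which is what the SPARQL fragment below assumes.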
How would one invoke a parameterized view from SPARQL? Like
{ ?r1 a q1_report . ?r1 start_time ?dt1 . ?r1 end_time ?dt2 . ?r1 returnflag ?rf . ?r1 extended_price ?price . }
So ?r1 would be instantiated to each row of the report view, and the start and end times would be considered specially, requiring ?dt1 and ?dt2 to be bound.
In this way RDF could be used for accessing the TPC-H queries and for specifying values for the substitution parameters. How each implementation achieves this is left to their discretion.
Now, let us see what more is enabled by RDF.
As said, the C and H schemas were designed separately and are not intended for joining. A mapping between C and H is conceptually meaningful between customer, geographic area and product. Orders and order lines exist in both. For purposes of our benchmark, we consider the C database to be the recent history and the H database the longer term history, with possible overlap between the two.
To make things more interesting, we could have multiple separately populated C and H databases, for example split according to product type or geographical area. We may consider this later; for now let us just consider one C database and one H database. We will assume that these are hosted by different DBMS instances and that the DBMS managing one cannot join with the other within a single SQL query. Thus, distributed joining is to be accomplished by the RDF layer.
The queries to be answered are among the following:
Browsing - Any primary key is to be an IRI. Any foreign key relation is to be navigable through a predicate representing this. Take
?customer has_order ?order
This is to be mapped to both customer to order relations, from the C as well as the H databases. Thus the query
?c has_name "ACME" . ?c has_order ?o . ?o has_line ?l . ?l has_product ?p . ?p has_name ?n .
should get all names of products ever ordered by ACME in either database.
The has_order and other predicates will have sub-predicates specifically referring to the customer-to-order relationship in one of the databases.
Periodically any of the TPC-H queries. This is evaluated by the RDBMS and the RDF layer is not expected to affect the performance. This is included simply in order to demonstrate that this can be done.
Advertising - Given the customer's purchase history, the query selects advertising to show when the user connects to the E-commerce web site running off the C database.
Product family - The products in both databases are classified by a SKOS vocabulary. Both databases can list products based on queries using the SKOS terminology. The SKOS taxonomy is mapped into the different fields of the source product tables, to be specified.
Product recommendations - Find customers with a similar purchasing history and recommend products, excluding ones the user has already purchased
Ad-hoc describe of product or customer, fetching information from both databases. For example, get the shipping delay of orders of products of category c now and this time last year.
Faceted browsing of products and recent orders. Start with a category and drill down into individual orders and customers. This would represent a customer browsing the catalog and personal order information through a web user interface.
Extraction - Given a time interval, retrieve all orders and order lines placed within this interval, as well as any customers, products, and suppliers reachable from these orders, and export them as RDF text. The rationale is to load the data into an RDF warehouse, presumably for off-line joining with other data sets.
We cannot here go into the details of the workload, and ad hoc experimentation will be needed to see what interesting queries can be made against the data and what other data structures are needed for supporting these queries, for example a cube of pre-computed totals for faceted drill-down.
The databases should be scaled so that the H database has 1.5x as many products as the C database, since there may be discontinued products or versions that still figure in the history but are no longer stocked.
The H database should have the same number of customers as the C database.
The H database should have 18x the number of orders. If the C database covers the last two months, roughly the longest delay between order and shipping, and the H database has 3 years of history, the ratio is 36 months / 2 months = 18.
Since the primary keys in the databases are different, we use the customer names and phone numbers to establish identity.
Products are joined based on product name.
The data must be tweaked so that these will match for a sufficient number of rows.
We expect custom code to be required for most of the above tasks. All such code should be disclosed.
The metric should not recapture the TPC-H metric. Hence the H queries per se are not measured; they are rather something that must be possible through the RDF interface.
It is reasonable to publish the run time of each of the 22 queries when submitted as SQL and when submitted through the RDF view interface. We do not expect great divergence.
Two actual metrics can be thought of: one for the extraction and the other for short ad hoc lookups.
We expect that persistent structures not present in the C and H databases will be needed for supporting the queries. Implementing these should not be predicated on having write access to the relational databases past the stage of setting up the benchmark data sets.
The costing should not include the C and H databases and supporting systems. These will already be paid for in the usage scenario. Any extra hardware and software should be included. The major part of the cost is expected to be in terms of custom coding and configuration. These should be reported as lines of code, specifying lines per each language used. Monetary cost of these will depend on the proficiency of the parties with the technologies and cannot be accurately measured.
Arising from the recent W3C workshop on mapping relational data to RDF, there is some discussion on starting a benchmarking oriented experimental group under the W3C. I'll here make some comments on where this might fit and how this might serve our nascent industry.
To the public, basically any recipient of the semantic data web message, the benchmarking activity should communicate:
The semantic data web claims to
The benchmarking activity is to prove that this is not a pipe dream that Gartner Group forecast for 2027. Instead, there exists
To the general public, the message will be best delivered by the existence of online services that do interesting things with linked data, starting from search and going to more specialized derivative products of structured information on the web.
To those intending to apply some semantic data web things themselves, the benchmark activity should give a directory of products to look at. The reason why a benchmark suite backed by some industry consortium is useful is that it adds to the end user's confidence that the use case being measured is of somewhat general relevance and not just made to demonstrate any single product's strengths. Besides this, the TPC idea of disclosing scale, throughput, price per throughput, and date is fine because it makes for easy tabulation of results. The intricacies in the full disclosure are effectively masked, and it is my guess that very few read the actual full disclosures.
The inference that an evaluator draws from benchmark results is that some product figuring there consistently is somewhat serious and can be studied further. Being in the running is like a stamp of approval. The benchmarks are complex and the evaluator seldom goes to the trouble of really analyzing performance by individual query or transaction, even if these are and must be given. It is a bit like how Formula 1 viewers do not generally read the rules on car engines or aerodynamics, let alone understand their finer points.
For credibility to be thus given to products and hence the industry, we should just have a couple of well defined and agreed upon benchmarks, just like TPC.
The third public is the developer. As a DBMS developer, I am a great fan of TPC. The great benefit I derive from their work is that they give a test suite for measuring effects of code changes on performance. Also, assuming that the TPC workload mix is representative, it also allows ranking what optimizations are more important than others. Lastly, TPC gives a great way of describing results, e.g. changes resulting in x% improvement on throughput of y. In such usage, the benchmarks are pretty much never run by the rules but results obtained are still good for internal comparison.
Communication about IS should allow for short, simple messages: Release XX Halves Price per Throughput.
The existence of benchmarks is, if not absolutely necessary, then at least a great help for such communication. Besides, people are culturally used to all kinds of racing and sports results so this is even a familiar format.
Now the TPC is also not perfect. In the high end, the measured configurations are so large that one does not see them very often in practice. It is like the techno sports of Formula 1 or America's Cup. Interesting for the curiosity value but not immediately relevant to the regular car buyer or weekend yachtsman. Further, sponsoring a by-the-book audited TPC result is not so simple. Not as expensive as putting out an America's Cup challenge but still some trouble and expense.
So, for us to benefit by the benchmarking activity, we must find a group that can both agree and be somewhat representative. Then we must put out a simple message: This here is for integration of relational sources and this here for storage and query of RDF.
Furthermore, insofar as we derive from relational or similar sources, the technology should not do less than the established alternative. Doing less sends the wrong message.
Entering the running should not be overly difficult for vendors, hence we should not have too many benchmarks and the ones that there are should be representative and sufficiently varied workloads. The results should be compact and easy to state. One more reason why I like TPC's work is the fact that the benchmarks have an easy to understand, unified use case behind them. Approximately what is done in each becomes clear from a very short and succinct description even though the details can be complex. I suspect this is one side of their appeal. I would venture the guess that a single use case story is easier to sell than a composite metric of disparate tests. Also in the scientific computing world, we have use cases, like NAS for aerodynamics, so having a use case story is quite common and a factor for making a benchmark's relevance understandable.
Is this all possible?
To play the devil's advocate, I could say that the use cases are not as well settled as the relational ones, hence formulating a generally representative benchmark is not possible. Now this is certainly not a message that this community wishes to send. Besides, there are decades' worth of history with the problems of information integration and a great deal of RDF data out there, even a compilation of dozens of industry use cases by SWEO, so we are not exactly in the dark here.
Can there be political agreement in reasonable time? If we look at the TPC as a precedent, judging by the rate of publication and revision, the process is not exactly quick. Now, for the TPC, it does not have to be. Judging by the frequency of published test results, hardware vendors are happy enough to have a forum to show off and do so at every turn.
Now we are not at this stage of maturity yet.
Composing a TPC-style test spec is possible in a reasonable time for an individual but likely not for a committee. It is quite voluminous but also quite formulaic. While TPC's material is their own, I see no reason that we could not reference or link to it where applicable.
Who would be motivated by such activity? How to pitch the activity to would-be participants? I don't think that just talking about what to measure and how is interesting enough. This is covered ground. Vendors want to promote themselves and end users want to have vendors compete at solving their problems. Or so it would be in a simpler world.
Personally, I'd like to see a benchmark with a use case story people can relate to emerge in the next few months. Now I am not necessarily holding my breath waiting for this. For purposes of ongoing development, there is the real data out there, and we can, for example, run the social web workload mix I suggested a couple of blog posts back; that is good enough for us. But it is not good enough for the industry's messaging.
I'd say that we have to assume that people play in good faith and simply ask who wants to run and get an extra edge by being in on the design of the race track. By good faith I here mean a sincere wish to have the race take place in the first place.
The sport is exciting for the players and spectators alike if there is a use case story that they can relate to and an actual tournament. So this is what we should aim for. Because this is so far a niche public, we should not fragment the activity too much and we should consider how understandable and relevant the benchmark activity is to likely semantic data web adopters.
Elaborating on my previous post, as food for thought for an RDF store benchmarking activity under the W3C, I present the following rough sketch. At the end, I propose some common business questions that should be answered by a social web aggregator.
The problem with these is that it is not really possible to ask interesting questions over a large database without involving some sort of counting and grouping. I feel that we simply cannot make a representative benchmark without these, quite regardless of the fact that SPARQL in its present form does not have these features. Hence I have simply stated the questions and left any implementation open. If this seems like an interesting direction, the nascent W3C benchmarking XG (experimental group) can refine the business questions, relative query frequencies, exact data set composition, etc.
by Orri Erling
This benchmark models the use of RDF for representing and analyzing the use of social software by user communities. The benchmark consists of a scalable synthetic data set, a feed of updates to the data set, and a query mix. The data set reflects the common characteristics of the social web, with realistic distribution of connections, user-contributed content, commenting, tagging, and other social web activities. The data set is expressed in the FOAF and SIOC vocabularies. The query mix is divided between relatively short, dashboard- or search-engine-style lookups, and longer-running analytics queries.
The system being modeled is an aggregator of social web content; we could liken it to an RDF-based Technorati with some extra features.
Users can publish their favorite queries or mesh-ups as logical views served by the system. In this manner, queries come to depend on other queries, somewhat like SQL VIEWs can reference each other.
There is a small qualification data set that can be tested against the queries to validate that the system under test (SUT) produces the correct results.
The benchmark is scaled by number of users. To facilitate comparison, some predefined scales are offered, i.e., 100K, 300K, 1M, 3M, 10M users. Each simulated user both produces and consumes content. The level of activity of users is unevenly distributed.
There are two work mixes — the browsing mix, which consists of a mix of lookups and contributing content, and the analytics mix, which consists of long-running queries for tracking the state of the network. For each 100 browsing mixes, one analytics mix is performed.
A benchmark run is at least 1h real-time in duration. The metric is calculated by the number of browsing mixes completed during the test window. This simulates 10% of the users being online at any one time, thus for a scale of 1M users, 100K browsing mixes will be simultaneously proceeding.
The test driver submits the work via HTTP. What load balancing or degree of parallel serving of the requests is used is left up to the SUT.
The metric is expressed as queries per second, taking the total number of queries executed by completed browsing mixes and dividing this by the real time of the measurement window; for example, 1,080,000 queries completed within a 3,600-second window would score 300. The metric is called qpsSW, for queries per second, social web. The cost metric is $/qpsSW, calculated by the costing rules of the TPC. If compute-on-demand infrastructure is used, the costing will be $/qpsSW/day.
The test sponsor is the party contributing the result. The contribution consists of the metric and of a full disclosure report (FDR), written following a template given in the benchmark specification. The disclosure requirements follow the TPC practices, including publishing any configuration scripts, data definition language statements, timing for warm-up and test window, times for individual queries etc. All details of the hardware and software are disclosed.
The software consists of the data generator and of a test driver. The test driver calls functions supplied by the test sponsor for performing the diverse operations in the test. Source code for any modifications of the test driver is to be published as part of the FDR.
Any hardware/software combination — including single machines, clusters, clusters rented from computer providers like Amazon EC2 — is eligible.
The SUT must produce correct answers for the validation queries against the validation data set.
The implementation of the queries is not restricted. These can be any SPARQL or other queries, application server based logic, stored procedures or other, in any language, provided that full source code is included in the FDR.
The data set is provided as serialized RDF. The means of storage are left up to the SUT. The basic intention is to use a triple store of some form, but the specific indexing, use of property tables, materialized views, and so forth, is left up to the test sponsor. All tuning and configuration is to be published in the FDR.
For each operation of each mix, the specification shall present:
The logical intent of the operation, the business question, e.g., What is the hot topic among my friends?
The question or update expressed in terms of the data in the data set.
Sample text of a query answering the question or pseudo-code for deriving the answer.
Result set layout, if applicable.
The relative frequencies of the queries are given in the query mix summary.
The browsing mix consists of the following operations:
Make a blog post.
Make a blog comment.
Make a new social contact.
For one new social contact, there are 10 posts and 20 comments.
What are the 10 most recent posts by somebody in my friends or their friends? This would be a typical dashboard item.
What are the authoritative bloggers on topic x? This is a moderately complex ad-hoc query. Take posts tagged with the topic, count links to them, take the blogs containing them, show the 10 most cited blogs with the most recent posts with the tag. This would be typical of a stored query, like a parameterizable report (a sketch in extended SPARQL follows after this list).
How do I contact person x? Calculate the chain of common acquaintances best for reaching person x. For practicality, we do not do a full walk of anything but just take the distinct persons within 2 steps of the user and within 2 steps of x and see the intersection.
Who are the people like me? Find the top 10 people ranked by count of tags in common in the person's tag cloud. The tag cloud is the set of interests and the set of tags in blog posts of the person.
Who reacts to or talks about me? Count of replies to material by the user, grouped by the commenting user and the site of the comment, top 20, sorted by count descending.
Who are my fans that I do not know? Same as above, excluding people within 2 steps.
Who are my competitors? Most prolific posters on topics of my interest that do not cite me.
Where is the action? On forums where I participate, what are the top 5 threads, as measured by posts in the last day. Show count of posts in the last day and the day before that.
How do I get there? Who are the people active around both topic x and y? This is defined by a person having participated during the last year in forums of x as well as of y. Forums are tagged by topics. The most active users are first. The ranking is proportional to the sum of the number of posts in x and y.
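As promised above, here is a sketch of the authoritative-bloggers lookup in SPARQL extended with aggregates; it covers only the citation-count part (the most-recent-posts part is left out for brevity), the topic IRI is a placeholder, and aggregates and GROUP BY are extensions to present-day SPARQL.

# Sketch only: the 10 most-cited blogs carrying posts tagged with a given topic.
PREFIX sioc: <http://rdfs.org/sioc/ns#>
SELECT ?blog (COUNT (?citing) AS ?citations)
WHERE
  {
    ?post   sioc:topic         <http://example.org/topics/x> .
    ?post   sioc:has_container ?blog .
    ?citing sioc:links_to      ?post .
  }
GROUP BY ?blog
ORDER BY DESC (?citations)
LIMIT 10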
These queries are typical questions about the state of the conversation space as a whole and can for example be published as a weekly summary page.
The fastest propagating idea - What is the topic with the most users who have joined in the last day? A user is considered to have joined if the user was not discussing this in the past 10 days.
Prime movers - What users start conversations? A conversation is the set of material in reply to or citing a post. The reply distance can be arbitrarily long; the citing distance is a direct link to the original post or a reply to it. The number and extent of conversations contribute towards the score.
Geography - Over the last 10 days, for each geographic area, show the top 50 tags. The location is the location of the poster (a simplified sketch in extended SPARQL follows after this list).
Social hubs - For each community, get the top 5 people who are central to it in terms of number of links to other members of the same community and in terms of being linked from posts. A community is the set of forums that have a specific topic.
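As a simplified sketch of the geography summary, again in SPARQL with aggregate extensions: the per-area top-50 cutoff is omitted, and the located_in property, the ex: prefix, and the date cutoff are assumptions for the example.

# Sketch only: tag usage counts per geographic area over a recent window.
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX ex:   <http://example.org/social#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
SELECT ?area ?tag (COUNT (?post) AS ?uses)
WHERE
  {
    ?post sioc:topic       ?tag .
    ?post sioc:has_creator ?user .
    ?user ex:located_in    ?area .
    ?post dct:created      ?dt .
    FILTER (?dt >= "2007-12-01T00:00:00"^^xsd:dateTime)
  }
GROUP BY ?area ?tag
ORDER BY ?area DESC (?uses)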
We have now arrived at the RDF specific parts of the Virtuoso cluster effort.
This has to do with special tricks for dealing with the mapping of IRIs and object values and their internal IDs. A quad will refer to the graph, subject, predicate, and object by an internal ID. The object, if short enough, can be inlined so that no join is needed to get its external form in most cases.
Now this is old stuff.
Clustering does not change this, except that now the tables for the mappings between IDs and their external form are partitioned. So more often than not, getting the string for the ID involves an RPC. The most recently used mappings are of course cached outside of the table but still, having a network round trip each time an IRI is returned for the first time in a while is no good.
The solution is the same as always, namely doing one round trip per partition, with all the IDs concerned in a single message. The same applies to reading documents, where strings are translated to IDs and new IDs are made, only now we have a distributed read-write transaction.
In terms of programming model we have a general purpose partitioned pipe. One might think of this in terms of a single control structure combining map and reduce.
Now consider returning results for a text search:
SELECT doc_id, summary (doc_id, 'search pattern') FROM (SELECT TOP 10 doc_id FROM docs WHERE CONTAINS (text, 'search pattern', score) ORDER BY relevance (doc_id, score, 'search pattern') DESC) f;
The summary and relevance are each a function of the doc_id and search pattern. The relevance function would typically access some precomputed site rank and combine this with a hit score that is a function of the word frequencies in the individual hit.
Now we are not specifically in the business of text search but this well known example will serve to make a more general point.
How does a query like this work when everything is partitioned?
The text index (inverted file from word to positions in each document) can be split by word or by document ID or both. In either case, it will produce the hits fairly quickly, partitioning or not. The question is sorting the million hits according to criteria that involve joining with document or site metadata. To produce fast response, this loop must run in parallel but then each score is independent, so no problem. Finally, summaries have to be made only for the items actually returned and again each summary is independent.
A text search engine does not depend on a general purpose query processor for scheduling things in this way.
The case is different with RDF where we must do basically the same things with ad-hoc queries. As it happens, the above sample query will run as described if only the summary and relevance functions are properly declared.
So we have a special declaration for a partitioned function call. Further, the partitioned function call (in this case summary and relevance) will dispatch according to a given index, thus going to run on a node hosting the actual data. This is like the map part of map-reduce. But this is not all. The functions can return either a final value or a next step. This can be regarded as a second map or in some cases a reduce step. The next step is another partitioned function that gets the output of the previous one as its input and may use the same or different partitioning key.
Now the functions can be large or small, including very small, like a single index lookup, where the RPC delay is an order of magnitude greater than the time to perform the function. The partitioned pipe manages batching these together and overlapping processing on all nodes. Plus the work can be transactional, all bound in a single distributed transaction or it can be each task for itself with error retries etc, as in the usual map-reduce situation where relatively large tasks are sent around.
At present, we have all this implemented and we are running tests with large RDF data sets on clustered Virtuoso.
I was recently in Boston for the Mapping Relational Data to RDF workshop of the W3C.
The common feeling was that mapping everything to RDF and querying it in terms of a generic domain ontology, mapped on demand into whatever line of business systems, would be very good if it only could be done. However, since this is not so easily done, the next best is to extract the data and then warehouse it as RDF.
The obstacles perceived were of the following types:
Lack of quality in the data. The different line of business systems do not in and of themselves hold enough semantics. If the meaning of data columns in relational tables were really known and explicit, these could be meaningfully used for joining across systems. But this is more complex than just mapping the metal lead to the chemical symbol Pb and back.
Lack of performance in RDF storage. Data sets even in the tens-of-millions of triples do not run very well in some stores. Well, we had the Banff life sciences demo with 450M triples in a small server box running Virtuoso, so this is not universal, plus of course we are coming up with a whole different order of magnitude, as often discussed on this blog.
Lack of functionality in mapping and possibly lack of pushing through enough of the query processing to the underlying data stores.
Personally, I am quite aware of what to do with regard to performance of mapping and storage, and see these as eminently solvable issues. After all, we have a great investment of talent in databases in general and it can be well deployed towards RDF, as we have been doing these past couple of years. So we talk about the promise of a 360-degree view of information, with RDF being the top layer. Everybody agrees that this is a nice concept. But this is a nice concept especially when it can do the things that are the most common baseline expectation of any regular DBMS, i.e., aggregation, grouping, sub-queries, VIEWs. Now, I would not go sell a DBMS that has no COUNT operator to a data warehousing shop.
The fact that OpenLink and Oracle allow RDF inside SQL, and OpenLink even adds native aggregates and grouping to SPARQL, fixes the problem with regard to specific products, but leaves the standardization issue open. Of course, any vendor will solve these questions one way or another because a database with no aggregation is a non-starter.
I talked to Lee Feigenbaum, chair of the W3C DAWG, about the question of aggregates and general BI capabilities in SPARQL. He told me that, prior to his time with the DAWG, these were left out because they conflicted with the open-world assumption around RDF: You cannot count a set because by definition you do not know that you have all the members, the world being open and all that.
Say what? Talk about the road to hell being paved with good intentions. Now, this is in no way Lee's or the present day DAWG's fault; as a member myself, I can attest to the good work and would under no circumstances wish any delays or revisions to SPARQL at this point. I am just pointing out a matter that all implementations should address, as a sort of precondition of entry into the real world IS space. If this can be done interoperably, so much the better.
Now, out of the deliberations at the Boston workshop arose at least two ideas for follow-up activity.
The first was an incubator group for RDF store and mapping benchmarking. This is very appropriate in order to dispel the bad name RDF storage and querying performance has been saddled with. As a first step in this direction, I will outline a social web oriented benchmark on this blog.
The second activity was an incubator group for preparing standardization of mapping methodologies from relational schemas to RDF. We will be active on this as well.
The two offshoots appear logically separate but are not necessarily so in practice. A benchmark is after all something that is supposed to promote a technology to a user base. The user base seems to wish to put all online systems and data warehouses under a common top level RDF model and then query away, introducing no further replication of data or performance cost or ETL latencies.
Updating would also be nice but even query-only would be very good. Personally, I'd say the RDF strength is all on the query side. Transactions are taken care of well enough by what there already is; RDF stands out in integration and the ad-hoc and discovery side of the matter. Given this, we expect the value to be consumed in a heterogeneous, multi-database, federated environment. Thus a benchmark should measure this aspect of the use-case. With the right mapping and queries, we could probably demonstrate the added cost of RDF to be very low, as long as we could push all queries that can be answered by a single source to the responsible DBMS. For distributed joins, we are back at the question of optimizing distributed queries, but this is a familiar one and RDF is not the principal cost factor.
The subject does become quite complex at this point. We would have to take supposedly representative synthetic OLTP and BI data sets (like the ones in TPC-D, TPC-E, and TPC-H), and invent queries across them that would both make sense and be implementable in SPARQL extended with aggregates and sub-queries. Reliance on SPARQL extensions is simply unavoidable. Setting up the test systems would be non-trivial, even though there is a lot of industry experience in these matters on the database side.
So, while this is probably the benchmark most relevant to the target audience, we may have to start with a simpler one. I will next outline something to that effect.
I recall a quote from a stock car racing movie.
"What is the necessary prerequisite for winning a race?" asked the racing team boss.
"Being the fastest," answered the hotshot driver, after yet another wrecked engine.
"No. It is finishing the race."
In the interest of finishing, we'll now leave optimizing the cluster traffic and scheduling and move to completing functionality. Our next stop is TPC-D. After this comes TPC-C, which adds the requirement of handling distributed deadlocks. After that, we add RDF-specific optimizations.
This will be Virtuoso 6 with the first stage of clustering support. This is with fixed partitions, which is just like a single database, except it runs on multiple machines. The stage after this is Virtuoso Cloud, the database with all the space filling properties of foam, expanding and contracting to keep an even data density as load and resource availability change.
Right now, we have a pretty good idea of the final form of evaluating loop joins in a cluster, which after all is the main function of the thing. It makes sense to tune this to a point before going further. You want the pipes and pumps and turbines to have known properties and fittings before building a power plant.
To test this, we took a table of a million short rows and made one copy partitioned over 4 databases and one copy with all rows in one database. We ran all the instances in a 4 core Xeon box. We used Unix sockets for communication.
We joined the table to itself, like SELECT COUNT (*) FROM ct a, ct b WHERE b.row_no = a.row_no + 3. The + 3 causes the joined rows never to be on the same partition.
With cluster, the single operation takes 3s and with a single process it takes 4s. The overall CPU time for cluster is about 30% higher, some of which is inevitable since it must combine results, serialize them, and so forth. Some real time is gained by doing multiple iterations of the inner loop (getting the row for b) in parallel. This can be further optimized to maybe 2x better with cluster but this can wait a little.
Then we make a stream of 10 such queries. The stream with cluster is 14s; with the single process, it is 22s. Then we run 4 streams in parallel. The time with cluster is 39s and with a single process 36s. With 16 streams in parallel, cluster gets 2m51 and single process 3m21.
The conclusion is that clustering overhead is not significant in a CPU-bound situation. Note that all the runs were at 4 cores at 98-100%, except for the first, single-client run, which had one process at 98% and 3 at 32%.
The SMP single process loses by having more contention for mutexes serializing index access. Each wait carries an entirely ridiculous penalty of up to 6µs or so, as discussed earlier on this blog. The cluster wins by less contention due to distributed data and loses due to having to process messages and remember larger intermediate results. These balance out, or close enough.
For the case with a single client, we can cut down on the coordination overhead by simply optimizing the code some more. This is quite possible, so we could get one process at 100% and 3 at 50%.
The numbers are only relevant as ballpark figures and the percentages will vary between different queries. The point is to prove that we actually win and do not jump from the frying pan into the fire by splitting queries across processes. As a point of comparison, running the query clustered just as one would run it locally took 53s.
We will later look at the effects of different networks, as we get to revisit the theme with some real benchmarks.
I just read Google's Bigtable paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines.
I have talked about partitioning versus distributed cache in the second to last post. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be.
So Oracle has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition.
I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the data and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too.
So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course.
Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. The tree top must be common knowledge, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time.
So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found.
The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server's file group gets too big, it splits. One half of each key's partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning information on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there.
If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically.
If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines.
So what happens to disk locality when file groups split? Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc.
Now let us suppose the first partition, the even K's, splits. It could split so that multiples of 4 go one way and the rest another way. Now we'd have 0-1K in place, 2K-3K in the new partition, 4K-5K in place, and so on. A sequential disk read, with some read-ahead, would scan the partitions in parallel, but the disk access would be made sequential by the read-ahead logic — remember that these are controlled by the same server process.
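As a minimal sketch of the arithmetic, assuming a table ct with an integer key row_no as in the earlier join test (the 1024-value block size and the partition counts are just for illustration, and the FLOOR is there because SQL dialects differ on integer division):

-- Sketch: hash on the key with the low 10 bits dropped, so 1024 consecutive
-- values (one block) always land together and key compression is preserved.
-- Two-way scheme: even blocks in one partition, odd blocks in the other.
-- After the even partition splits, blocks with block MOD 4 = 0 stay put and
-- blocks with block MOD 4 = 2 move to the new file group; odd blocks are untouched.
SELECT row_no,
       MOD (FLOOR (row_no / 1024), 2) AS two_way_partition,
       MOD (FLOOR (row_no / 1024), 4) AS four_way_partition
  FROM ct;

The point is that a split only refines the block-to-partition mapping; it never breaks up a run of consecutive keys.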
For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change.
Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference.
For purposes of transactions, we might view a file group as deserving a branch of its own. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive.
Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs.
Implementing this on the current Virtuoso is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal.
Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.
I wrote the basics of the Virtuoso clustering support over the past three weeks. It can now manage connections, decide where things go, do two-phase commits, and insert and select data from tables partitioned over multiple Virtuoso instances. It works well enough to be measured, of which I will blog more over the next two weeks.
I will in the following give a features preview of what will be in the Virtuoso clustering support when it is released in the fall of this year (2007).
A Virtuoso database consists of indices only, so that the row of a table is stored together with the primary key. Blobs are stored on separate pages when they do not fit inline within the row. With clustering, partitioning can be specified index by index. Partitioning means that values of specific columns are used for determining where the containing index entry will be stored. Virtuoso partitions by hash and allows specifying what parts of partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string. Like this, key compression gains are not lost by storing consecutive values on different partitions.
Once the partitioning is specified, we specify which set of cluster nodes stores this index. Not every index has to be split evenly across all nodes. Also, all nodes do not have to have equal slices of the partitioned index, accommodating differences in capacity between cluster nodes.
Each Virtuoso instance can manage up to 32TB of data. A cluster has no definite size limit.
When data is partitioned, an operation on the data goes where the data is. This provides a certain natural parallelism but we will discuss this further below.
Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load. Some data, such as database schema, is replicated on all nodes. When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition. If this is the case, updates go to all nodes and reads go to a randomly picked node from the group.
If one of the nodes in the group fails, operation can resume with the surviving node. The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure and again at the time of the failed node rejoining the cluster but these are aborts as in the case of deadlock and lose no committed data.
The Virtuoso architecture does not require a SAN for disk sharing across nodes. This is reasonable since a few disks on a local controller can easily provide 300MB/s of reads, and passing this over an interconnect fabric that also has to carry inter-node messages could saturate even a fast network.
A SQL or HTTP client can connect to any node of the cluster and get an identical view of all data with full transactional semantics. DDL operations like table creation and package installation are limited to one node, though.
Applications such as ODS will run unmodified. They are installed on all nodes with a single install command. After this, the data partitioning must be declared, which is a one time operation to be done cluster by cluster. The only application change is specifying the partitioning columns for each index. The gain is optional redundant storage and capacity not limited to a single machine. The penalty is that single operations may take a little longer when not all data is managed by the same process but then the parallel throughput is increased. We note that the main ODS performance factor is web page logic and not database access. Thus splitting the web server logic over multiple nodes gives basically linear scaling.
Message latency is the principal performance factor in a clustered database. Due to this, Virtuoso packs the maximum number of operations into a single message. For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the inner-loop lookups are run in parallel. So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency. If each step of the join gets 10 rows, for a total of 100,000 results, the message latency is not a significant factor and the cluster will clearly outperform a single node.
Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up even if doing a single transaction were faster on a single node.
There are SQL extensions for stored procedures allowing parallelizing operations. For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned. Transactional semantics are kept but error detection is deferred to the actual execution.
Each transaction is owned by one node of the cluster, the node to which the client is connected. When more than one node besides the owner of the transaction is updated, two phase commit is used. This is transparent to the application code. No external transaction monitor is required, the Virtuoso instances perform these functions internally. There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction waiting information.
Since read transactions can operate without locks, reading the last committed state of uncommitted updated rows, waiting for locks is not very common.
Virtuoso uses TCP to connect between instances. A single instance can have multiple listeners at different network interfaces for cluster activity. The interfaces will be used in a round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface. Long messages, such as transfers of blobs are done on a separate thread, thus allowing normal service on the cluster node while the transfer is proceeding.
We will have to test the performance of TCP over Infiniband to see if there is clear gain in going to a lower level interface like MPI. The Virtuoso architecture is based on streams connecting cluster nodes point to point. The design does not per se gain from remote DMA or other features provided by MPI. Typically, messages are quite short, under 100K. Flow control for transfer of blobs is however nice to have but can be written at the application level if needed. We will get real data on the performance of different interconnects in the next weeks.
Configuring is quite simple, with each process sharing a copy of the same configuration file. One line in the file differs from host to host, telling it which one it is. Otherwise the database configuration files are individual per host, accommodating different file system layouts etc. Setting up a node requires copying the executable and two configuration files, no more. All functionality is contained in a single process. There are no installers to be run or such.
Changing the number or network interfaces of cluster nodes requires a cluster restart. Changing data partitioning requires copying the data into a new table and renaming this over the old one. This is time consuming and does not mix well with updates. Splitting an existing cluster node requires no repartitioning copy, but shifting data between partitions does.
A consolidated status report shows the general state and level of intra-cluster traffic as count of messages and count of bytes.
Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical.
The basics are now in place. Some code remains to be written for such things as distributed deadlock detection, 2-phase commit recovery cycle, management functions, etc. Some SQL operations like text index, statistics sampling, and index intersection need special support, yet to be written.
The RDF capabilities are not specifically affected by clustering except in a couple of places. Loading will be slightly revised to use larger batches of rows to minimize latency, for example.
There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins etc. These will be added gradually. These are however not really central to the first application of RDF storage but are quite important for business intelligence, for example.
We will run some benchmarks for comparing single host and clustered Virtuoso instances over the next weeks. Some of this will be with real data, giving an estimate on when we can move some of the RDF data we presently host to the new platform. We will benchmark against Oracle and DB2 later but first we get things to work and compare against ourselves.
We roughly expect a halving in space consumption, a significant increase in single-query performance, and linearly scaling parallel throughput through the addition of cluster nodes.
The next update will be on this blog within two weeks.
I recently read Oracle's papers about RAC, Real Application Clusters. This is relevant as we are presently working on the Virtuoso equivalent.
Caveat: The following is quite technical and not the final word on the matter.
Oracle's claim is roughly as follows: Take a number of machines with access to a shared pool of disks and get scalability in processing power and memory without having to explicitly partition the data or perform other complicated configuration.
This works through implementing a cache consistency protocol between the participating boxes and by parallelizing queries just as one would do on a shared memory SMP box. Each disk page has a box assigned to keep track of it and the responsibility migrates so that the box most often needing the page gets to be the page's guardian, so as not to have to ask anybody else for permission to write the page.
This is a compelling proposition. Surely, it must be unrealistic to expect people to manually partition databases. This would require some understanding of first principles which is scarce out there.
So, should we implement clustering a la Oracle?
Let's look at some basics. If we have an OLTP workload like TPC-C, we usually have affinity between clients and the data they access. This will make each client's pages migrate to be managed by the box the client is connected to. This will work pretty well, no worse than with a single box. If two clients are updating the same data but are connected to two different boxes, this is quite bad since the box that does not have responsibility for the page must ask the other box for write access. This is a round trip, at least tens of microseconds (µs). Consider in comparison that finding a row out of a million takes some 3µs.
Would it not be better to have each partition in a known place and leave all processing to that place? The write contention would be resolved in the box owning the partition and there would be a message but now for requesting the update, not dealing with cache consistency. At what level should one communicate between cluster nodes? Talk about disk pages or about logical operations? If there is complete affinity between boxes and data, the RAC style shared cache needs no messages, each box ends up managing the pages of its clients and all works just as with a local situation. If on the other hand any client will update any page at random, most updates must request the write permission from another node. I will here presume that index tree tops get eventually cached on all nodes. If this were not so, even index lookups would have to most often request each index page from a remote box. Never forget that with a tree index, it takes about 1µs to descend one level and 50µs for a message round trip between processes, not counting any transport latency.
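To make the arithmetic above concrete, here is a small back-of-the-envelope sketch; the per-level and round-trip figures are the ones quoted above, while the tree depth and the loop are purely illustrative assumptions, not Virtuoso code.

```python
# Rough cost of a single B-tree descent when some levels require asking a
# remote node for the page. Figures are the ones quoted in the text; the
# depth of 4 is an assumption for illustration.
LOCAL_LEVEL_US = 1.0    # descending one index level locally
ROUND_TRIP_US = 50.0    # one message round trip between processes

def lookup_cost_us(depth: int, remote_levels: int) -> float:
    local_levels = depth - remote_levels
    return local_levels * LOCAL_LEVEL_US + remote_levels * ROUND_TRIP_US

for remote in range(5):
    print(f"depth 4, {remote} remote pages: {lookup_cost_us(4, remote):.0f} us")
```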
What of RDF query workloads? After all, we are in the first instance concerned with winning the RDF storage race and after this the TPC ones. We design for both but do RDF first since this is our chosen specialty.
The disadvantage of having to specify partitioning is less weighty with RDF since there are only a few big tables and they will be at default settings, pretty much always. We do not expect the application developer to ever change these settings although it is in principle possible.
What about queries? The RDF workload is mostly random access and loop joins. How would these run on RAC? For now, let's make a thought experiment and compare cache fusion to a hash partitioned cluster. In the following, I do not describe how Oracle actually works but will just describe how I would do things if I implemented a RAC style clustering. With a RAC style cluster, I'd split the outer loop into equal partitions and run them in parallel on different boxes. Each would build a working set for its part of the query and pages that were needed by more than one box would be read once from disk and a second time from the node that had them in cache. The top nodes of index trees would end up cached on all boxes. It would seem that all boxes would fill their cache with the same data. Now it may be that RAC makes it so that a page is cached only on one box and other boxes wanting the page must go to that box to get access to the page. But this would be a disaster in index lookups. It is less than a microsecond per local index tree level, but if there is a round trip, it would be at least 50µs per level of the index tree. I don't know for sure about Oracle but if I did RAC, I'd have to allow duplicate read content in caches. This would have the effect that the aggregate cache size would be closer to the single cache size than to the sum of the cache sizes. A physically partitioned database would not ship pages so caches would not overlap and the aggregate cache would indeed be the sum of the sizes. Now this is good only insofar all boxes participate but with evenly distributed data this is a good possibility.
Of course, if RAC knew to split queries so that data and nodes had real affinity, the problem would be smaller. For indices, one would need a map of key values to boxes, a little like a top level index shared among all nodes. The key value would give the node, just like with partitioning.
This would amount to partitioning on the fly. Joins that were made the most frequently would cause migration that would make these joins co-located.
We must optimize the number of messages needed to execute long series of loop joins. For parallelizing single queries, the most obvious approach would be to partition the first/outermost loop that is more than one iteration into equal size chunks. With RDF data, the join keys will mostly begin with GS or GO, with a possible P appended. If GO or GS specify the partition, partitioning by hash will yield the node that will provide the result.
The number of messages can be reduced to a minimum of the number of join steps times (the number of boxes minus one) if the loops are short enough and multiple operations are carried by one message.
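As a rough illustration of what batched function shipping might look like, here is a minimal sketch that hashes on the leading key parts and groups lookups by destination node; the hash function, node count, and key format are assumptions chosen for illustration, not the actual Virtuoso partitioning scheme.

```python
from collections import defaultdict
from zlib import crc32

N_NODES = 10  # assumed cluster size

def node_of(g: str, s_or_o: str) -> int:
    # Partition by hashing the leading key parts (G plus S or O).
    return crc32(f"{g}|{s_or_o}".encode()) % N_NODES

def batch_by_node(g: str, keys):
    # Group lookups so that each destination node receives a single message.
    batches = defaultdict(list)
    for key in keys:
        batches[node_of(g, key)].append(key)
    return batches

subjects = [f"person{i}" for i in range(100)]     # rows from the outer loop
batches = batch_by_node("g1", subjects)
print(f"{len(batches)} messages instead of {len(subjects)}")
```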
With RAC style clustering, each index lookup would have to be sent to the node most likely to hold the answer. If pages have to be fetched from other nodes, we have disastrous performance, at least 50µs for each non-local page. If there are two non-local pages in a lookup, the overhead will exceed that of delegating the single lookup. Index page access in lookups cannot be easily batched the way index lookups going to the same node can be batched. Batching multiple, hopefully long, operations into a single message is the only way to defeat the extreme cost of sending messages. An index lookup does not know what page it will need until it needs it. A way of batching these would be to run multiple lookups in parallel and to combine remote page requests, grouping them by destination. This would not be impossible; we would simply have to run 100 index lookups in parallel on a thread: 100 first levels, 100 second levels, and so forth. Suppose an outer loop that gives 100 rows and then an inner loop that retrieves 1 row for each. A query to get the email address of Mary would do this, supposing 100 Marys in the db: {?person firstName "Mary" . ?person email ?m .}
Suppose a cluster of 10 nodes. The first node gets the 100 rows of the outer loop, splits these into 10x10, 10 per node, and then each node does 10 lookups in parallel, meaning 10 first levels, 10 second levels, 10 third levels. The index tree would be 4 deep, branching 300 ways. Running the query a second time would find all data in memory and run with only 18 messages after getting the 100 rows of the first loop. The first run would send lots of messages, almost two per page, for about 800 messages after getting the 100 rows of the first loop.
With partitioning, the situation would be a constant 18 messages: 9 batches of 10 index lookups and their replies. The latency is 50µs and the lookup is 4µs. We would in fact gain in real time: counting 50µs for the message round trip and 4µs per lookup, with the nodes working in parallel, the time through the whole exercise of 10x10 random lookups would be about 90µs.
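Replaying that arithmetic under the stated assumptions (100 outer rows, 10 nodes, 50µs per round trip, 4µs per lookup):

```python
NODES = 10
OUTER_ROWS = 100
ROUND_TRIP_US = 50.0
LOOKUP_US = 4.0

rows_per_node = OUTER_ROWS // NODES                     # 10 lookups per node
messages = 2 * (NODES - 1)                              # request + reply per remote node
elapsed_us = ROUND_TRIP_US + rows_per_node * LOOKUP_US  # nodes work in parallel

print(f"{messages} messages, about {elapsed_us:.0f} us for {OUTER_ROWS} lookups")
```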
If I did RAC style clustering, I'd have to allow replicating the tops of index trees to all caches, and I'd have to batch page request messages from index lookups, effectively doing the lookup vector processing style, meaning 100 first levels, 100 second levels etc. Given a key beginning, I'd have to know what node to send this to, meaning pretty much doing the first levels of the lookup before deciding where to send the lookup, only to have the lookup redone by the box ending up with the lookup. Doing things this way would make Oracle RAC style clustering work with the use case.
Given this, it appears that hash partitioning is easier to implement. Cache fusion clustering without the above mentioned gimmicks would be easiest of all but it would have a disastrous number of messages or it would fill all the caches with the same data. Avoiding this is possible but hard, as described above.
We will have to experiment with Oracle RAC itself a bit farther down the road. Deciding to use partitioning instead of cache fusion does bring along conversion cost and a very high cost for repartitioning.
Now let us look at the issue of co-location of joins. In a loop join this means that the node that holds the row from the outer loop also holds the row in the inner loop. For example, if order and order line are partitioned on order id, joining them on order id will be a co-located join. Such joins do not require messages in partitioned clusters. In RAC, they do not involve messages if the pages have migrated to be managed by the node doing the join; otherwise they do, up to 20 or so in the worst case.
Do we get any benefit from co-location with RDF? Supposing joins that go from S to O to S (e.g., population of the city where Mary lives), we do not get much guarantee of co-location.
Suppose the indices GSPO partitioned on GS and OGPS on OG: we know the box with the Marys, and then, based on GS, we'd know the box where the residence of each Mary was. Given the city as S, we would again know the box that had the population. All three triples could be on different boxes. This cannot be helped at design time. At run time this can be helped by batching messages that go to the same node. Let's see how this fans out. 100 Marys from the first node. To get the city, we get 10 batches of 10. We get the 100 cities and then we get their populations, again 10 batches of 10. In this scenario, the scheduling is centrally done by one thread. Suppose instead that it were done by the nodes handling the 10 batches of 10 for getting the city of each Mary. For its 10 cities, each node would issue 10 lookups for population, each potentially to a different node. For this case, managing the execution by one thread instead of several makes bigger batches and fewer messages, as one would expect.
It seems that with the RDF case, one may as well forget co-location. In the relational case, one must take advantage of it when there is co-location and, when there is not, try to compensate with longer batches of function shipping.
Excellent as some of the RAC claims are, it still seems that making it work well for an RDF workload would take such magical heuristics of location choice that implementing them would be hard and the result not altogether certain. I could get it to work eventually but hash partitioning seems by far the more predictable route. Also hash partitioning will work in shared nothing scenarios whereas RAC requires shared disks. Shared nothing will not require a SAN, which may make it somewhat lower cost. Also, if messages are grouped in large batches, the performance of the interconnect is not so critical, meaning that maybe even gigabit ethernet might do in cases. RAC style cache maintenance is more sensitive to interconnect latency than batched function shipping. Batched cache consistency is conceivable as discussed above but tough to do.
For recovery and hot software updates, things can be arranged if there is non-local disk access or if partitions are mirrored. A RAC type cluster could use a SAN with internal mirroring. A hash partitioned system could mirror partitions to more than one box with local disk, thus using no mirrored disks. Repartitioning remains the bane of partitioning and not much can be done about that, it seems. The only easy repartitioning is doubling the cluster size. So it seems.
]]>I have been away from the world for a few weeks, concentrating on technology.
We have now implemented an entirely new storage layout. With RDF data, we have now successfully doubled the working set.
This means that the number of triples that will fit in memory is doubled for any configuration. For any database in the hundreds of millions of triples, this is very significant. For LUBM data, we go from 75 bytes to 35 bytes per triple with the default indices.
This is obtained without using gzip or some other stream compression. Thus no decompression is needed at read time. Random access speeds are within 5% of those of Virtuoso v5.0.1, but the space requirement is halved and you can still locate a random triple in cache in a few microseconds.
What is better still, when using 8-byte IDs for IRIs instead of 4-byte ones, the space consumption stays almost the same since unique values are stored only once per page.
When applying gzip to the new storage layout, we usually get 3x compression. This means that 99% of 8K pages fit in 3K after compression. This is no real surprise since an index is repetitive pretty much by definition, even if the repeated sections are now shorter than in v5.0.1.
Gzip applied to pages does nothing for the working set since a page must remain random accessible for fast search but will cut disk usage to between half and a third. We will make this an option later. There are other tricks to be done with compression, like using a separate dictionary for non key text columns in relational applications. This would improve the working set in TPC-C and TPC-D quite a bit so we may do this also while on the subject.
Right now we are writing the clustering support, revising all internal APIs to run with batches of rows instead of single rows. We will most likely release clustering and the new storage layout together, towards the end of summer, at least in internal deployments.
I will blog about results as and when they are obtained, over the next few weeks.
We have a few new features that we did for the WWW 2007 conference that we will be shortly adding to the open source release.
IN predicate: The IN predicate with a list of values will now use an index if available. This is useful for SPARQL queries with multiple FROM graphs, for example. Another improvement turns a query like SELECT * FROM graph WHERE {?s ?p ?o} into a UNION of SELECT *'s from multiple tables of different width. Each term of the UNION will simply produce multiple 3-column result rows for each actual row while not having to run through the tables multiple times. Together with this, we have also fixed a number of things with the relational-to-RDF mapping. We have been testing this extensively with the Musicbrainz mapping by Fred Giasson. These changes are small and to be released shortly.
There are also some larger things in the works, to be released during this summer, the next post gives an overview of these.
]]>The topic of column-wise storage has not escaped us. We are not convinced that this is good for RDF. There is a point to this for business intelligence data warehouses, no doubt, although one could argue that one could get the same IO benefit with suitably selected covering indices, but this is more design work. Column storage fits in less space and is more versatile for unexpected workloads.
But we can look at the RDF case specifically. You have a quad of G, S, P, O. You have a one-part index on each and you have a unique row number for each quad. Given the row number, you must get the G, S, P, and O, and given any one of these, you must get the row numbers where this occurs. If there were multi-part keys, then this would be a row store with covering indices, like Virtuoso's RDF store.
Each datum is stored 8 times. What is nice is that one can use any combination of selection criteria with equal ease and in the same working set. With the RDF workload, you end up typically referencing all parts of each quad. It is not like in the business intelligence case where the typical query accesses 4 columns of the 15 column history table. Of the 4 RDF quad keys, at least 2 are generally given. So this becomes a merge intersection of two or three indices and random lookups for the unspecified columns. Complicated control path, even if the engine is meant to do this thing alone.
We'll have to try this. We could set up Virtuoso with 4 bitmap indices, each mapping a column to row IDs, and then a table with the 4 columns. Then we'd get bitmap ANDs for multi-column criteria and would have to get the row by row ID. As long as we run in memory, this should perform like a column store, close enough. We get the row with all the columns once, so we compensate for the fact that a column store has a special means for dereferencing the row ID for any column.
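A toy in-memory version of that setup might look as follows; the structures (a dict of value-to-row-ID sets per column plus a row table) are stand-ins for the bitmap indices and the 4-column table, chosen only to show the access pattern.

```python
from collections import defaultdict

rows = []                                              # row ID -> (g, s, p, o)
indexes = {col: defaultdict(set) for col in "gspo"}    # column -> value -> row IDs

def insert(g, s, p, o):
    rid = len(rows)
    rows.append((g, s, p, o))
    for col, val in zip("gspo", (g, s, p, o)):
        indexes[col][val].add(rid)

def lookup(**criteria):
    # AND the per-column row ID sets, then fetch the rows by row ID.
    id_sets = [indexes[col][val] for col, val in criteria.items()]
    hits = set.intersection(*id_sets) if id_sets else set()
    return [rows[rid] for rid in sorted(hits)]

insert("g1", "s1", "type", "Person")
insert("g1", "s1", "name", "Mary")
insert("g1", "s2", "type", "Person")
print(lookup(s="s1", p="type"))
```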
If we optimized this specially, which would not be so terribly hard, we'd have a column store. The main new thing would be making a special index by row ID that would have the ID just once per index leaf and a bitmap for dense allocation of row IDs. The rest is not too different.
For now, we will watch. If this is the next big thing, we can get there in little time.
]]>We often get questions on clustering support, especially around RDF, where databases quickly get rather large. So we will answer them here.
But first, a note on some supporting technology. We have an entirely new disk allocation and IO system. It is basically operational but needs some further tuning. It offers much better locality and much better sequential access speeds.
Especially for dealing with large RDF databases, we will introduce data compression. We have over the years looked at different key compression possibilities but have never been very excited by them, since they complicate random access to index pages, make for longer execution paths, require scraping data for one logical thing from many places, and so on. Anyway, now we will compress pages before writing them to disk, so the cache is in machine byte order and alignment while the disk is compressed. Since multiple processors are commonplace on servers, they can well be used for compression, that being such a nicely local operation, all in cache and requiring no serialization with other things.
Of course, what was fixed length now becomes variable length, but if the compression ratio is fairly constant, we reserve space for the expected compressed size and deal with the rare overflows separately. So there is no complicated shifting of data around when something grows.
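The reserve-and-overflow idea can be sketched as below, with zlib standing in for the page compressor; the page size matches the 8K pages mentioned elsewhere, while the reserved slot size is an assumption for illustration.

```python
import os
import zlib

PAGE_SIZE = 8192
RESERVED_SLOT = 4096          # space reserved for the expected compressed size

def write_page(page: bytes, slots: list, overflow: list) -> None:
    # Compress the page; store it in its fixed slot if it fits, else spill it.
    compressed = zlib.compress(page)
    if len(compressed) <= RESERVED_SLOT:
        slots.append(compressed)      # common case: fits the reserved space
    else:
        slots.append(None)            # rare overflow handled separately
        overflow.append(compressed)

slots, overflow = [], []
write_page(bytes(PAGE_SIZE), slots, overflow)         # highly compressible page
write_page(os.urandom(PAGE_SIZE), slots, overflow)    # incompressible page
print(len(overflow), "page(s) went to overflow")
```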
Once we are done with this, this could well be a separate intermediate release.
Now about clusters. We have for a long time had various plans for clusters but have not seen the immediate need for execution. With the rapid growth in the Linking Open Data movement and questions on web scale knowledge systems, it is time to get going.
How will it work? Virtuoso remains a generic DBMS, thus the clustering support is an across the board feature, not something for RDF only. So we can join Oracle, IBM DB2, and others at the multi-terabyte TPC races.
We introduce hash partitioning at the index level and allow for redundancy, where multiple nodes can serve the same partition. This allows load balancing of reads, replacement of failing nodes, and growth of the cluster without interruption of service.
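In outline, partitioning with redundant partition copies could look like the following sketch; the node names, replica count, and hash are invented for illustration and say nothing about the actual implementation.

```python
import random
from zlib import crc32

NODES = ["node0", "node1", "node2", "node3"]
REPLICAS = 2                      # each partition is served by this many nodes

def replica_set(key: str) -> list:
    # The partition's primary node plus the next nodes in ring order.
    start = crc32(key.encode()) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

def node_for_read(key: str) -> str:
    return random.choice(replica_set(key))    # load-balance reads over copies

def nodes_for_write(key: str) -> list:
    return replica_set(key)                   # writes go to every copy

print(replica_set("graph1|subject42"), node_for_read("graph1|subject42"))
```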
The SQL compiler, SPARQL, and database engine all stay the same. There is a little change in the SQL run time, not so different from what we do with remote databases at present in the context of our virtual database federation. There is a little extra complexity for distributed deadlock detection and sometimes multiple threads per transaction. We remember that one RPC round trip is 3-4 index lookups, so we pipeline things so as to move requests in batches, a few dozen at a time.
The cluster support will be in the same executable and will be enabled by configuration file settings. Administration is limited to one node, but Web and SQL clients can connect to any node and see the same data. There is no balancing between storage and control nodes because clients can simply be allocated round robin for statistically even usage. In relational applications, as exemplified by TPC-C, if one partitions by fields with an application meaning (such as warehouse ID), and if clients have an affinity to a particular chunk of data, they will of course preferentially connect to nodes hosting this data. With RDF, such affinity is unlikely, so nodes are basically interchangeable.
In practice, we develop in June and July. Then we can rent a supercomputer maybe from Amazon EC2 and experiment away.
We should just come up with a name for this. Maybe something astronomical, like star cluster. Big, bright but in this case not far away.
]]>We were at the WWW 2007 conference in Banff, Canada week before last. Virtuoso was a part of Alan Ruttenberg’s semantic web in health care and life sciences presentation. Alan had a database of 350M triples extracted from different biology and publication databases running on Virtuoso. We will also be experimenting on other biomedical datasets, both with real RDF and relational data mapped to RDF on demand.
Linking Open Data was a big thing at WWW 2007. There is quite a bit of momentum gathering around publishing publicly available data as RDF and making these data sets mutually joinable. Chris Bizer of the Free University of Berlin will be demonstrating Dbpedia linked with a number of other data sets, such as Geonames and Musicbrainz, at ESWC 2007 in a couple of weeks, also running on Virtuoso.
The last month or so has been spent mostly on the conference preparation and follow up, not to mention taking part in two EU project proposals. But now we are returning to normal operations and can do some technology for a change. More on this in the next post.
]]>Semantic web technology has on one hand SPARQL and on the other hand various visualization tools such as faceted browsers like Longwell, Facet, and others.
But these sides do not meet very well, because SPARQL does not support aggregation and grouping, which are the very basis of faceted browsing.
So we have looked at possible solutions. The first and most obvious is to add SQL style aggregation to SPARQL. We do this already since one can embed SPARQL into SQL but a SPARQL-only syntax for this would be nice, so as to avoid the need for a SQL client login and so as to use the SPARQL protocol.
The first part of the solution is to allow aggregates in a SPARQL select. The aggregates are directly inherited from SQL, meaning count, min, max, sum, avg, with an optional distinct modifier. All terms of a select that are not aggregates are considered grouping columns, so one gets an implicit group by when combining variables and aggregates in a single select.
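Purely as an illustration, a query using such aggregates could be sent over the SPARQL protocol roughly as below; the endpoint URL, the format parameter, and the exact aggregate syntax are assumptions based on the description above, not a statement of the shipped behavior.

```python
import urllib.parse
import urllib.request

# Non-aggregate select terms act as implicit grouping columns, so this
# hypothetical query would return one row per distinct ?class with its count.
query = "SELECT ?class count(*) WHERE { ?s a ?class }"

url = "http://localhost:8890/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "text/csv"})
with urllib.request.urlopen(url) as response:
    print(response.read().decode())
```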
This is straightforward and is already a big improvement. But for very large result sets in a browsing situation, we can do something better.
Let's think about DBpedia as an example. We have data about people — things like names, birth/death dates, birthplaces, text descriptions. Some drill-down could be supported with a traditional OLAP cube. But in the semantic web situation, the number of dimensions can be high and the dimensions will usually be sparse. Besides, the number of dimensions is liable to change. Precomputed indices of everything are not the best choice here even though for well-defined analytics they are fine.
For an overview of the data, we can precompute some things. For example, the set of distinct graphs and their respective triple counts, also for each graph/predicate combination. These do not have to be absolutely up-to-date, and provide a quick first level of directory. Also, the count of instances of all classes is a candidate for precomputing. Same with count of triples grouped by class of S, graph.
Once we start talking about specific Ss or non-class Os, it should either be possible to count the triples or to count them up to a maximum. A count aggregate function that stops the query when the count reaches a certain maximum might be useful. This would give small counts with precision, but for larger cardinalities it would simply say that the result is more than a given limit. In some cases this can be extrapolated to an actual count with fair precision, but not always.
The more factors are given, the faster it is to count. For example, if we have a query that looks for an S with P1 = O1 and P2 = O2, this is a merge intersection which is very quick, especially if the intersection is a small fraction of the number of rows.
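The merge intersection itself is the textbook two-pointer scan over sorted ID lists, for instance:

```python
def merge_intersect(a: list, b: list) -> list:
    # Intersect two ascending ID lists in O(len(a) + len(b)).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

subjects_p1_o1 = [3, 8, 15, 42, 97]      # subjects where P1 = O1
subjects_p2_o2 = [8, 15, 16, 97, 120]    # subjects where P2 = O2
print(merge_intersect(subjects_p1_o1, subjects_p2_o2))   # [8, 15, 97]
```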
We can also look at the use of full text conditions with browsing. This makes precomputed counts pretty much useless once there is a text condition in the mix since a text condition is not something that can be mapped on a dimension of a cube. In this situation, it would seem appropriate to count to a certain maximum and then stop. Finding the first few matches with an AND of text and equalities on RDF graph objects or subjects is always quick. Depending on the case, one may even scale the count upwards to an estimate by looking at how far one got on the outermost loop of the joins before reaching the count ceiling. There is a little more to this since all joins are not nested loops but this is the general idea.
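A sketch of counting up to a ceiling and then extrapolating from the progress of the outermost loop follows; the inputs and ceiling value are invented for illustration.

```python
def count_up_to(matches, ceiling: int):
    # Returns (count, exact); exact is False if the ceiling was reached.
    n = 0
    for _ in matches:
        n += 1
        if n >= ceiling:
            return n, False
    return n, True

def extrapolate(partial: int, outer_done: int, outer_total: int) -> int:
    # Scale the partial count by how far the outermost loop had progressed.
    return int(partial * outer_total / max(outer_done, 1))

count, exact = count_up_to(iter(range(100000)), ceiling=1000)
estimate = extrapolate(count, outer_done=10, outer_total=1000)
print(count, "exact" if exact else "or more", "; extrapolated:", estimate)
```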
Relational DBMS often use random sampling of tables for optimization statistics. With RDF, this does not seem to work so well. Or rather, the usual types of stats are quite useless if all triples are in the same table. This is why we take actual samples at optimize time whenever leading key parts have literal values. Further, we take a small random sample of all triples and remember the distinct P and G values, as these are of low cardinality with highly uneven distribution. This gives us a fairly reliable idea of the relative frequencies of the more common Gs and Ps. This is good for query optimization but not necessarily for browsing. Good enough for sorting by frequency but not quite good enough for showing counts.
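In miniature, the sampling of P and G values amounts to something like the following, with synthetic data standing in for the quad table:

```python
import random
from collections import Counter

quads = [(f"g{random.randint(0, 2)}", f"s{i}",
          f"p{random.randint(0, 9)}", f"o{i}") for i in range(100000)]

sample = random.sample(quads, 1000)            # small random sample of all triples
g_freq = Counter(g for g, _, _, _ in sample)   # distinct G values and frequencies
p_freq = Counter(p for _, _, p, _ in sample)   # distinct P values and frequencies

print(g_freq.most_common(3))
print(p_freq.most_common(3))
```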
Anyway, we will start by introducing the basic SQL aggregates and then add a thing for limiting how much gets counted. This will be precise for small counts and give an order of magnitude for larger counts.
]]>We are a couple of days from releasing the Virtuoso Open Source 5.0 cut. This will make the technology that we are showing with Dbpedia and the various OpenLink web sites available to the public.
The updates involve:
Soon to follow are:
Existing databases will be automatically upgraded when started with the new Virtuoso 5.0 server. Note that after upgrade, the RDF data is not backward compatible.
We will be rolling out more Virtuoso hosted semantic web content in the Linking Open Data project, part of our participation in the Semantic Web Education and Outreach activity at W3C.
]]>Last time we talked about database engine and transactions. Now we have come to the realm of query processing in our revisiting of the DBMS side of Virtuoso.
Now the well established, respectable standard benchmark for the basics of query processing is TPC D, with its derivatives H and R. So, for testing how different SQL optimizers manage the 22 queries, we have run a mini version of the D queries with a 1% scale database, some 30 MB of data, all in memory. This basically catches whether SQL implementations miss some of the expected tricks and how efficient in-memory loop joins, hash joins, and aggregation are.
When we get to our next stop, high volume I/O, we will run the same with D databases in the 10G ballpark.
The databases were tested on the same machine, with warm cache, taking the best run of 3. All had full statistics and were running with read committed isolation, where applicable. The data was generated using the procedures from the Virtuoso test suite. The Virtuoso version tested was 5.0, to be released shortly. The MySQL was 5.0.27, the PostgreSQL was 8.1.6.
Query times in milliseconds:

Query | Virtuoso | PostgreSQL | MySQL | MySQL with InnoDB |
--- | --- | --- | --- | --- |
Q1 | 206 | 763 | 312 | 198 |
Q2 | 4 | 6 | 3 | 3 |
Q3 | 13 | 51 | 254 | 64 |
Q4 | 4 | 16 | 24 | 60 |
Q5 | 15 | 22 | 64 | 68 |
Q6 | 9 | 70 | 189 | 65 |
Q7 | 52 | 143 | 211 | 84 |
Q8 | 29 | 31 | 13 | 11 |
Q9 | 36 | 114 | 97 | 61 |
Q10 | 32 | 51 | 117 | 57 |
Q11 | 16 | 9 | 12 | 10 |
Q12 | 8 | 21 | 18 | 130 |
Q13 | 18 | 74 | - | - |
Q14 | 7 | 21 | 418 | 1425 |
Q15 | 14 | 43 | 389 | 122 |
Q16 | 16 | 22 | 18 | 25 |
Q17 | 1 | 54 | 26 | 10 |
Q18 | 82 | 120 | - | - |
Q19 | 19 | 8 | 2 | 17 |
Q20 | 7 | 15 | 66 | 52 |
Q21 | 34 | 86 | 524 | 278 |
Q22 | 4 | 323 | 3311 | 805 |
Total (msec) | 626 | 2063 | 6068 | 3545 |
We lead by a fair margin but MySQL is hampered by obviously getting some execution plans wrong and not doing Q13 and Q18 at all, at least not under several tens of seconds; so we left these out of the table in the interest of having comparable totals.
As usual, we also ran the workload on Oracle 10g R2. Since Oracle does not like their numbers being published without explicit approval, we will just say that we are even with them within the parameters described above. Oracle has a more efficient decimal type so it wins where that is central, as on Q1. Also it seems to notice that the GROUP BYs of Q18 are produced in order of grouping columns, so it needs no intermediate table for storing the aggregates. If we addressed these matters, we'd lead by some 15% whereas now we are even. A faster decimal arithmetic implementation may be in the release after next.
In the next posts, we will look at IO and disk allocation, and also return to RDF and LUBM.
]]>As previously said, we have a Virtuoso with brand new engine multithreading. It is now complete and passes its regular test suite. This is the basis for Virtuoso 5.0, to be available as the open source and commercial cuts as before.
As one benchmark, we used the TPC-C test driver that has always been bundled with Virtuoso. We ran 100000 new orders worth of the TPC-C transaction mix, first with one client and then with 4 clients, each client going to its own warehouse, so there was not much lock contention. We did this on a 4 core Intel, with the working set in RAM. With the old engine, 1 client took 1m43s and 4 clients took 3m47s. With the new one, one client took 1m30s and 4 clients took 2m37s. So, 400000 new orders in 2m37s, for 152820 new orders per minute as opposed to 105720 per minute previously. Do not confuse this with the official tpmC metric; that one involves a whole bunch of further rules.
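For reference, the throughput figures follow directly from the run times; recomputing them (the small differences from the figures above come from rounding the run times to whole seconds):

```python
def new_orders_per_minute(orders: int, minutes: int, seconds: int) -> float:
    return orders * 60.0 / (minutes * 60 + seconds)

print(round(new_orders_per_minute(400_000, 3, 47)))   # old engine, roughly 105,700
print(round(new_orders_per_minute(400_000, 2, 37)))   # new engine, roughly 152,900
```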
TPC-C has activity spread over a few different tables. With tests dealing with fewer tables, improvements in parallelism are far greater.
Aside from better parallelism, we have other features. One of them is a change in the read committed isolation, so that we now return the previous committed state for uncommitted changed rows instead of waiting for the updating transaction to terminate. This is similar to what Oracle does for read committed. Also we now do log checkpoints without having to abort pending write transactions.
When we have faster inserts, we actually see the RDF bulk loader run slower. This is really backwards. The reason is that while one thread parses, other threads insert, and if the inserting threads are done they go to wait on a semaphore; this whole business of context switching absolutely kills performance. With slower inserts, the parser keeps ahead so there is less context switching, hence better overall throughput. I still do not get how the OS can spend between 1.5 and 6 microseconds, several thousand instructions, deciding what to do next when there are only 3-4 eligible threads and all the rest is background which gets a few dozen time slices per second. Solaris is a little better than Linux at this but not dramatically so. Mac OS X is way worse.
As said, we use Oracle 10g R2 on the same platform (Linux FC5 64 bit) for sparring. It is really a very good piece of software. We have written the TPC C transactions in PL/SQL. What is surprising is that these procedures run amazingly slowly, even with a single client. Otherwise the Oracle engine is very fast. Well, as I recall, the official TPC C runs with Oracle use an OCI client and no stored procedures. Strange. While Virtuoso, for example, fills the initial TPC C state a little faster than Oracle, the procedures run 5-10 times slower with Oracle than with Virtuoso, all data in warm cache and a single client. While some parts of Oracle are really well optimized (all the basic joins, aggregates, etc.), we are surprised at how they could have neglected such a central piece as the PL.
Also, we have looked at transaction semantics. Serializable is mostly serializable with Oracle but does not always keep a steady count. Also it does not prevent inserts into a space that has been found empty by a serializable transaction. True, it will not show these inserts to the serializable transaction, so in this it follows the rules. Also, to make a read really repeatable, it seems that the read has to be FOR UPDATE. Otherwise one cannot implement a reliable resource transaction, like changing the balance of an account.
Anyway, the Virtuoso engine overhaul is now mostly complete. This is of course an open ended topic but the present batch is nearing completion. We have gone through as many as 3 implementations of hash joins; some things have yet to be finished there. Oracle has very good hash joins. The only way we could match that was to do it all in memory, dropping any persistent storage of the hash. This is of course OK if the hash is not very large, and anyway hash joins go sour if the hash does not fit in the working set.
As next topics, we have more RDF and the LUBM benchmark to finish. Also we should revisit TPC-D.
Databases are really quite complicated and extensive pieces of software. Much more so than the casual observer might think.
]]>It's been a long and very busy time since the last blog post.
Now and then, circumstances call for a return to the contemplation of first principles. I have lately beheld the Platonic ideal of database-ness and translated it into engineering elegance. No quest is static and no objective is permanently achieved.
Accordingly, I have redone all Virtuoso core engine structures for control of parallel execution. As we now routinely get multiple cores per chip, this is more important than before. Aside from dramatic improvements in multiprocessor performance, there is also quite a bit of optimization for basic relational operations.
Of course, this is not for the pure pleasure of geek-craft; it serves a very practical purpose. RDF opens a new database frontier, where these things make a significant difference. In application scenarios involving either federated/virtual database or running typical web applications, the core concurrency of the DBMS is not really the determining factor. However, with RDF, we get a small number of very large tables and most processing goes to these tables. This is also often so with business intelligence but it is still more so with RDF. Thus the parallelism within a single index becomes essential.
We have also made a point by point comparison of Virtuoso and Oracle 10g for basic relational operations. Oracle is very good, certainly in the basic relational operations like table scans and different kinds of joins. As a matter of principle, we will at the minimum match Oracle in all these things, in single and multiprocessor environments. The Virtuoso cut forthcoming in January will have all this inside. We are also considering making and publishing a basic RDBMS performance checklist, aimed at comparing specific aspects of relational engine performance. While the TPC tests give a good aggregate figure, it is sometimes interesting to look at a finer level of detail. We may not be allowed to give out numbers in all cases due to license terms but we can certainly make the test available and publish numbers for those who do not object to this.
Of course, RDF is the direct beneficiary of all these efforts, since RDF loading and querying basically rests on the performance of very relational things, such as diverse types of indices and joins.
More information will be forthcoming in January.
Merry Christmas and productive new year to all.
]]>We have updated our article on Virtuoso scalability with two new platforms: A 2 x dual core Intel Xeon and a Mac Mini with an Intel Core Duo.
We have more than quadrupled the best result so far.
The best score so far is 83K transactions per minute with a 40 warehouse (about 4G) database. This is attributable to the process running mostly in memory, with 3 out of 4 cores busy on the database server. But even when doubling the database size and the number of clients, we stay at 49K transactions per minute, now with a little under 2 cores busy and an average of 20 disk reads pending at all times, split over 4 SATA disks. The measurement is the count of completed transactions during a 1h run. With the 80 warehouse database, it took about 18 minutes for the system to reach steady state, with a warm working set, hence the actual steady rate is somewhat higher than 49K, as the warm up period was included in the measurement.
The metric on the Mac Mini was 2.7K with 2G RAM and one disk. The CPU usage was about one third of one core. Since we have had rates of over 10K with 2G RAM, we attribute the low result to running on a single disk which is not very fast at that.
We have run tests in 64 and 32 bit modes but have found little difference as long as actual memory does not exceed 4g. If anything, 32 bit binaries should have an advantage in cache hit rate since most data structures take less space there. After the process size exceeds the 32 bit limit, there is a notable difference in favor of 64 bit. Having more than 4G of database buffers produces a marked advantage over letting the OS use the space for file system cache. So, 64 bit is worthwhile but only if there is enough memory. As for X86 having more registers in 64 bit mode, we have not specifically measured what effect that might have.
We also note that Linux has improved a great deal with respect to multiprocessor configurations. We use a very simple test with a number of threads acquiring and then immediately freeing the same mutex. On single CPU systems, the real time has pretty much increased linearly with the number of threads. On multiprocessor systems, we used to get very non-linear behavior, with 2 threads competing for the same mutex taking tens of times the real time as opposed to one thread. At last measurement, with a 64 bit FC 5, we saw 2 threads take 7x the real time when competing for the same mutex. This is in the same ballpark as Solaris 10 on a similar system. Mac OS X 10.4 Tiger on a 2x dual core Xeon Mac Pro did the worst so far, with two threads taking over 70x the time of one. With a Mac Mini with a single Core Duo, the factor between one thread and two was 73.
Also the proportion of system CPU on Tiger was consistently higher than on Solaris or Linux when running the same benchmarks. Of course for most applications this test is not significant but it is relevant for database servers, as there are many very short critical sections involved in multithreaded processing of indices and the like.
]]>We have been extensively working on virtual database refinements. There are many SQL cost model adjustments to better model distributed queries, and we now support direct access to Oracle and Informix statistics system tables. Thus, when you attach a table from one or the other, you automatically get up-to-date statistics. This helps Virtuoso optimize distributed queries. Also the documentation is updated as concerns these, with a new section on distributed query optimization.
On the applications side, we have been keeping up with the SIOC RDF ontology developments. All ODS applications now make their data available as SIOC graphs for download and SPARQL query access.
What is most exciting however is our advance in mapping relational data into RDF. We now have a mapping language that makes arbitrary legacy data in Virtuoso or elsewhere in the relational world RDF query-able. We will put out a white paper on this in a few days.
Also we have some innovations in mind for optimizing the physical storage of RDF triples. We keep experimenting, now with our sights set on the high end of triple storage, towards billion triple data sets. We are experimenting with a new, more space efficient index structure for better working set behavior. Next week will yield the first results.
]]>I just got a Das Keyboard. After my old IBM keyboard I'd had for probably 10 years broke, I had been going through a number of the crappy keyboards you buy at the corner PC store for 15 euros, also a newer IBM keyboard, but they don't make them anymore like they used to. Except for the Das Keyboard. This is at least as good as the old IBM thing. It does not miss or duplicate keystrokes, it clicks, it has a solid feel, everything that the squeaky travesties of keyboards you get for 15 euros don't have. It is not even expensive, only 99 euros or 89 dollars. It really makes a difference.
]]>This post presents some ideas and use cases for RDF store benchmarking.
An RDF benchmark suite should meet the following criteria:
The query load should illustrate the following types of operations:
If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to TPC C. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.
The report contains basic TPC-like items such as:
These can go into a summary spreadsheet that is just like the TPC ones.
Additionally, the full report should include:
OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads. For example, 10000 users with 100 threads, each user with its own state, so that they carry out their respective usage patterns independently, getting served as soon as the server is available, still having no more than m requests going at any time. The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart etc. This can be modified to browse a social network database and produce the desired query mix. This generates HTTP requests, hence would work against a SPARQL end point or any set of dynamic web pages.
The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.
This can be packaged as a separate open source download once the test spec is agreed upon.
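The actual driver described above is a multithreaded C program; as a rough illustration of the same structure, a toy Python analogue could look like this, with a sleep standing in for the HTTP request and all counts invented.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

N_USERS, M_THREADS, CLICKS_PER_USER = 100, 10, 5

def do_request(user_id: int, step: int) -> float:
    # Stand-in for one HTTP request of the user's usage pattern.
    start = time.time()
    time.sleep(random.uniform(0.001, 0.01))
    return time.time() - start

def run_user(user_id: int) -> list:
    # Each simulated user keeps its own state and walks its own click pattern.
    return [do_request(user_id, step) for step in range(CLICKS_PER_USER)]

with ThreadPoolExecutor(max_workers=M_THREADS) as pool:
    times = [t for user_times in pool.map(run_user, range(N_USERS)) for t in user_times]

print(f"clicks: {len(times)}, min/avg/max ms: "
      f"{min(times) * 1e3:.1f} / {statistics.mean(times) * 1e3:.1f} / {max(times) * 1e3:.1f}")
```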
For generating test data, a modification of the LUBM generator is probably the most convenient choice.
This area is somewhat more complex than triple storage.
At least the following factors enter into the evaluation:
The rationale for mapping relational data to RDF is often data integration. Even in simple cases like the OpenLink Data Spaces applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.
A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.
A real world case is OpenLink's ongoing work for mapping WordPress, Mediawiki, phpBB, Drupal, and possibly other popular web applications into SIOC.
Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.
Another "enterprise style" scenario might be to take the TPC C and TPC D databases — after all both have products, customers and orders — and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.
Considering the times and the audience, the WordPress/Mediawiki scenario might be culturally more interesting and more fun to demo.
The test has two aspects: Throughput and coverage. I think these should be measured separately.
The throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."
Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known only at run time, or if the graph is not given, we get a union of everything joined with another union of everything; many of the joins between the terms of the different unions are identically empty, but the software may not know this.
In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".
It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.
]]>SPARQL End Point Self Description
I was at the ISWC 2006 conference a week back. One of the items discussed there, at least informally, was the topic of SPARQL end point discovery. I have below put together a summary of points that were discussed and of my own views on their possible resolution.
This is intended as a start for conversation and as a summary of ideas.
Self-description of end points may serve at least the following purposes:
We will look at each one in turn.
The end point should give a ballpark cardinality for the following combinations of G, S, P, O.
Based on our experience, these are the most interesting questions, but for completeness, the end point might offer an API allowing one to specify a constant or wildcard for each of the four parts of a quad. If the information is not readily available, "unknown" could be returned, together with the count of triples in the whole end point or the graph, if the graph is specified. Even if the end point does not support real time sampling of data for cardinality estimates, it would at least have an idea of the count of triples per graph, which is still far better than nothing.
Given the full SPARQL request, the end point could return the following data, without executing the query itself.
All these elements would be optional.
This somewhat overlaps with the optimization questions but it may still be the case that it is more efficient to support a special interface for the optimization related questions.
]]>We have lately been busy with RDF scalability. We work with the 8000 university LUBM data set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h.
The resulting database is 75G, 74 bytes per triple which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly filled pages. See Advances in Virtuoso RDF Triple Storage for an in-depth treatment of the subject.
The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range of the whole key. But before we go to so much trouble, we'll look at dropping a couple of critical sections from index random access. Also some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still, we should not waste too much time on platform specifics. We'll see.
We just updated the Virtuoso Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.
We are also now applying the relational to RDF mapping discussed in Declarative SQL Schema to RDF Ontology Mapping to the ODS applications.
There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.
We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find that queries compile OK for many SIOC use cases with the cost model as it now stands. A more systematic review of the cost model for SPARQL will come when we get to the queries.
We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the SQL engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.
]]>We have made new benchmarks with loading the 47 million triples of the Wikipedia links data set. So far, our best result is 40 minutes with a dual core Xeon with 8G memory. This comes to about 18000 triples per second with between 1.2 and 2 CPU cores busy, slightly depending on configuration parameters. Our previous best result was with a dual 1.6GHz SPARC with 7700 triples per second on loading the 2M triple Wordnet data set.
These are memory based speeds. We have implemented an automatic background compaction for database tables and have tried the Wikipedia load with and without. The CPU cost of the compaction was about 10% with a slight gain in real time due to less IO.
But the real deal remains IO. With the compaction on, we got 91 bytes per triple, all included, i.e., two indices on the triples table, dictionaries from IRI IDs to URIs, etc. The compaction is rather simple — it just detects adjacent dirty pages about to be written to disk and sees if the set of contiguous dirty pages would fit on fewer pages than they now take. If so, it rewrites the pages and frees the ones left over. It does not touch clean pages. With some more logic it could also compact clean pages, provided the result did not have more dirty pages than the initial situation. With more aggressive compaction we will get about 75 bytes per triple. We will try this.
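A toy model of that compaction pass: scan for runs of adjacent dirty pages and repack each run onto fewer pages when the combined fill allows it. Page capacity and fill figures below are invented, and the real code of course works on index pages rather than tuples of numbers.

```python
PAGE_CAPACITY = 8192

def compact(pages: list) -> list:
    # pages: list of (dirty, used_bytes); clean pages are never touched.
    out, run = [], []

    def flush(run: list) -> None:
        if not run:
            return
        total = sum(used for _, used in run)
        needed = -(-total // PAGE_CAPACITY)            # ceiling division
        if needed < len(run):                          # run fits on fewer pages
            out.extend([(True, PAGE_CAPACITY)] * (needed - 1))
            out.append((True, total - PAGE_CAPACITY * (needed - 1)))
        else:
            out.extend(run)

    for page in pages:
        if page[0]:
            run.append(page)
        else:
            flush(run)
            run = []
            out.append(page)
    flush(run)
    return out

pages = [(True, 4000), (True, 3000), (True, 2000), (False, 6000), (True, 5000)]
print(compact(pages))   # the first three dirty pages collapse onto two
```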
But the real gains will come from index compression with bitmaps. For the Wikipedia data set, this will cut one of the indices to about a third of its current size. This is also the index with the more random access, so the benefit is compounded in terms of working set. At that point we will be looking at about 50 bytes per triple. We will see next week how this works with the LUBM RDF benchmark.
]]>We have released an update of Virtuoso Open Source Edition and the OpenLink Data Spaces suite.
This marks the coming of age of our RDF and SPARQL efforts. We have the new SQL cost model with SPARQL awareness, and we have applications which present much of their data as SIOC, FOAF, ATOM OWL, and other formats.
We continue refining these technologies. Our next roadmap item is mapping relational data into RDF and offering SPARQL access to relational data without data duplication. Expect a white paper about this soon.
]]>There is a new paper Implementing an RDF Triple Store using an ORDBMS at the Virtuoso wiki.
This paper summarizes how we have extended Virtuoso's SQL and database engine to better accommodate storing RDF triples and optimizing queries of RDF data. This is the first of a series. The next will concern mapping relational databases onto RDF ontologies for SPARQL access.
This paper concerns the next Virtuoso Open Source release, to be available for download a few days from this posting.
]]>Following from the post on a new multithreaded RDF loader, here are some intermediate results and action plans based on these.
The experiments were made on a dual 1.6GHz Sun SPARC with 4G RAM and 2 SCSI disks. The data sets were the 48M triple Wikipedia data set and the 1.9M triple Wordnet data set. 100% CPU means one CPU constantly active. 100% disk means one thread blocked on the read system call at all times.
Starting with an empty database, loading the Wikipedia set took 315 minutes, amounting to about 2500 triples per second. After this, loading the Wordnet data set with cold cache and 48M triples already in the table took 4 minutes 12 seconds, amounting to 6838 triples per second. Loading the Wikipedia data had CPU usage up to 180% but over the whole run CPU usage was around 50% with disk I/O around 170%. Loading the larger data set was significantly I/O bound while loading the smaller set was more CPU bound, yet was not at full 200% CPU.
The RDF quad table was indexed on GSPO and PGOS. As one would expect, the bulk of I/O was on the PGOS index. We note that the pages of this index were on average only 60% full. Thus the most relevant optimization seems to be to fill the pages closer to 90%. Since the same rows would then fit on roughly two-thirds as many pages, this will directly cut about a third of all I/O, plus it will have an additional windfall benefit in the form of better disk cache hit rates resulting from a smaller database.
The most practical way of having full index pages in the case of unpredictable random insert order will be to take sets of adjacent index leaf pages and compact the rows so that the last page of the set goes empty. Since this is basically an I/O optimization, this should be done when preparing to write the pages to disk, hence concerning mostly old dirty pages. Insert and update times will not be affected since these operations will not concern themselves with compaction. Thus the CPU cost of background compaction will be negligible in comparison with writing the pages to disk. Naturally this will benefit any relational application as well as free text indexing. RDF and free text will be the largest beneficiaries due to the large numbers of short rows inserted in random order.
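A minimal sketch of the compaction step, in Python purely for illustration (the page capacity, fill target, and "row as bytes" model are assumptions, not Virtuoso's buffer manager): rows from a run of adjacent dirty leaf pages are repacked so that the leading pages approach the target fill and the trailing pages go empty and can be dropped from the write set.

```python
# Illustrative only: repack rows from adjacent dirty leaf pages just before
# they are written out, so leading pages are ~90% full and trailing pages
# become empty. Page capacity and fill target are assumed values.

PAGE_CAPACITY = 8192           # bytes usable per leaf page (assumed)
TARGET_FILL = 0.9              # aim for ~90% full pages on disk

def compact_adjacent_leaves(pages):
    """pages: list of lists of row serializations (bytes), in key order."""
    rows = [row for page in pages for row in page]     # preserves key order
    budget = int(PAGE_CAPACITY * TARGET_FILL)
    repacked, current, used = [], [], 0
    for row in rows:
        if used + len(row) > budget and current:       # start a new page
            repacked.append(current)
            current, used = [], 0
        current.append(row)
        used += len(row)
    if current:
        repacked.append(current)
    empties = len(pages) - len(repacked)               # pages freed by compaction
    return repacked + [[] for _ in range(empties)]
```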
Looking at the CPU usage of the tests, locating the place in the index where to insert, which by rights should be the bulk of the time cost, was not very significant, only about 15%. Thus there are many unused possibilities for optimization, for example rewriting in C some parts of the loader currently done as stored procedures. Also the thread usage of the loader, with one thread parsing and mapping IRI strings to IRI IDs and 6 threads sharing the inserting, could be refined for better balance, as we have noted that the parser thread sometimes forms a bottleneck. Doing the updating of the IRI name to IRI ID mapping on the insert thread pool would produce some benefit.
Anyway, since the most important test was I/O bound, we will first implement some background index compaction and then revisit the experiment. We expect to be able to double the throughput of the Wikipedia data set loading.
]]>Continuing on from the previous post... If Microsoft opens the right interfaces for independent developers, we see many exciting possibilities for using ADO.NET 3.0 with Virtuoso.
Microsoft quite explicitly states that their thrust is to decouple the client side representation of data as .NET objects from the relational schema on the database. This is a worthy goal.
But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: towards object oriented databases (OODBMS) and towards making applications for the semantic web.
In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going as it were in the other direction from Microsoft's intended decoupling. For example, we could do typical OODBMS tricks such as pre-fetch of objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together, and if all objects are modeled as subclasses (sub-tables) of a common superclass, then, regardless of instance type, what is created together (has consecutive IDs) will indeed tend to cluster on the same page. These tricks can deliver good results in very navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback.
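As a minimal illustration of the clustering-based pre-fetch idea (Python standing in for a .NET client; the page size and fetch function are assumptions, not any actual Virtuoso or ADO.NET interface):

```python
# Illustrative sketch of clustering-based prefetch: objects with consecutive
# IDs are assumed to live on the same "page", so fetching one object warms
# the client cache with its neighbours in a single round trip.

PAGE_SIZE = 64   # objects per page, an assumption for the sketch

class PrefetchingClient:
    def __init__(self, fetch_page):
        self.fetch_page = fetch_page   # callable: page_no -> {obj_id: object}
        self.cache = {}

    def get(self, obj_id):
        if obj_id not in self.cache:
            page_no = obj_id // PAGE_SIZE
            # one round trip brings in everything created around the same time
            self.cache.update(self.fetch_page(page_no))
        return self.cache.get(obj_id)
```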
But what is more interesting and more topical in the present times is making clients for the RDF world. There, the OWL ontology could be used to make the .NET classes, and the DBMS could, when returning URIs serving as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as "proxies" of these RDF objects. Of course, only predicates for which the client has a representation are relevant, thus some client-server handshake is needed at the start. What data could be pre-fetched is roughly the intersection of a concise bounded description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type is not known or inferable could be left out or represented as a special class with name-value pairs for its attributes; the same goes for blank nodes.
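A rough sketch of such proxy materialization, with plain Python classes standing in for generated .NET classes; the ontology, class names, and predicate names are invented for illustration only:

```python
# Turn a bag of triples into client-side proxy objects: typed instances where
# the class is known, IRIs become pointers, multi-valued predicates become
# lists, and unknown types fall back to a generic name-value class.

from collections import defaultdict

class Person:                     # would be generated from the ontology
    def __init__(self):
        self.name = None
        self.knows = []           # multi-valued predicate -> list

class GenericResource:            # fallback for IRIs of unknown type
    def __init__(self):
        self.properties = defaultdict(list)

CLASS_FOR_TYPE = {"http://xmlns.com/foaf/0.1/Person": Person}

def materialize(triples):
    """triples: iterable of (subject_iri, predicate_iri, obj) tuples."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))
    # first pass: one instance per subject, typed if the type is known
    instances = {}
    for s, props in by_subject.items():
        types = [o for p, o in props if p.endswith("type")]
        cls = CLASS_FOR_TYPE.get(types[0]) if types else None
        instances[s] = cls() if cls else GenericResource()
    # second pass: fill attributes; IRI objects become pointers to instances
    for s, props in by_subject.items():
        obj = instances[s]
        for p, o in props:
            value = instances.get(o, o)
            if isinstance(obj, Person):
                if p.endswith("name"):
                    obj.name = value
                elif p.endswith("knows"):
                    obj.knows.append(value)
            else:
                obj.properties[p].append(value)
    return instances
```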
In this way, .NET's considerable UI capabilities could directly be exploited for visualizing RDF data, only given that the data complies reasonably well with a known ontology.
If a SPARQL query returned a result-set, IRI type columns would be returned as .NET instances and the server would pre-fetch enough data for filling them in. For a CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an Entity SQL string, these could possibly be specialized to allow for a SPARQL string instead. LINQ might have to be extended to allow for SPARQL type queries, though.
Many of these questions will be better answerable as we get more details on Microsoft's forthcoming ADO .NET release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.
]]>I have recently read some of Microsoft's ADO .NET 3 papers. I am reminded of the distant past when I designed Kubl, which later became OpenLink Virtuoso. So I will reminisce and speculate a little.
So now is the time when polymorphic queries and mixing relational style joins and object style navigation become politically acceptable and even recommended and there finally is a workable solution to having a foreign key in the database and a pointer or set of pointers in the client application. Not to mention change tracking so as to be able to update in-memory data structures and commit a delta against the database without explicit update statements.
All these questions existed already in the mid 90s and earlier. Since I was coming from OO and LISP into the database world, I even felt these questions to be important. The solution in the earliest Kubl was to have inheritance between tables, what became the SQL 2K UNDER clause, and a virtual column called _ROW that would select a serialization of the primary key entry. Then there was the function row_key(), which when applied to a _ROW virtual column would return a database-wide unique identifier of the row, containing the key info and the key part values plus which subtable of the table was at hand. Then there was a function for dereferencing a row_key for getting the _ROW. And one could store row_keys into columns and dereference these in queries. Within SQL, one could use the row_column function to extract individual column values from a row_key or _ROW.
This was all fine server side. But we also had a client for Franz Inc.'s Allegro Common Lisp that talked to Kubl's ODBC listener. This client had the basic statements and prepared statements and result sets, parameters and array parameters, a little like JDBC does now. But the extra was that we could do a mapping between a Lisp struct or object and a database key, so the _ROW would automatically materialize into the Lisp struct or class instance. And the mapping between these materializations and the row_keys identifying them in the database was kept in a thread environment called object space. Updates could be relational-style UPDATEs or consist of putting a _ROW serialization in database format back into the Kubl store with a single SQL function.
This was different from just storing object serializations into LOB columns, as is often done, insofar as the object classes and data members were really database tables and columns, thus native to the DBMS, not just opaque data to be processed client-side only.
So it was then possible to program a little like what is shown in the ADO .NET 3 demos today, some ten years later.
Some of these functions still exist in Virtuoso, albeit in a deprecated state, and there is no client that can use these to any advantage. Indeed, we dropped this line of work when Kubl became Virtuoso, mostly because there was no standard and no client applications that would use such features. Instead, we concentrated on virtual RDBMS, transparently accessing any third party data via ODBC.
Now however, as objects, both native SQL and Java and .NET, have become mainstream citizens of relational databases in general, Virtuoso and otherwise, and as Microsoft has legitimized accessing whole objects and not only scalar columns in result sets as part of ADO .NET 3, these things might be worth a second look.
]]>We have been playing with the Wikipedia3 RDF data set, 48 million triples or so. We have for a long time foreseen the need for a special bulk loader for RDF but this brought this into immediate relevance.
So I wrote a generic parallel extension to Virtuoso/PL and SQL. This consists of a function for creating a queue that will feed async requests to be served on a thread pool of configurable size. Each of the worker threads has its own transaction, and the owner of the thread pool can look at or block for the return states of individual requests. This is a generic means for delegating work to async threads from Virtuoso/PL. Of course this can also be used at a lower level for parallelizing single SQL queries, for example aggregation of a large table or creating an index on a large table. Many applications, such as the ODS Feed Manager, will also benefit, since this makes it more convenient to schedule parallel downloads from news sources and the like. This extension will make its way into the release after next.
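For illustration, here is the same pattern expressed with Python's standard thread pool, as a conceptual analog only, not the Virtuoso/PL interface: requests are fed to a fixed-size pool, and the owner can either peek at or block on each request's return state.

```python
# Conceptual analog of the async queue + thread pool construct. In the real
# extension each worker would run its own transaction; here work() is just a
# stand-in for a unit of delegated work.

from concurrent.futures import ThreadPoolExecutor

def work(item):
    return item * item            # placeholder for the delegated work

with ThreadPoolExecutor(max_workers=6) as pool:         # configurable pool size
    futures = [pool.submit(work, i) for i in range(20)] # feed async requests
    finished = [f for f in futures if f.done()]         # "look at" request states
    results = [f.result() for f in futures]             # or block for all of them
```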
But back to RDF. We presently have the primary key of the triple store as GSPO and a second index as PGOS. Using this mechanism, we will experiment with different multithreaded loading configurations. One thread translates from the IRI text representation to the IRI IDs, one thread may insert into the GSPO index, which is typically local and a few threads will share the inserting into the PGOS key. The latter key is inserted in random order, whereas the former is inserted mainly in ascending order when loading new data. In this way, we should be able to keep full load on several CPUs and even more disks.
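Here is a minimal sketch of that division of labor, in Python for illustration only; the parse and insert functions, batch size, and input file are placeholders rather than anything in Virtuoso. One thread parses and maps IRIs to IDs, one worker takes the mostly ascending GSPO inserts, and a few workers share the random-order PGOS inserts.

```python
import queue, threading

# Placeholder work functions; a real loader would parse RDF properly and do
# the actual index inserts. These names and the input file are made up.
def parse_quads(lines):
    for line in lines:
        g, s, p, o = line.split()
        yield g, s, p, o

def insert_gspo(batch): pass          # stand-in for the GSPO index insert
def insert_pgos(batch): pass          # stand-in for the PGOS index insert

BATCH = 10000
gspo_q, pgos_q = queue.Queue(maxsize=8), queue.Queue(maxsize=8)

def parser_thread(lines, iri_ids):
    """Map IRI strings to integer IDs and feed batches to both index queues."""
    batch = []
    for quad in parse_quads(lines):
        batch.append(tuple(iri_ids.setdefault(x, len(iri_ids)) for x in quad))
        if len(batch) >= BATCH:
            gspo_q.put(batch); pgos_q.put(batch); batch = []
    if batch:
        gspo_q.put(batch); pgos_q.put(batch)
    gspo_q.put(None); pgos_q.put(None)  # end-of-input markers

def index_worker(q, insert_fn):
    while True:
        batch = q.get()
        if batch is None:
            q.put(None)                 # let sibling workers see the marker too
            return
        insert_fn(batch)                # each worker would run its own transaction

workers = [threading.Thread(target=index_worker, args=(gspo_q, insert_gspo))]
workers += [threading.Thread(target=index_worker, args=(pgos_q, insert_pgos))
            for _ in range(4)]          # several threads share random-order PGOS
for w in workers:
    w.start()
parser_thread(open("quads.txt"), {})    # parsing runs on the calling thread here
for w in workers:
    w.join()
```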
It turns out that the new async queue plus thread pool construct is very handy for any pipeline or symmetric parallelization. When this is well tested, I will update the documents and maybe do a technical article about this.
Transactionality is not an issue in the bulk load situation. The graph being loaded will anyway be incomplete until it is loaded, other graphs will not be affected and no significant amount of locks will be held at any time by the bulk loader threads.
Also later, when looking at within-query and other parallelization, we have many interesting possibilities. For example, we may measure the CPU and IO load and adjust the size of the shareable thread pool accordingly. All SQL or web requests get their thread just as they now do, and extra threads may be made available for opportunistic parallelization up until we have full CPU and IO utilization. Still, this will not lead to long queries preempting short ones, since all get at least one thread. I may post some results of parallel RDF loading later on this blog.
]]>The last couple of weeks have been very busy, dealing with updates to the Virtuoso SQL and SPARQL compiler cost model.
The new SQL compiler takes samples of index population on demand, thus always works with up-to-date statistics. Further, when there are constant leading key parts, it can get an estimate of the selectivity of the constant criteria with a single lookup.
This is especially important for processing RDF. Since all triples go to one table unless otherwise declared, normal SQL statistics are not very useful for determining the join order for a SPARQL query. However, nearly always, SPARQL queries have a constant graph, constant predicate, sometimes constant subjects and objects. For example, using the index P, G, O, S, the compiler can know how many triples will have a given predicate within a given graph. This is done with a single lookup, without needing to count the actual triples, which would defeat the purpose. Also there is no need to do periodic statistics collection runs or to maintain counts of distinct combinations for multiple key parts. This makes for virtual certainty of getting reasonable join orders even for recently inserted or fast changing data sets.
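To illustrate counting from key positions rather than by scanning, here is a simplified sketch where a sorted Python list stands in for the index; in the real engine a single B-tree descent yields an estimate from page and position information rather than an exact count. With an index ordered on P, G, O, S, all rows with a given predicate and graph form one contiguous range, so its size falls out of the boundary positions.

```python
# Estimate how many rows match constant leading key parts of a (P, G, O, S)
# index without scanning them: locate the range boundaries with binary
# searches and subtract. A stand-in for a B-tree descent; not Virtuoso code.

from bisect import bisect_left, bisect_right

def matches_for_prefix(index_rows, p, g):
    """index_rows: list of (P, G, O, S) integer tuples kept in sorted order."""
    lo = bisect_left(index_rows, (p, g))
    hi = bisect_right(index_rows, (p, g, float("inf"), float("inf")))
    return hi - lo

rows = sorted([
    (1, 10, 100, 1000), (1, 10, 101, 1001), (1, 11, 100, 1002),
    (2, 10, 100, 1003), (2, 10, 102, 1004),
])
print(matches_for_prefix(rows, 1, 10))   # -> 2 triples with predicate 1 in graph 10
```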
This will be part of the next Virtuoso Open Source release, probably in the next couple of weeks. There will also be a technical article with examples of how the dynamic statistics feature helps with RDF queries.
]]>Virtuoso Open Source Edition has been updated to version 4.5.2. We have added a binary distribution for Windows and the source distribution comes with Visual Studio project files for building on Win32. The new source distribution is available from the Virtuoso wiki. Make sure that you get the distribution with version 4.5.2. Win64 support will be in the next update.
This release also enhances the SPARQL support with better inlining of SPARQL in SQL and other features and fixes.
]]>We have a new technical article, benchmarking Virtuoso on different hardware configurations.
This is useful reading for anyone interested in using Virtuoso as a database back end for online applications or simply anyone interested in relational database scalability, no matter what specific DBMS.
We use an adaptation of the well known TPC-C benchmark to see what hardware configuration will give the best price/performance. We also explain how to tune Virtuoso and how and why different parameters affect the throughput.
]]>I am Orri Erling, program manager for Virtuoso at OpenLink Software. This blog is about any and all aspects of technology that have to do with Virtuoso.
The launch of Virtuoso Open Source Edition (VOS) marks a new period in our participation in the database world. We will henceforth be much more active, publish much more material, have a faster release cycle and actively reach out to the various areas of the open source community.
We have years worth of demos, white papers, articles, a suite of Virtuoso based applications, and much more that we will be unveiling over the following months.
We will track different aspects of Virtuoso work on this and related blogs. In the middle term, we will talk about the following:
There is a whole suite of next generation file server features to be unveiled. These include items such as automatic metadata extraction and logical views on content based on its metadata, permissions etc.
In the immediate future, we will:
The VOS development CVS will be updated at high frequency, in some areas even weekly. Stable snapshots will be made available 3 or 4 times a year.
We will have a very exciting spring, with radically more participation in the database and open source worlds than ever. Look for frequent updates on this blog.
]]>Enjoy!
]]>Here are a few links that resolve any confusion about this matter:
Or simply Google for PHP and ODBC, or PHP and iODBC ...
]]>There are a whopping 44,000 SAP customers running on Oracle databases, and IBM wants them. To get them, for the first time ever, it's optimized its enterprise database for a specific vendor's applications. The new version of DB2, 8.2.2, will include a slew of SAP-optimized features, including self-tuning, self-configuration, silent install, dynamic storage allocation and more.
Wouldn't SAP be better served by simply making their application database independent via ODBC? This process really could have commenced years ago and prevented today's dilemma: Your Partner has become Your most aggressive Competitor!
SAP tuned specifically for DB2, or SAP tuned likewise for Microsoft SQL Server, simply reeks of: "Same Sh*t, different Pile". Microsoft and IBM will emulate Oracle in due course regarding their assault on SAP's market if DBMS specificity remains the SAP data access API strategy (this is a simple fact).
SAP should be using its quest for DBMS independence to stimulate or contribute ODBC enhancements (should ODBC be lacking in areas critical to its application needs; it is available in Open Source form and across all major platforms). Should the ODBC API not be the problem, then it can push ODBC Driver vendors (DBMS vendors such as IBM included) to get their Drivers in shape (should they be lacking, I know our ODBC Drivers are absolutely fine for this kind of task).
Database specificity gets application vendors nowhere. You can only control your business development destiny by being database independent. When applications are database independent, the intellectual capital that drives your applications is preserved. This is akin to building physical and logical firewalls around the ecosystem created by your products. This is much better than being a pseudo DBMS engine reseller for a future competitor.
]]>
Advertising in RSS is just starting now, for all practical purposes. If we wanted to, as an industry, reject the idea, we could.
Here goes:
Blog Editing
I can use any editor that supports the following Blog Post APIs:
- Moveable Type
- Meta Weblog
- Blogger
Typically I use Virtuoso (which has an unreleased WYSIWYG blog post editor), Newzcrawler, ecto, Zempt, or w.bloggar for my posts. If a post is of interest to me, or relevant to our company or customers I tend to perform one of the following tasks:
- Generate a post using the "Blog This" feature of my blog editor
- Write a new post that was triggered by a previously read post etc.
Either way, the posts end up in our company wide blog server that is Virtuoso based (more about this below). The internal blog server automatically categorizes my blog posts, and automagically determines which posts to upstream to other public blogs that I author (e.g http://kidehen.typepad.com ) or co-author (e.g http://www.openlinksw.com/weblogs/uda and http://www.openlinksw.com/weblogs/virtuoso ). I write once and my posts are dispatched conditionally to multiple outlets.
RSS/Atom/RDF Aggregation & Reading
I discover, subscribe to, and view blog feeds using Newzcrawler (primarily), and from time to time for experimentation and evaluation purposes I use RSS Bandit, FeedDemon, and Bloglines. I am in the process of moving this activity over to Virtuoso completely due to the large number of feeds that I consume on a daily basis (scalability is a bit of a problem with current aggregators).
Blog Publishing
When you visit my blog you are experiencing the soon to be released Virtuoso Blog Publishing engine first hand, which is how WebDAV, SQLX, XQuery/XPath, and Free Text etc. come into the mix.
Each time I create a post internally, or subscribe to an external feed, the data ends up in Virtuoso's SQL Engine (this is how we handle some of the obvious scalability challenges associated with large subscription counts). This engine is SQL2000N based, which implies that it can transform SQL to XML on the fly using recent extensions to SQL in the form of SQLX (prior to the emergence of this standard we used the FOR XML SQL syntax extensions for the same result). It also has its own in-built XSLT processor (DB Engine resident), and validating XML parser (with support for XML Schema). Thus, my RSS/RDF/Atom archives, FOAF, BlogRoll, OPML, and OCS blog syndication gems are all live examples of SQLX documents that leverage Virtuoso's WebDAV engine for exposure to Blog Clients.
Blog Search
When you search for blog posts using the basic or advanced search features of my blog, you end up interacting with one of the following methods of querying data hosted in Virtuoso: Free Text Search, XPath, or XQuery. The result sets produced by the search feature use SQLX to produce subscription gems (RSS/Atom/RDF). The blog home page exists as a result of Virtuoso's Virtual Domain / Multi-Homing Web Server functionality. The entire site resides in an Object Relational DBMS, and I can take my DB file across Windows, Solaris, Linux, Mac OS X, FreeBSD, AIX, HP-UX, IRIX, and SCO UnixWare without missing a single beat! All I have to do is instantiate my Virtuoso server and my weblog is live.
]]>I also hope that Oracle will support Mono off the bat, rather than taking the typical "we will port to Mono sometime in the future..." type message, which will not be acceptable, especially as we pulled this off first time around in 2002 (atop Mono, even then). Thus, I am sure they can do it in 2005 :-)
Hopefully we should be able to add Oracle 10g Release 2 and DB2 to our SQL CLR hosting features comparison document that currently only covers SQL Server 2005 and Virtuoso.
]]>
Why Is Every Information Leak Worse Than Originally Thought? While there have been an incredible number of stories about data leaks over the past couple of months, one interesting thing is that in so many cases, the companies involved later come out and admit that the problem was much worse than they first admitted. That happened with ChoicePoint and LexisNexis, who both had to come out a second time and admit that the original data breach they discussed wasn't as limited as they had believed. The latest is that the DSW Shoe Warehouse database that was stolen included information (including credit cards) on many, many more people than originally stated. So rather than 100,000 credit cards out there, we're talking 1.4 million. What's unclear, however, is why this is happening. Is it that these companies are so clueless and unable to manage their own data that they don't realize how badly they've leaked data until they do further investigations? Or is that the companies are still trying to hide the nature of the losses until later (maybe spreading them out a bit)? Either way, you'll notice that no one ever seems to correct the damages in the other direction...
The Internet Archive initiative is building up an amazing collection of content that includes this "must watch" movie about the somewhat forgotten hypercard development environment.
As I watched the hypercard movie I obtained clear reassurance that my vision of Web 2.0 as critical infrastructure for a future Semantic Web isn't unfounded. The solution building methodology espoused by hypercard is exactly how Semantic Web applications will be built, and this will be done by orchestrating the componentary of Web 2.0.
When watching this clip make the following mental adjustments:
Web 2.0 is a reflection of the web taking its first major step out of the technology stone age (certainly the case relative to the hypercard movie and "pre web" application development in general).
]]>
What You'll Wish You'd Known Paul's advice to high school students.
It finally dawned on me what OpenSearch does. Basically you tell it about different search engines by showing it how to query something in each, and get back an RSS return. Then when you search for some term, say foo+bar, it performs the search in all the engines you have configured it for. So it's a way to group a bunch of search engines together and command them all to look for the same thing. It is clever. It is something that hasn't been done before, to my knowledge. That's the good news. The bad news is that Amazon is a leading patent abuser. So as good as this idea is, it's bad for all the rest of us, unless they tell us that they're granting us some kind of license to use the idea. [via Scripting News]
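A rough sketch of that kind of meta-search in Python; the engine URL templates are placeholders, not real OpenSearch descriptions. Each engine is registered as a URL template that returns RSS, the same terms go to every engine, and the items are merged into one result list.

```python
# Query several "engines" that answer ?q=... with RSS and merge the items.
# The URL templates below are hypothetical placeholders, not real endpoints.

import urllib.parse, urllib.request
import xml.etree.ElementTree as ET

ENGINES = {
    "engine-a": "https://search.example.com/a?q={terms}&format=rss",
    "engine-b": "https://search.example.org/b/rss?query={terms}",
}

def search_all(terms):
    results = []
    quoted = urllib.parse.quote_plus(terms)
    for name, template in ENGINES.items():
        url = template.format(terms=quoted)
        with urllib.request.urlopen(url) as resp:
            feed = ET.parse(resp)
        for item in feed.iterfind(".//item"):        # RSS 2.0 items
            results.append((name,
                            item.findtext("title"),
                            item.findtext("link")))
    return results

for engine, title, link in search_all("foo bar"):
    print(engine, title, link)
```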
When putting together a post yesterday about "Virtualization", I instinctively looked to Gurunet's "answers.com" service for information on the subject: Enterprise Information Integration (EII). Woe and behold! Here is what I found at the tail end of the answers.com article on this subject:
Now, I knew this was Wikipedia content repurposed by "answers.com", and I proceeded to clean up the article. The wikified article took a while to complete, because, true to the "Wikipedia" ethos, I had to contribute knowledge as opposed to the original weenie marketing gunk. It's naturally easier to cut and paste marketing fluff for a misguided quick win attempt than it is to embed links, add knowledge, and discern Wiki Markup (but "Wiki" don't play that!).
This little exercise has broader implications for marketing as a whole, especially for the IT sector. The end of days for "Misinformation based Marketing" is nigh! Wikis, Blogs, Search Engines, Web Services, and Social Networking are rapidly destroying the historically prohibitive costs associated with customer pursuit of facts.
I am very confident that product quality will soon overshadow market share as the key determinant of product selection on the part of customers (this is no longer a pipe dream!). I also have increased hope that IT product development and associated product marketing by technology vendors will veer in the same direction.
]]>The article discusses most of the key issues, but it should also have included and discussed the following question: "Should Microsoft benefit from the mess that we let them create?" By "we" I mean the extensive pool of Microsoft product consumers, developers, and partners, etc.
I have worked with Microsoft products (as a developer and user) for more years than I would like to remember; I have personally experienced the journey from Windows 2.0 to Windows XP (and played around with Longhorn).
I added my question to this dialog because, without its resultant perspective, history will simply repeat itself. If IT technology decision makers don't change their product selection and acquisition habits, then why should Microsoft or any other vendor change their ways? Especially when a perpetual promise, under-deliver, re-promise cycle works absolutely fine. This isn't rocket science, it's basic common sense (but we know that common sense ain't that common).
Microsoft, like most software companies, seeks significant portions of its revenue growth from product upgrades. In a sense, this inherently implies that these products will always be millions of miles away from the "silver bullet" promises espoused in the pre-release marketing and PR hype. Sadly, there was a time when marketing and PR hype used to be about new features; a time when there was a clear line between a new feature and a fundamental product bug.
Buying products from any company simply because they have the largest market share is dumb! All it does is encourage other vendors to focus on product market share rather than product quality, which ultimately results in the following:
Microsoft isn't a unique source of this problem, but hey! They are the largest software company (the one with the vital market share), their software products are on some 80-90% of desktops on this planet, and the planet isn't at its most productive at the current time. No matter how you look at it, this loss of productivity has something to do with the increased nuisance of desktop computing.
If Microsoft could just focus on its core competence (BTW, I can't quite pinpoint this anymore, since they are in every software market that exists today), it would at least have an iota of a chance in hell of cleaning up this mess.
]]>
The Information Machine Check out this charming movie from the late 50's, developed for the IBM Pavilion at the 1958 World Fair in Brussels.
It's been a while since I've seen punched cards (which reminds me, I still have the first program I'd ever written, on punched cards written for the IBM 1130).
Google Pollutes Links Stream With Evil Precedent For Market Censorship
AMD set to detail multi-OS plan Will its "Pacifica" virtualization technology be compatible with Intel's? If not, that's a potential headache for some software makers.
Udell to event promoters on leveraging folksonomy: 'Pick a tag' I'm now trying to figure out why InfoWorld's Jon Udell is a journalist and not a millionaire technologist (or maybe he is). Udell keeps coming up with one brilliant idea after another. The first of these -- which I thought was just plain obvious -- was Udell's idea for vendors ...
I do know Jon (albeit primarily via emails and phone interviews), he even put me forward for an innovators award in 2003 re. Virtuoso etc.
Great Business Strategy or Dumb Luck Interesting read here today at ZDNet -- Open Solaris and strategic consequences. Here's a bit of the conclusion:
Friendster befriends blogs--and fees Two Web trends converge as the social networking site prepares to launch blogs through partnership with Six Apart.
The coming crackdown on blogging Federal Election Commissioner Bradley Smith says that the freewheeling days of political expression on the Internet may be about to end.
Today is one of those days where one topic appears to be on the mind of many across cyberspace. You guessed right! It's that Web 2.0 thing again.
Paul Bausch brings Yahoo!'s most recent Web 2.0 contribution to our broader attention in this excerpt from his O'Reilly Network article:
I browse news, check stock prices, and get movie times with Yahoo! Even though I interact with Yahoo! technology on a regular basis, I've never thought of Yahoo! as a technology company. Now that Yahoo! has released a Web Services interface, my perception of them is changing. Suddenly having programmatic access to a good portion of their data has me seeing Yahoo! through the eyes of a developer rather than a user.
The great thing about this move by Yahoo! is twofold (IMHO):
The great thing about the Platform-oriented Web 2.0 is the ability to syndicate your value proposition (aka products and services) instead of pursuing fallible email campaigns. It enables the auto-discovery of products and services by user agents (the content aspect). Web 2.0 also provides an infrastructure for user agents to enter into consumptive interactions with discrete or composite Web Services via published endpoints exposed by a platform (the execution aspect).
A scenario example:
You can obtain RSS feeds (electronic product catalogs) from Amazon today, although you have to explicitly locate these catalog-feeds since Amazon doesn't exploit feed auto-discovery within their domain.
If you use Firefox or another auto-discovery-supporting RSS/Atom/RDF user agent, visit this URL; Firefox users should simply click on the little orange icon at the bottom right of the browser's window to see its RSS feed auto-discovery in action.
Anyway, once you have the feeds the next step is execution endpoints discovery within the Amazon domain (the conduits to Amazon's order processing system in this example). At the current time there isn't broad standardization of Web Services auto-discovery but it's certainly coming; WSIL is a potential front runner for small scale discovery while UDDI provides a heavier duty equivalent for larger scale tasks that includes discovery and other related functionality realms.
Back to the example trail: by having the RSS/Atom/RDF feed data within the confines of a user agent (an Internet Application, to be precise), nothing stops the extraction of key purchasing data from these feeds, plus your consumer data, en route to assembling an execution message (as prescribed by the schema of the service in question) for Amazon's order processing / shopping cart service. All of this happens without ever seeing/eye-balling the Amazon site (a prerequisite of Web 1.0, hence the dated term: Web Site).
To summarize: Web 2.0 enables you to syndicate your value proposition and then have it consumed via Web Services, leveraging computer, as opposed to human, interaction cycles. This is how I believe Web 2.0 will ultimately impact the growth rates (in most cases exponentially) of those companies that comprehend its potential.
]]>Payroll hole exposes dozens of companies Flaw in PayMaxx Web site exposed the financial information of customers' workers, the payroll-services firm acknowledges.
It is clear that in comparison to the Web of the last century, the nature of data on the Web later in this decade will be very different in the following aspects:
- Volume of data is growing by orders of magnitude every year.
- Multimedia and sensor data are becoming more and more common.
- Spatio-temporal attributes of data are important.
- Different data sources provide information to form the holistic picture.
- Users are not concerned with the location of a data source, as long as its quality and credibility are assured. They want to know the result of the data assimilation (the big picture of the event).
- Real-time data processing is the only way to extract meaningful information.
- Exploration, not querying, is the predominant mode of interaction, which makes context and state critical.
- The user is interested in experience and information, independent of the medium and the source.
Effectively, the nature of the knowledge on the Web is changing very fast. It used to be mostly static text documents; now it will be a combination of live and static multimedia, including text, data and documents with spatio-temporal attributes. Considering these changes, can the search engines developed for static text documents be able to deal with the needs of the Web? [via E M E R G I C . o r g]
No, but this doesn't render them useless since we wouldn't be at this point without the likes of Google, Yahoo! et al. But building upon the data substrate that web data oriented search engines provide is where the next batch of Information access and Knowledge discovery solutions will carve out their space. The symbiotic relationship between Google (data) and Gurunet's Answers.com (Information and Knowledge) is one interesting example.
The Web is a distributed collection of databases that implement a variety of data storage models but are commonly accessible via protocols that rely on HTTP for transport (in-bound and out-bound messages) services. These databases increasingly use well-formed XML for query result (data contextualization) persistence and URIs for permanent reference. "What Database?" you might ask. "What you once called your Web Site, Blog, Wiki, etc.," my timeless reply.
When you have the database that I describe above, and a collection of entry points from which discrete or composite Web Services can be invoked available from one or more internet domains, you end up with what I prefer to call "Web 2.0" presence, or what Richard McManus describes as: "The Web as a Platform".
Here is a collection of posts I have made in the past relating to Web 2.0, note that this list is dynamic since this blog is Virtuoso based (predictably):
Free Text Search with XHTML results page (with Virtuoso generated URIs for RSS, Atom, and RDF): http://www.openlinksw.com/blog/search.vspx?blogid=127&q=web+2.0&type=text&output=html
It's also no secret that I believe that Virtuoso is a bleeding edge Web 2.0 technology platform (and more..). The URIs that I am exposing provide the foundation layer for other complementary Web initiatives such as the Semantic Web (Web 2.0 provides infrastructure for the Semantic Web, as time will show). They are also completely usable outside the realm of this blog.
BTW - Jon Udell is writing, experimenting with, and demonstrating similar concepts across feeds within his Web 2.0 domain.
These are indeed fun times!
]]>
Fred Wilson writes:
I was talking to an entrepreneur today and advised him not to surrender to "analysis paralysis". It's tempting to want to analyze every option and figure out exactly the best approach before jumping in.
But it's the wrong way to go in most cases.
As a contrast, I attended a board meeting today where the CEO presented the board with a post-mortem on some decisions he made that turned out to be suboptimal. That was a stand up thing to do and the board appreciated it. But I am not sure that the CEO in question did the wrong thing.
Because I believe that Teddy Roosevelt (one of my favorite Presidents) had it right when he said: "In any moment of decision the best thing you can do is the right thing, the next best thing is the wrong thing, and the worst thing you can do is nothing."
I think action and risk taking is what separates great entrepreneurs from the pack. I am not advocating blind risk taking, but I am advocating making a decision based on less than perfect information and going for it. More often than not, you will be rewarded for doing that.
Have RSS feeds killed the email star? silicon.com Feb 28 2005 12:58PM GMT
DB2 users of PeopleSoft and IBM (the DB2 developer and vendor) suspect that Oracle will obviously try to use its ownership of PeopleSoft to covertly coerce DB2 users into becoming Oracle DBMS users. This strategy would take the form of new features and fixes discrimination as somewhat echoed in these excerpts:
"..In the crescendo surrounding the Oracle-PeopleSoft merger, one question has been repeatedly drowned out: What happens to users of PeopleSoft's DB2 database? Oracle chief Larry Ellison has repeatedly assured DB2 users--and IBM--that Oracle will continue to support DB2 and PeopleSoft's interfaces to IBM's WebSphere platform. But IBM isn't taking any chances, announcing an initiative to alter DB2 to work with products from Oracle rival SAP."
"..IBM has good reason to be concerned. Oracle vies with SAP as the leading vendor for enterprise applications, but it's under pressure to show concrete benefits from the merger by combining assets and pumping up revenue. One obvious tactic will be to use the PeopleSoft applications to steer enterprise customers toward the Oracle database by optimizing performance and features toward the Oracle back end."
If PeopleSoft's application core was ODBC based, the vulnerability to this predictable competitive tactic would at the very least be significantly alleviated. DB2 end-users and IBM the product vendor would have a much stronger basis for countering Oracle by taking them to task about their claimed inability to implement new application functionality enhancements against DB2, etc., especially as this would have morphed into a generic database issue as opposed to a DB2-specific issue -- by virtue of the application and data access layer separation provided by ODBC's architecture.
]]>
Anyway, back to cognitive dissonance. Could this be the reason for the following?
And more...
]]>When XQuery first came across my radar (late 90s even before "XQuery" became the moniker for an XML Query Language) I arrived at the following conclusions using the steps listed above:
As indicated in an earlier post: IBM is clearly validating what we have done with Virtuoso (as was the case initially with their Virtual / Federated DBMS initiative ala DB2 Integrator). Here is an excerpt from today's eWeek article supporting this position:
To achieve maximum XML performance, bolstered indexing attributes in the technology will enable advanced search functions and a higher degree of filtering. IBM is also adding support for XPath and XQuery data models. This will allow users to create views that involve SQL and XQuery by sending the protocol through DB2's query optimizer for a unified query plan.
Virtuoso has been doing this since 2000; unfortunately a lot of
]]>Heterogeneous Joins Heterogeneous joins sound complicated if not obscene. And the latter is what representatives of a major software development shop thought when asked about their development tools ability to support heterogeneous joins. "Huh .. who in the heck would want to do that?" was implied if not explicitly stated. Well, now that EAI-Enterprise Application Integration and M&A-Mergers and Acquisitions are all the rage again, my bet is that more developers have the need ...
..So, using relational storage is inadequate for one reason or another, and IBM has concluded that another approach is necessary. The company's next generation database will therefore have two storage engines: one relational store and one native XML store. And let me be quite clear about this: these engines will be completely separate, with separate tablespaces, separate indexes (Btrees and so forth on the one hand, and hierarchical on the other), and so on...
Hold on here! IBM only recently released DB2 upgrades (ala Stinger) and newer versions of DB2 Integrator (once a Virtual Database and now an Integration Platform), with both of these products leveraging output from IBM's Xperanto project. It's a little mind-boggling to me that IBM is finally acknowledging the concept of multiple engines in a single DBMS server.
I wonder what will happen when they separate the SQL and XML storage engines, and then realize that there is also a need to make Web Services (SOAP, WSDL, UDDI, WS-*), BPEL orchestration, WebDAV, HTTP, and other critical protocols part of a single server offering? Will they call such a beast a "Universal Server"?
I wonder :-)
]]>
Amazon's Invisible Innovations Fortune Nov 11 2004 9:42PM GMT
..He told the crowd that most of Amazon's improvements over the years had not even been directly visible to users of its site. "About 90% of innovation has been on the back end," he explained.
The company has five giant fulfillment centers in the U.S. alone, each of which is 600,000 to 700,000 square feet. It is able to get any two items in those facilities, however disparate, into the same shipping box with great efficiency, using software and processes it developed entirely itself. (Of course, it doesn't always choose to pack, say, a microwave and a music CD together. But it can if it wants to.) Another innovation we've never seen as customers (except in improved service): Now 28% of the company's products are "drop-shipped" by companies other than Amazon, from their own warehouses or affiliated stores. He explained how hard the company works to ensure that these third parties provide service that seems to come from Amazon. (Amazon provides them its own boxes to complete the illusion of a seamless fulfillment system.) He believes that what the Internet is best at distributing is low margin products to a small group of customers. Those are exactly the kinds of things that, pre-Web, were so difficult to find and acquire. "People use Amazon disproportionately to find things they couldn't find any other way," he said.
Really?
Amazon is exploding because it is a poster child for the power of Web 2.0; unfortunately this doesn't seem to be that visible to Jeff :-)
Amazon's Web Services initiatives are creating a situation where computer programs (web services consumers) are iteratively consuming Amazon's core value proposition. The exponential growth comes from the obvious: computers are faster than human beings (especially when eyeballs are the prime invocation tool used by the beings in question) at processing anything, and this includes value consumption.
]]>]]>
The site has an RSS feed (but not RSS auto-discovery); its data is available in Excel format (why not XML? This is really a "Save As" issue these days from Excel).
Great site! But it could even be better if XML was used as the data format as opposed to Excel. It would then become a major data source for a myriad of innovative XML data consumption and repurposing demos etc.
]]>The other day I was
]]>Speaking of Channel 9, today we put up a video of Robert Green, of the Visual Basic team. He demos the new data features in the next version. Cool stuff. I've been noticing a trend that our viewers seem to like demos and tours. So, I'll try to get more of those up.
That reminds me, would it be interesting for the five guys on the Channel 9 team to give you a walking tour of Microsoft's main campus?
..
If this "Internet Operating System" and Web 2.0 stuff is really happening, I think I've just found the filesystem we'll all be using--in one form or another.
Nigel's 10 commandments are listed below (do read the complete article for perspective):
Funnily enough (as is the case with the old and new biblical testaments), you can condense the 10 commandments into one. In the case of the bible, the 10 commandments have become the single tenet: Love Thy Neighbor as Thy Self.
If you treat others as you would like to be treated, then you would never violate any of the original 10 commandments in the first place.
In the case of the Real-Time Enterprise, I believe there is really only one commandment for the commercial enterprise (to be specific): Attain leadership in your chosen market place.
If you understand that this is the basis of any commercial enterprise (no CEO worth his/her salt aims to finish second in their market place), then investments in IT will be increasingly oriented towards this goal. Thus, the items listed in the 10 commandments will simply become second nature en route to realization of the Real-Time Enterprise vision.
Reality check! The old testament and new testament address different eras in history (primitive and less primitive times respectively) relative to current times. In a sense I think the same applies to IT (the message has to match the time). We are still in the later stages of the old testament with specificity that is perceived to be unavoidable across the following realms:
So maybe we do need the 10 commandments after all, as the message needs to be simpler in these primitive IT times :-)
Looking forward, I see Web 1.0 (the first coming of the Web) as duly playing the role of John the Baptist (we know what happened to him), and the real thing being Web 2.0+ (a web- and internet-specific spin on the Real-Time Enterprise vision), which is currently pretty much in its infancy (I wonder who King Herod is, Hmm..).
]]>
This
]]>]]>In section 4.1, Human-friendly Syntax, you say
I have little to add to this matter as our
]]>By David Mertz, IBM developerWorks
In Part 2 of a serial article on GUIs and XML configuration data, David discusses how XML is used in the configuration of GUI interfaces. He looks at Mozilla's XML-based User Interface Language (XUL) which allows you to write applications that run without any particular dependency on the choice of underlying operating system. This may seem strange at first, but you'll soon see that this Mozilla project offers powerful tools for GUI building that allow you to develop for an extensive base of installed users. Mozilla is now much more than a browser: it is a whole component and GUI architecture. Indeed, Mozilla is more cross-platform and more widely installed on user systems than probably any other GUI library you are likely to consider. What you might think of as general purpose GUI/widget libraries -- Qt, wxWindows, GTK, FOX, MFC, .NET, Carbon, and so on -- have various advantages and disadvantages. But none of them can be assumed to be already installed across user systems. Many of them are only available on a subset of the platforms Mozilla supports, and most are relatively difficult to install or have licensing issues. Mozilla is worth installing just because it is such a great browser; once you have it, you have a free platform for custom applications. To be completely cross-platform in your Mozilla/XUL applications, you need to restrict yourself to configuring GUIs in XUL and programming their logic in JavaScript.
http://www-106.ibm.com/developerworks/library/x-matters35/
See also XUL References: http://xml.coverpages.org/xul.html
]]>Their manifesto as presented on their web site:
"OSVB is an independent and open source database created by and for the community. Our goal is to provide accurate, detailed, current, and unbiased technical information".
They also have an XML-RPC based service for programmatic interaction at: http://www.osvdb.org/xmlrpc-server-client-documentation.php
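Calling such a service from a script would look something like the following sketch with Python's standard XML-RPC client; the endpoint path and the method name are hypothetical, so check the documentation linked above for the real interface.

```python
# Hedged sketch of calling an XML-RPC service from Python. The endpoint path
# and the method name/arguments are placeholders, not the documented OSVDB API.

import xmlrpc.client

server = xmlrpc.client.ServerProxy("http://www.osvdb.org/xmlrpc-server.php")

try:
    # Hypothetical method: look up vulnerability entries by product name.
    for entry in server.search_by_product("apache"):
        print(entry)
except xmlrpc.client.Fault as fault:
    print("server returned a fault:", fault.faultString)
```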
]]>As I completed this post a bell went off! Why not use this for a quick live demo of Virtuoso's hosting capabilities? Starting off with something simple like PHP for instance?
So, I quickly did the following:
voila!
PHP version of program running from a Virtuoso Server on Windows.
Ditto on Linux.
More to follow....
]]>
So we all buy and deploy copies of InfoPath, and then get rid of our non SQL Server and ACCESS databases? Wow!
How about InfoPath emitting XForms compliant forms? Even better, what about becoming a full blown XForms Engine (player)?
Customer demand for a ubiquitous InfoPath runtime
The last time I asked Microsoft why there's no plan to make the InfoPath runtime ubiquitous, the answer I got was: "We don't hear customers asking for it." Well, I do. Here's a typical rant from one customer who, because his company has a relationship with Microsoft that he doesn't want to jeopardize, asked me to anonymize his comments:
I believe a primary requirement of a forms application is to make it possible for the form to be completed by a wide audience of people from whom I wish to gather data. A key driver, at least in the world of my customers, is to be able to distribute the form widely to people who aren't necessarily connected to the network and get them to fill it in and return it. I don't want to authenticate these people in my network. They won't install software on their computers just to fill out my form. They don't want to learn a new application.
It seems InfoPath has completely ignored the question of how the form will actually be filled in by the responder. There is no free viewer as there is with Adobe Acrobat. There is no ability to save the form template as an ASP.NET web form. It appears that Microsoft expects everyone to purchase a full copy of InfoPath--the complete form design application--just so they can fill out a form. They can't possibly believe the product will gain any traction with this licensing and deployment model, can they? [1] What are they thinking? [2]
So my main question is, is there any way to deploy InfoPath forms without putting full InfoPath on every desktop? [3] Do you know whether Microsoft understands this issue and are planning anything to address it? [4] The two applications that are widely available on everyone's desktop are a web browser and Adobe Acrobat, and it seems like it would be a good idea for InfoPath to support forms deployment via one of those means. Am I missing something here? [5] My answers were "I don't know" [1], "I don't know" [2], "No" [3], "Apparently they don't see a problem and aren't planning to do anything" [4], and "We're in the same boat: I don't get it either." [5]
By Lars Marius Garshol, Ontopia Technical Report
Information Architecture is the discipline dealing with the modern version of this problem: how to organize web sites so that users actually can find what they are looking for. Information architects have so far applied known and well-tried tools from library science to solve this problem, and now topic maps are sailing up as another potential tool for information architects. This raises the question of how topic maps compare with the traditional solutions. The paper argues that topic maps go beyond the traditional solutions in the sense that it provides a framework within which they can be represented as they are, but also extended in ways which significantly improve information retrieval. The paper tries to show that topic maps provide a common reference model that can be used to explain how to understand many common techniques from library science and information architecture.
http://www.ontopia.net/topicmaps/materials/tm-vs-thesauri.html
See also (XML) Topic Maps: http://xml.coverpages.org/topicMaps.html
The following excerpt is from the industry hall of fame article about Dan:
Bricklin is the father of the modern PC business software marketplace. The old master of the strange mix of art, science and commerce that is software development. It is no mistake Bricklin's first company was called Software Arts.
Besides creating the spreadsheet, Bricklin developed or played a role in the development of the early Digital Equipment Corp. word processor; a PC demo program that is still considered a breakthrough in corporate America for its prototyping capabilities; a pen-based spreadsheet that was 10 years ahead of its time; a snazzy, software print utility; and, most recently, a product, Trellix, that makes it easier to create, post and edit Internet documents.
Dan has moved on from Trellix.
Whether it is attempting to buy a Mig jet fighter, building a $40-million house modeled on a medieval Japanese village or turning the industry on its head with his latest idea, Ellison is the antithesis of the gray-suited execs or antiseptic yuppies that seem to proliferate in Silicon Valley. What better man to start the database industry? Or the relational database industry, to be exact.
There always have been databases, of course. But they were unwieldy, hierarchical, flat-file-based creatures that depended on a team of programmers to extract meaningful information.
Ted Codd, an IBM Corp. researcher, had published a seminal paper in 1970 describing a "relational database" whereby data was separated out from applications and arranged in tables and columns and could be queried and joined through a variety of dimensions (the 12 rules of Codd). The new database described would, for example, allow queries into sales of a product by region sorted by month, without having to write a separate program.
Codd's paper, heavy on algebraic formulas, did not exactly set the industry on fire. It was six years before IBM and a team at Berkeley decided to start building a relational database.
It might have been longer still before a product was available, if not for Ellison and a company he started called Relational Software Inc. (RSI).
]]>"The enormity of the impact that this company has had on the way everything that is printed is produced cannot be measured," said Christopher Galvin, an analyst with Hambrecht & Quist LLC in San Francisco. And Geschke himself points out that this sphere of influence now includes Hollywood, television and, of course, the Internet.
Armed with his childhood penchant for disassembling the family's appliances and a trio of college degrees, he wound up at Xerox Corp.'s famed Palo Alto Research Center (PARC), a breeding ground for inventions that seemingly made billions for everyone except Xerox. He hired Warnock in 1978 and in 1980 founded PARC's Imaging Sciences Laboratory with the mission of marrying computer technology to Xerox's legacy printing products. The duo's Interpress page-description language became Xerox's internal standard, but the company refused to license it to others.
Frustrated with the inability to publicly showcase their creation, Geschke and Warnock left PARC and started Adobe in 1982, naming the fledgling company after the creek running behind Warnock's house. The original mission, Warnock recalled, was to go into a service business, "kind of like what Kinko's is today."
Kahn, former chairman and chief executive of Borland International Inc., first shook the industry like a gale force wind in 1984 with SideKick for $49. He took on what he called the software robber barons who overcharged for their software. Later, he applied similar pricing principles and guerrilla marketing to languages, compilers and spreadsheets.
Today, technology areas that catch Stonebraker's eye include wireless and data integration on the Web.
Started Ingres project in early 1970s at Berkeley to develop relational databases. Ingres Corp. formed in 1980.
Another Berkeley project, Postgres, yielded object relational databases and spawned Illustra Information Technologies in 1992.
Became Informix's CTO in 1996, holding that post until September 2000.
Launched Cohera, a maker of federated databases, in 1999, based on a Berkeley research project, Mariposa.
Today was a very good day. Busy, busy, busy. To start things off, the SEC filing for my purchase of shares in Mamma.com hit the tape.
I think mamma.com has that potential. It's not Google or Yahoo, nor will it be a top 5 search engine anytime soon. But it is a good metasearch tool that I use and have used. Google and Yahoo have become carbon copies of each other, and for me, other than usenet and news searches, it's too big. I like the way Mamma.com organizes websearches, and I use it for picture searches. I'm not going to make a big investment in a company just because I use its product. I invested in the company because it generates cash. I'm not into PE ratios, Price to Sales, etc., etc. I'm into good ole fashioned cash.
The company has a simple business proposition: sell its web traffic and keep expenses very low. As long as it can continue to grow its traffic and keep costs down, it will do what I expect of it -- put money in the bank at a rate of 15 pct or more of sales.
Hopefully, I will be able to help it along by cross-promoting it with other businesses I have, and providing technical and marketing support for their management team. Nothing in the business world is a sure thing, and please don't invest in this company because I did, but I obviously like the company's prospects.
[via Blog Maverick]
The search engine war between Google and MSN is generating some nasty tactics reminiscent of the Microsoft vs. Netscape battle of the mid-'90s. Those who remember that battle will recall the almost surgical methods used by Microsoft to all but destroy Netscape. Today, Netscape is a shell of its former self, kept in a dull corner of the Time Warner empire and denied the attention or funding it needs to reemerge as a viable entity in the browser market. Many will also remember that the tactics Microsoft used to destroy Netscape generated years of anti-trust litigation and almost led to the break-up of the world's richest corporation and largest software maker. At the end of the day, of course, Microsoft got off with a wrist slap and the knowledge that the US Government will not kill a goose that lays golden eggs (and whose products run much of the national infrastructure). Microsoft is obviously feeling free to resort to some of its old tricks, and the search engine wars are about to go mainstream, possibly becoming public entertainment. Remember the film Pirates of Silicon Valley? This script promises to be even more interesting.
Search is the fastest growing sector of the Internet and the advertising industry. Currently a $2-2.5 billion industry, search and search technology are expected by industry experts to generate over $8 billion per annum by 2007. As a yardstick to measure by, the logging industry in British Columbia is valued at approximately $5 billion per year. Search, in other words, is a serious global business that is projected to generate staggering revenues and growth over the next half-decade. That much money tends to generate a great deal of motivation.
According to yesterday's New York Times, Microsoft has officially turned its great eye on Google and is specifically targeting Google and its employees. Microsoft recruiters are said to be calling Google staff at home, telling them that MSN's new search tool will bury Google and that they had better defect north to Redmond, Washington, as soon as possible, before their jobs and soon-to-be stock options are worthless. Executives from both companies were seen watching each other like hawks at last week's World Economic Forum in Davos, Switzerland. Wherever a Google representative went, an MSN exec was steps behind, and vice versa. Meanwhile, back in the United States, Microsoft employees are examining Google patents looking for potential weaknesses to exploit. Microsoft is obviously playing for keeps and appears to be preparing to head off the inevitable legal battles that will stem from the introduction of Microsoft's new operating system, Longhorn, currently in development and scheduled for release early next year.
Okay, it turns out that I was less wrong than I thought a little while ago. I'd like to quote an article on Instant Messaging Planet here:
"Since 1999, when AOL served 100 percent of IM users, AOL confronted two major new IM entrants, Yahoo! and Microsoft, as well as numerous smaller entrants," the application continues, citing figures from industry researcher Media Metrix, now part of comScore Networks. "As a result, AOL has experienced a substantial decline in its IM share. Its share of unduplicated, all-location users has fallen from 100 percent to 58.5 percent in just three and one-half years."
There we have it. AOL is a bit over half the IM market. That means Yahoo and Microsoft probably have something close to 25% each. Those numbers are from April 2003, so it's anybody's guess as to which direction they've gone since then.
Thanks to Jim for the pointer to newer stats.
Update: He also IM'd me a CNet article from August which says:
Although AOL's AIM and ICQ together make up the largest IM network, MSN and Yahoo are making strides. In March 2003, AIM had 31.9 million unique users while ICQ had 28.3 million, according to ComScore Media Metrix. MSN Messenger reached 23.1 million unique users while Yahoo Messenger reached 19 million. Both Microsoft and Yahoo launched IM clients with virtually zero market share.
So there we go. It's really a four horse race.
Another Update: Based on the international feedback rolling in, it would seem that the "A" in "AOL" really does mean America. The Microsoft Monopoly is indeed strong overseas. Interesting.
I have some commentary relating to what is currently achievable re. XQuery in response to Don's blog post which is excerpted below:
Dare just replied to David Orchard's initial missive on remoting XQuery.
I don't think either party is giving security sufficient weight.
How many web sites got hacked by SQL insertion attacks?
SQL insertion attacks were primarily SQL Server based, and even in that case, fronting SQL Server with ODBC drivers equipped with security enhancements would have prevented them; but unfortunately FREE rules, and does so until the ultimate realization that there are "No Free Lunches".
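As an aside, here is a minimal illustration of why parameterization closes the injection hole. It uses Python's built-in sqlite3 module rather than any particular ODBC driver, and the table and hostile input are made up for the example:

```python
import sqlite3

# Illustrative only: the classic injection hole is string concatenation.
# Parameterized queries keep user input out of the SQL text entirely.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")
conn.execute("INSERT INTO users VALUES ('bob', 'bob@example.com')")

user_input = "alice' OR '1'='1"          # hostile input

# Vulnerable: the attacker's quote characters become part of the SQL text.
unsafe_sql = "SELECT email FROM users WHERE name = '" + user_input + "'"

# Safe: the driver binds the value; it can never change the query's shape.
safe_rows = conn.execute(
    "SELECT email FROM users WHERE name = ?", (user_input,)
).fetchall()

print(conn.execute(unsafe_sql).fetchall())  # every row comes back -- the attack works
print(safe_rows)                            # empty -- no user has that literal name
```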
In general, allowing user-defined programs (even functional programs like XQuery or SQL select statements) to be submitted to your service requires non-trivial support from the underlying runtime. Can I put an upper-bound on compute resources for a given query? Can I limit access to extension functions that my engine exposes, perhaps based on security privileges? How about limiting visibility to portions of my underlying data?
Without solving the security problem first, this idea is a non-starter.
These security concerns aren't of the blanket kind; what I mean by this is that one shouldn't take a single implementation scenario and then apply it across the board to XQuery as a whole.
I know how to solve these for the .NET implementation of XPath 1.0. I don't know how to solve these for J. Random XQuery implementation.
[via Don Box's Spoutlet]
This is the point I am making above: Don has a solution for a given implementation scenario, and if that path were taken, his concerns would be alleviated. Likewise, at OpenLink we have XQuery support, and our implementation doesn't suffer from the security concerns raised by Don.
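Purely for illustration, here is one generic way an implementation could bound a user-supplied query. This is a sketch of the idea, not a description of Virtuoso's (or .NET's) actual mechanism; the scoped document, function names, and two-second budget are all invented.

```python
import multiprocessing
from xml.etree import ElementTree

# Sketch of the "upper bound on compute resources" idea: evaluate a
# user-supplied XPath expression in a child process and kill it if it
# runs too long. The document handed to the worker is already scoped to
# what the caller may see, which is one crude answer to the
# data-visibility question as well.

SCOPED_DOC = "<posts><post id='1'>hello</post><post id='2'>world</post></posts>"

def _evaluate(xpath_expr, queue):
    tree = ElementTree.fromstring(SCOPED_DOC)
    queue.put([el.text for el in tree.findall(xpath_expr)])

def run_user_query(xpath_expr, timeout_seconds=2):
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=_evaluate, args=(xpath_expr, queue))
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():          # budget exhausted: terminate and report failure
        worker.terminate()
        worker.join()
        raise TimeoutError("query exceeded its compute budget")
    return queue.get()

if __name__ == "__main__":
    print(run_user_query(".//post"))   # ['hello', 'world']
```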
Dare is correct that, to date, there is no DML aspect to XQuery; however, that sound you hear is the sound of inevitability. There are proposals floating around the net and it's just a matter of time before it happens.
Assuming one could secure such a beast, I'm more optimistic than Dare about the utility of exposing an XQuery head for straight queries (that is, no DML).
Views are the data-centric version of encapsulation. The RDBMS world has shown that views insulate people from the database schema and, for read-only access, give users more flexible access to the store without getting a DBA to install yet another SPROC.
Thus, if XML data is exposed/published on a SQL-XML basis, then it's pretty obvious that what is good for the SQL VIEW is also good for the XQuery data source, since it is derived from a VIEW, or is defined to behave like a SQL VIEW albeit within the context of XML.
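To make the analogy concrete, here is a minimal sketch of that pattern, assuming nothing about any particular engine: the published XML is generated from a SQL view, so the view's definition is what bounds what an outside query can ever see. Table, column, and view names are invented for the example.

```python
import sqlite3
from xml.etree.ElementTree import Element, SubElement, tostring

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER, title TEXT, body TEXT, draft INTEGER);
    INSERT INTO posts VALUES (1, 'Hello', 'first post', 0);
    INSERT INTO posts VALUES (2, 'Secret', 'not ready', 1);
    -- The view is the published surface: drafts and the body column stay hidden.
    CREATE VIEW public_posts AS SELECT id, title FROM posts WHERE draft = 0;
""")

# The XML handed to outside XPath/XQuery clients is derived from the view,
# never from the base table, so the view's scope is the security boundary.
root = Element("posts")
for post_id, title in conn.execute("SELECT id, title FROM public_posts"):
    item = SubElement(root, "post", id=str(post_id))
    item.text = title

print(tostring(root, encoding="unicode"))
# <posts><post id="1">Hello</post></posts>
```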
This Blog is driven by an OpenLink Virtuoso Engine, and everything you read is coming out of a SQL Engine that transforms data to XML in a variety of ways (leveraging the fact that this DB Engine has an in-built XSLT processor, among other things).
I can expose any portion of my blog to the public for XQuery or XPath queries, and control resource consumption and security via the underlying definition of the XML docs (i.e. scope of data covered, and the manner in which the data is materialized) that I choose to expose.
At the end of the day, these issues are architecture- and implementation-specific.
I firmly believe that XQuery is a critical inflection technology that ushers in the imminent replacement of today's Single-Protocol and Single-Function Web Servers by Multi-Protocol and Multi-Function Web Servers.
All of today's HTTP URLs are capable of becoming Query Execution entry points and more.
Planet RDF is an aggregate of the weblogs of software developers in and around the semantic web community. We hope both to take advantage of the community that exists, and also to foster more collaboration between independent developers.
Although by nature not always 100% focused on semantic web content, it provides a great snapshot of the work being done and new web sites of interest to those working on the semantic web.
The participant weblogs are sourced from Dave Beckett's Semantic Web bloggers list, http://journal.dajobe.org/journal/2003/07/semblogs/ , with a bit of additional editorial control to keep the web site focused loosely on topic. Send mail to Dave, dave.beckett@bristol.ac.uk, if you think you have a blog (with a valid RSS 1.0 feed, naturally) that we'd be interested in, and we'll check it out.
For the technically curious: web standards are used as much as possible and the usual eclectically invalid HTML from weblogs has been cleaned up to be as near XHTML-valid as we could muster, both in the web page and the aggregated RDF, http://planetrdf.com/index.rdf
Planet RDF was developed by Matt Biddulph, Dave Beckett and Phil McCarthy.
Databases get a grip on XML
From InfoWorld.
The next iteration of the SQL standard was supposed to arrive in 2003. But SQL standardization has always been a glacially slow process, so nobody should be surprised that SQL:2003 -- now known as SQL:200n -- isn't ready yet. Even so, 2003 was a year in which XML-oriented data management, one of the areas addressed by the forthcoming standard, showed up on more and more developers' radar screens.
XForms Freebie: First Eric van der Vlist made his RELAX NG book freely available, and now Micah Dubinko has done the same for XForms.
RELAX NG is a book in progress written by Eric van der Vlist for O'Reilly and submitted to an open review process. The result of this work will be freely available on the World Wide Web under a Free Documentation Licence (FDL).
The subject of this book, RELAX NG (http://relaxng.org), is an XML schema language developed by the OASIS RELAX NG Technical Committee and recently accepted as Draft International Standard 19757-2 by the Document Description and Processing Languages subcommittee (DSDL) of the ISO/IEC Joint Technical Committee 1 (ISO/IEC JTC 1/SC 34/WG 1).
[via Lost Boy]
I've been talking a lot about Mono.Security but until today I didn't realize that it was never officially introduced - at least in my blog.
The only existing introduction is Mono's Crypto status page - which, BTW, is a great place to learn what's in and/or out of Mono's cryptography.
<lazy-geek:copy-n-paste>
Rationale: This assembly provides the missing pieces to .NET security. On Windows, CryptoAPI is often used to provide much needed functionality (like some cryptographic algorithms, code signing, X.509 certificates). Mono, for platform independence, implements these functionalities in 100% managed code.
</lazy-geek:copy-n-paste>
The most important piece of information is 100% managed code. This means that Mono.Security isn't tied to the Mono runtime and/or specific class library - you're free (really it's MIT X11 licensed) to use it on any runtime you choose.
Structures: System.Security.Cryptography.Pkcs (in .NET 1.2)
02 Dec 2003: Mono 0.29 has been released
This release took us a long time to get out, but it is pretty exciting, with PPC supported. The best Mono release ever! [via Monologue]
This time last year Mono enabled us to deliver a release of Virtuoso that unveiled the power of .NET integration as a database extension mechanism on Windows and Linux along the following lines: User Defined Types, User Defined Functions, and Stored Procedures using any .NET-bound language. It also enabled the deployment of ASP.NET applications on Linux, and on Windows without IIS. One item missing from my checklist at the time was a Virtuoso release for Mac OS X with identical functionality.
This announcement implies we are within striking distance of a Virtuoso 3.2 release that enables .NET classes and frameworks utilization (along the lines described above) on Mac OS X.
I hope other diagrams will be as clear as this, especially the ones relating to actual storage :-)
This further illuminates the content of my earlier post on this subject.
The Mono Roadmap and Mono Hackers Roadmap have been released.
Reading the Longhorn SDK docs is a disorienting experience. Everything's familiar but different. Consider these three examples:
[Full story: Replace and defend via Jon's Radio]
"Replace & Defend" is certainly a strategy that would have awakened the entire non Microsoft Developer world during the recent PDC event. I know these events are all about preaching to the choir (Windows only developers), but as someone who has worked with Microsoft technologies as an ISV since the late 80's there is something about this events announcements that leave me concerned.
Ironically these concerns aren't about the competitive aspects of their technology disruptions, but more along the lines of how
Yukon's top 30 features are now available for perusal.
As I read through the top 10 Developer features I realized that I might as well link each feature to an existing Live Virtuoso Tutorial/Demo. Virtuoso and Yukon compete in some realms, but more importantly they espouse a common strategic vision re. Unified Data Storage and the next Database Technology Frontiers (basically demonstrating the universality of science).
Feature | Description
.NET Framework Hosting | With SQL Server "Yukon," you will be able to create database objects using familiar languages such as Microsoft Visual C#
Every year, as new hard disks get bigger and faster, applications catch up by producing more data. Hard disks are commonly used to store personal information: correspondence, personal contacts, and work documents. These items are currently treated as separate entities, yet they are interrelated on some level; and it's no surprise that e-mail comes from your personal contacts list and influences the work that you should be doing and hence determines the documents that you'll create. When you have a large number of items, it is important to have a flexible and efficient mechanism to search for particular items based on their properties and content. Up until now, storage mechanisms like Outlook
I finally have two live servers that demonstrate Virtuoso
There is a new HOWTO document that addresses an area of frequent confusion on Mac OS X, which is how to build PHP with an ODBC data access layer binding (the iODBC variant) using Mac OS X Frameworks as opposed to Darwin shared libraries.
NETWORK WORLD NEWSLETTER: MARK GIBBS ON WEB APPLICATIONS
Today's focus: A Virtuoso of a server
By Mark Gibbs
One of the bigger drags of Web applications development is that building a system of even modest complexity is a lot like herding cats - you need a database, an applications server, an XML engine, etc., etc. And as they all come from different vendors you are faced with solving the constellation of integration issues that inevitably arise.
If you are lucky, your integration results in a smoothly functioning system. If not, you have a lot of spare parts flying in loose formation with the risk of a crash and burn at any moment.
An alternative is to look for all of these features and services in a single package but you'll find few choices in this arena.
One that is available and looks very promising is OpenLink's Virtuoso (see links below).
Virtuoso is described as a cross platform (runs on Windows, all Unix flavors, Linux, and Mac OS X) universal server that provides databases, XML services, a Web application server and supporting services all in a single package.
OpenLink's list of supported standards is impressive and includes .Net, Mono, J2EE, XML Web Services (Simple Object Access Protocol, Web Services Description Language, WS-Security, Universal Description, Discovery and Integration), XML, XPath, XQuery, XSL-T, WebDav, HTTP, SMTP, LDAP, POP3, SQL-92, ODBC, JDBC and OLE-DB.
Virtuoso provides an HTTP-compliant Web Server; native XML document creation, storage and management; a Web services platform for creation, hosting and consumption of Web services; content replication and synchronization services; a free-text index server; mail delivery and storage; and an NNTP server.
Another interesting feature is that with Virtuoso you can create Web services from existing SQL Stored Procedures, Java classes, C++ classes, and 'C' functions, as well as create dynamic XML documents from ODBC and JDBC data sources.
This is an enormous product and implies a serious commitment on the part of adopters due to its scope and range of services.
Surprisingly, given the scale of the product, its price seems very reasonable, starting at $5,000.
RELATED EDITORIAL LINKS
OpenLink Software
OpenLink Virtuoso
http://www.openlinksw.com/virtuoso/
Virtuoso pricing
By Bryce Curtis and Jim Hsu, IBM developerWorks
Many portable devices let mobile users send and receive e-mail over a wireless network. These portable devices include Short Message Service (SMS)-enabled devices, two-way pagers, cellular phones with e-mail service, and portable networked laptops or Personal Data Assistants (PDA) with e-mail.
Although these devices can send and receive e-mail messages, they cannot yet access and run Web applications and Web services. The predominant Web application client is the browser. However, as these portable devices become increasingly popular, using their e-mail capabilities to access the growing number of Web services and Web applications becomes increasingly beneficial. In this article, we detail an e-mail user interface that can interact with a Web application in a manner similar to that of a Web browser. In the architecture we propose, the HTML model combines with e-mail technology by routing incoming e-mails to a Web application server.
http://www-106.ibm.com/developerworks/webservices/library/wi-email/
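As a rough sketch of the routing idea described in the article (not the authors' actual implementation), an inbound message can be parsed with stock library code and its body replayed as a form POST against the web application. The address, field format, and endpoint below are invented purely for illustration.

```python
import email
from urllib import parse, request

# An inbound message is parsed, its body is treated as a form submission,
# and the fields are POSTed to the web application on the user's behalf.
RAW_MESSAGE = b"""From: user@example.com
To: orders@example.com
Subject: order

item=book
quantity=2
"""

msg = email.message_from_bytes(RAW_MESSAGE)
fields = dict(
    line.split("=", 1)
    for line in msg.get_payload().splitlines()
    if "=" in line
)

body = parse.urlencode(fields).encode()
req = request.Request("http://app.example.com/order", data=body, method="POST")
# request.urlopen(req)  # would submit the form; the reply could be mailed back
print(fields)  # {'item': 'book', 'quantity': '2'}
```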
Strangely enough, when there is a reference to my blog, the URLs are broken because they actually point to articles from my internal blog (which is part of a private net behind a firewall). Now, I do actually blog behind my corporate or home firewall (depending on my location at the time of blogging), and when I blog, the actual typing and editing occurs within a single blog editor (typically Zempt, w.bloggar, or Newzcrawler). My blog posts are propagated (conditionally, using upstream rules via the Virtuoso Blog Engine) to many surrogate blogs such as the ones listed above, which may or may not be Virtuoso based; they just need to support blog post APIs such as Movable Type, MetaWeblog, or Blogger.
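For illustration, a propagation step against one surrogate blog might look like the following; the endpoint URL, blog id, and credentials are placeholders, and the only assumption is a server that implements the standard metaWeblog.newPost call.

```python
import xmlrpc.client

# A hedged sketch of pushing one post to a surrogate blog over the MetaWeblog
# API mentioned above. Nothing here is specific to the Virtuoso Blog Engine.
post = {
    "title": "Propagated post",
    "description": "<p>Body written once, pushed to each surrogate blog.</p>",
}

server = xmlrpc.client.ServerProxy("http://blog.example.com/xmlrpc")
# Standard MetaWeblog signature: blogid, username, password, struct, publish.
# new_post_id = server.metaWeblog.newPost("1", "user", "secret", post, True)
```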
Anyway, I need to know if there is something about my blog that is tripping up Feedster.
The thing that most surprised me today in the SoftEdge panel on Social Software was the reaction to RSS. I should be clear that I am an RSS true believer. It seems to me that metadata as a byproduct of social software engines (be it blogging or social networking or whatever) is not only enviable, it is inevitable. RSS and FOAF and other yet-to-be-determined social software data protocols will become standards because it simply makes good sense for them to be standardized. Anyone paying attention to the unbelievable development and adoption curve of wireless can appreciate the immense value driven by standards -- and, in particular, standards that are truly standard. So it came as a bit of a shock to me that when I questioned the panelists on the implications of RSS and the Semantic Web, they were less sold on the inevitability of it all.
When asked the question of whether the proliferation of RSS and FOAF might make it possible for reader technology to be the next killer application in knowledge management, I got very strong reactions from both Reid Hoffman and Meg Hourihan. Reid stated that he did not believe that RSS was sufficiently robust to provide significant value at any level. Meg followed up with a general indictment of the semantic web, which she views merely as a geek utopia. I will admit that I'm a fan of Candide (particularly at the hands of Bernstein), but I hardly view myself as Pangloss. One need look no further than, for example, the tools that Oddpost has incorporated into its web email client to allow an integrated email and blog experience. Better yet, through a relatively simple web service, Oddpost can deliver an RSS feed of a particular Google News search so that you can keep track of keywords that are of interest to you without having to visit Google repeatedly to find out if your company or candidate or favorite band has been mentioned in today's news. The same is true of watch lists on Technorati. Rather than periodically check to see if someone has linked to your blog, Technorati will do the work for you and deliver the info to your inbox only when there is information to be delivered. These examples are just the tip of the iceberg, but they demonstrate the nascent power of RSS and related standards. I'll have to wait for another panel to have that argument with Reid and Meg.
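As a tiny illustration of the watch-list pattern (not Oddpost's or Technorati's actual code), polling a feed and filtering on keywords takes only a few lines; the feed URL and keywords below are placeholders.

```python
import feedparser  # third-party: pip install feedparser

# Poll an RSS feed and surface only the entries mentioning the keywords you
# care about, instead of revisiting the site yourself.
FEED_URL = "http://news.example.com/rss"
KEYWORDS = ("virtuoso", "semantic web")

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
    if any(keyword in text for keyword in KEYWORDS):
        print(entry.get("title"), "->", entry.get("link"))
```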