OpenLink Virtuoso (Product Blog)

#ods_bar { margin: 0; padding: 0; width: 100%; float: left; clear: both; color: #444; font-size: 9pt; font-family: Arial, Helvetica, sans-serif; background-color: #ddeff9} #ods_bar ul { list-style-type: none} #ods_bar ul li { display: inline} #ods_bar a { text-decoration: none; color: inherit} #ods_bar img { float: none; border: 0; margin: 0} #ods_bar input { margin-right: 8px; font-size: 7pt; color: #555;} #ods_bar_handle { width: 10px; float: left} #ods_bar_content { float: left; width: 100%; background-color: #ddeff9} #ods_bar_top { float: left; width: 100%; background-color: #fff} #ods_bar_bot { float: left; clear: left; width: 100%; padding-top: 2px; padding-bottom: 2px; background-color: #85b9d2} #ods_bar_top_cmds { font-size: 7.5pt; margin-top: 4px; color: #42abc4; background-color: #fff; float: right; padding-right: 8px} #ods_bar_top_cmds img { vertical-align: middle;} #ods_bar_top_cmds a { text-decoration: none} #ods_bar_top_cmds a.user_profile_lnk { text-transform: none} #ods_bar_first_lvl { float: left; padding: 0; margin: 0; color: #fff; background: #0075A8 url("/ods/images/navlv1default.png")} #ods_bar_first_lvl li { padding: 0; padding-left: 4px; margin: 0} #ods_bar_first_lvl li a { margin-top: 0px; padding: 6px 3px 6px 3px; vertical-align: middle; color: #fff; /* Required due to buggy CSS in IE */} #ods_bar_first_lvl li a img { margin-top: 2px; margin-bottom: 5px; vertical-align: middle;} #ods_bar_first_lvl li.sel a { color: #455; background: #b1d4e5 url("/ods/images/navlv1sel.png")} #ods_bar_second_lvl { width: 100%; height: 20px; float: left; clear: left; margin: 0; padding: 0; padding-top: 4px; background: #ddeff9 url("/ods/images/navlv2default.png")} #ods_bar_second_lvl li { margin-right: 5px} #ods_bar_second_lvl li:first-child { margin-left: 27px;} #ods_bar_second_lvl li a { vertical-align: middle; color: #444; /* Required by buggy IE CSS implementation */ } #ods_bar_home_path { margin: 2px 0px 0px 36px; padding: 0; font-size: 8pt} .popup { position: absolute; background-color: #fff; border: 1px dotted #4800F4; padding: 0.5em; font-size: 80%; } #ods_bar_odslogin { font-size: 7.5pt; margin-top: 4px; color: #42abc4; background-color: #fff; float: right; padding-right: 8px; } #ods_bar_odslogin img { vertical-align: middle; margin-left: 8px; } #ods_bar_odslogin a { margin-left: 3px; color: inherit; text-decoration: none; }

Entries: [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 ]

Details

Virtuoso Data Space Bot

Burlington, United States

FOAF

Full profile

OCS 0.5

OPML 1.0

XBEL

Multimedia

Videos

Audio

Images

iTunes Subscription

Post Categories

ALL

Enterprise Application Integration

Enterprise Information Integration

HTTP & WebDAV

SQL Database

SQL/XML (SQLX)

Universal Server

Virtual Database Technology

Web Services Platform

Weblog Technology

XML Database (XSL-T, XPath, XQuery, and XML Schema)

Display Settings

articles per page.

order.

Benchmarks, Redux (part 14): BSBM BI Mix

In this post, we look at how we run the BSBM-BI mix. We consider the 100 Mt and 1000 Mt scales with Virtuoso 7 using the same hardware and software as in the previous posts. The changes to workload and metric are given in the previous post.

Our intent here is to look at whether the metric works, and to see what results will look like in general. We are as much testing the benchmark as we are testing the system-under-test (SUT). The results shown here will likely not be comparable with future ones because we will most likely change the composition of the workload since it seems a bit out of balance. Anyway, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so the interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted.

Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.

Below we show samples of test driver output; the whole output is downloadable.

100 Mt Single User

bsbm/testdriver   -runs 1   -w 0 -idir /bs/1  -drill  \  
   -ucf bsbm/usecases/businessIntelligence/sparql.txt  \  
   -dg http://bsbm.org http://localhost:8604/sparql

0: 43348.14ms, total: 43440ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    43.3481s / 43.3481s
Elapsed runtime:        43.348 seconds
QMpH:                   83.049 query mixes per hour
CQET:                   43.348 seconds average runtime of query mix
CQET (geom.):           43.348 seconds geometric mean runtime of query mix
AQET (geom.):           0.492 seconds geometric mean runtime of query
Throughput:             1494.874 BSBM-BI throughput: qph*scale
BI Power:               7309.820 BSBM-BI Power: qph*scale (geom)

100 Mt 8 User

Thread 6: query mix 3: 195793.09ms, total: 196086.18ms
Thread 8: query mix 0: 197843.84ms, total: 198010.50ms
Thread 7: query mix 4: 201806.28ms, total: 201996.26ms
Thread 2: query mix 5: 221983.93ms, total: 222105.96ms
Thread 4: query mix 7: 225127.55ms, total: 225317.49ms
Thread 3: query mix 6: 225860.49ms, total: 226050.17ms
Thread 5: query mix 2: 230884.93ms, total: 231067.61ms
Thread 1: query mix 1: 237836.61ms, total: 237959.11ms
Benchmark run completed in 237.985427s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    195.7931s / 237.8366s
Total runtime (sum):    1737.137 seconds
Elapsed runtime:        1737.137 seconds
QMpH:                   121.016 query mixes per hour
CQET:                   217.142 seconds average runtime of query mix
CQET (geom.):           216.603 seconds geometric mean runtime of query mix
AQET (geom.):           2.156 seconds geometric mean runtime of query
Throughput:             2178.285 BSBM-BI throughput: qph*scale
BI Power:               1669.745 BSBM-BI Power: qph*scale (geom)

1000 Mt Single User

0: 608707.03ms, total: 608768ms

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    608.7070s / 608.7070s
Elapsed runtime:        608.707 seconds
QMpH:                   5.914 query mixes per hour
CQET:                   608.707 seconds average runtime of query mix
CQET (geom.):           608.707 seconds geometric mean runtime of query mix
AQET (geom.):           5.167 seconds geometric mean runtime of query
Throughput:             1064.552 BSBM-BI throughput: qph*scale
BI Power:               6967.325 BSBM-BI Power: qph*scale (geom)

1000 Mt 8 User

bsbm/testdriver   -runs 8 -mt 8  -w 0 -idir /bs/10  -drill  \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt   \
   -dg http://bsbm.org http://localhost:8604/sparql

Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms
Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms
Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms
Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms
Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms
Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms
Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms
Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms
Benchmark run completed in 2889.302566s

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    2211.2753s / 2889.1992s
Total runtime (sum):    20481.895 seconds
Elapsed runtime:        20481.895 seconds
QMpH:                   9.968 query mixes per hour
CQET:                   2560.237 seconds average runtime of query mix
CQET (geom.):           2544.284 seconds geometric mean runtime of query mix
AQET (geom.):           13.556 seconds geometric mean runtime of query
Throughput:             1794.205 BSBM-BI throughput: qph*scale
BI Power:               2655.678 BSBM-BI Power: qph*scale (geom)

Metrics for Query:      1
Count:                  8 times executed in whole run
Time share              2.120884% of total execution time
AQET:                   54.299656 seconds (arithmetic mean)
AQET(geom.):            34.607302 seconds (geometric mean)
QPS:                    0.13 Queries per second
minQET/maxQET:          11.71547600s / 148.65379700s

Metrics for Query:      2
Count:                  8 times executed in whole run
Time share              0.207382% of total execution time
AQET:                   5.309462 seconds (arithmetic mean)
AQET(geom.):            2.737696 seconds (geometric mean)
QPS:                    1.34 Queries per second
minQET/maxQET:          0.78729800s / 25.80948200s

Metrics for Query:      3
Count:                  8 times executed in whole run
Time share              17.650472% of total execution time
AQET:                   451.893890 seconds (arithmetic mean)
AQET(geom.):            410.481088 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          171.07262500s / 721.72939200s

Metrics for Query:      5
Count:                  32 times executed in whole run
Time share              6.196565% of total execution time
AQET:                   39.661685 seconds (arithmetic mean)
AQET(geom.):            6.849882 seconds (geometric mean)
QPS:                    0.18 Queries per second
minQET/maxQET:          0.15696500s / 189.00906200s

Metrics for Query:      6
Count:                  8 times executed in whole run
Time share              0.119916% of total execution time
AQET:                   3.070136 seconds (arithmetic mean)
AQET(geom.):            2.056059 seconds (geometric mean)
QPS:                    2.31 Queries per second
minQET/maxQET:          0.41524400s / 7.55655300s

Metrics for Query:      7
Count:                  40 times executed in whole run
Time share              1.577963% of total execution time
AQET:                   8.079921 seconds (arithmetic mean)
AQET(geom.):            1.342079 seconds (geometric mean)
QPS:                    0.88 Queries per second
minQET/maxQET:          0.02205800s / 40.27761500s

Metrics for Query:      8
Count:                  40 times executed in whole run
Time share              72.126818% of total execution time
AQET:                   369.323481 seconds (arithmetic mean)
AQET(geom.):            114.431863 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          5.94377300s / 1824.57867400s

The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization.

With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit. If the single-user run was at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact. Core multi-threading does not seem to hurt, at the very least. Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good.

For reference, we show a run with Virtuoso 6 at 100Mt.

0: 424754.40ms, total: 424829ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    424.7544s / 424.7544s
Elapsed runtime:        424.754 seconds
QMpH:                   8.475 query mixes per hour
CQET:                   424.754 seconds average runtime of query mix
CQET (geom.):           424.754 seconds geometric mean runtime of query mix
AQET (geom.):           1.097 seconds geometric mean runtime of query
Throughput:             152.559 BSBM-BI throughput: qph*scale
BI Power:               3281.150 BSBM-BI Power: qph*scale (geom)

and 8 user

Thread 5: query mix 3: 616997.86ms, total: 617042.83ms
Thread 7: query mix 4: 625522.18ms, total: 625559.09ms
Thread 3: query mix 7: 626247.62ms, total: 626304.96ms
Thread 1: query mix 0: 629675.17ms, total: 629724.98ms
Thread 4: query mix 6: 667633.36ms, total: 667670.07ms
Thread 8: query mix 2: 674206.07ms, total: 674256.72ms
Thread 6: query mix 5: 695020.21ms, total: 695052.29ms
Thread 2: query mix 1: 701824.67ms, total: 701864.91ms
Benchmark run completed in 701.909341s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    616.9979s / 701.8247s
Total runtime (sum):    5237.127 seconds
Elapsed runtime:        5237.127 seconds
QMpH:                   41.031 query mixes per hour
CQET:                   654.641 seconds average runtime of query mix
CQET (geom.):           653.873 seconds geometric mean runtime of query mix
AQET (geom.):           2.557 seconds geometric mean runtime of query
Throughput:             738.557 BSBM-BI throughput: qph*scale
BI Power:               1408.133 BSBM-BI Power: qph*scale (geom)

Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput.

 100 Mt: sqrt ( 7771 * 2178 ); = 4114

1000 Mt: sqrt ( 6967 * 1794 ); = 3535

Scaling seems to work; the results are in the same general ballpark. The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable.

Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.

Also there should be more queries.

At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people.

Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an IN into a JOIN with the IN subquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimized HASH JOIN which besides must be partitioned at the larger scales.

Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.

In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.

BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a JOIN and the cardinality of a GROUP BY; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get reasonable JOIN order since we always reorder the query regardless of the source formulation.

BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different JOIN orders for different parameter values. I have not looked into whether this really makes a difference, though.

There are places in BSBM-BI where using a HASH JOIN makes sense. We do not use HASH JOINs with RDF because there is an index for everything and making a HASH JOIN in the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not do HASH JOINs. This said, a HASH JOIN in the right place is a lot better than an index lookup. With TPC-H Q13, our best HASH JOIN is over 2x better than the best INDEX-based JOIN, both being well tuned. For questions like "count the hairballs made in Germany reviewed by Japanese Hello Kitty fans," where two ends of a JOIN path are fairly selective doing the other as a HASH JOIN is good. This can, if the JOIN is always cardinality-reducing, even be merged inside an INDEX lookup. We have such capabilities since we have been for a while gearing up for the relational races, but are not using any of these with BSBM-BI, although they would be useful.

Let us see the profile for a single user 100 Mt run.

The database activity summary is --

select db_activity (0, 'http');

161.3M rnd 210.2M seq 0 same seg 104.5M same pg 45.08M same par 0 disk 0 spec disk 0B / 0 messages 2.393K fork

See the post "What Does BSBM Explore Measure" for an explanation of the numbers. We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get --

172.4M rnd 220.8M seq 0 same seg 149.6M same pg 10.99M same par 21 disk 861 spec disk 0B / 0 messages 754 fork

The throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to correctly tune the self-adjust logic, and we have again clear gains.

Let us now go back to the first run with vector size 10000.

The top of the CPU oprofile is as follows:

722309   15.4507  cmpf_iri64n_iri64n
434791    9.3005  cmpf_iri64n_iri64n_anyn_iri64n
294712    6.3041  itc_next_set
273488    5.8501  itc_vec_split_search
203970    4.3631  itc_dive_transit
199687    4.2714  itc_page_rcf_search
181614    3.8848  dc_itc_append_any
173043    3.7015  itc_bm_vec_row_check
146727    3.1386  cmpf_int64n
128224    2.7428  itc_vec_row_check
113515    2.4282  dk_alloc
97296     2.0812  page_wait_access
62523     1.3374  qst_vec_get_int64
59014     1.2623  itc_next_set_parent
53589     1.1463  sslr_qst_get
48003     1.0268  ds_add
46641     0.9977  dk_free_tree
44551     0.9530  kc_var_col
43650     0.9337  page_col_cmp_1
35297     0.7550  cmpf_iri64n_iri64n_anyn_gt_lt
34589     0.7399  dv_compare
25864     0.5532  cmpf_iri64n_anyn_iri64n_iri64n_lte
23088     0.4939  dk_free

The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with P and S given. The one after that is with all parts given, corresponding to an existence test. The existence tests could probably be converted to HASH JOIN lookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile, GROUP BY seems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns.

A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set. Some code sections in the queries with conditional execution and costly tests inside ANDs and ORs would be good. TPC-H has such in Q21 and Q19. An OR with existences where there would be gain from good guesses of a subquery's selectivity would be appropriate. Also, there should be conditional expressions somewhere with a lot of data, like the CASE-WHEN in TPC-H Q12.

We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from HASH JOIN, both small and large. There should be such places in the workload already so this is a matter of just playing a bit more.

This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.

Benchmarks, Redux Series

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

03/22/2011 18:31 GMT-0500

Modified: 03/22/2011 17:04 GMT-0500

Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs

In the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by BSBM.

There are two approaches:

run twice or otherwise make sure one runs from memory and forget about I/O, or
make rules and metrics for warm-up.

We will see if the second is possible with BSBM.

From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference.

In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.

Storage Arrays
Type	Quantity	Maker	Size	Speed	Interface speed	Controller	Drive Cache	RAID
SSD	4	Crucial	128 GB	N/A	6Gbit SATA	RocketRaid 640	128 MB	None
HDD	4	Samsung	1000 GB	7200 RPM	3Gbit SATA	Intel ICH on Supermicro motherboard	16 MB	None

We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file > /dev/null`.

The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.

Two different read-ahead schemes are used:

With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.
With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.

In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.

There are a few different possibilities for the physical I/O:

Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.
A thread finds it needs a page and reads it.
Using Unix asynchronous I/O, aio.h, with the aio_* and lio_listio functions.
Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.

The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.

There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.

While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.

The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.

This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.

The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.

The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.

The database space allocation gives each index a number of 2MB segments, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range a the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store.

For the sake of simplicity we only run 7 Single with the 1000 Mt scale.

The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.

The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.

The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here, are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.

There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.

We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so.

Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want.

So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see.

So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.

The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput.

Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more.

BSBM Note

We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum.

Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.

A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.

Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.

Benchmarks, Redux Series

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

03/07/2011 14:17 GMT-0500

Modified: 03/14/2011 17:56 GMT-0500

Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire

Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning.

Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e,.g. web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.

The following three settings are all in the [Parameters] section of the virtuoso.ini file.
- AsyncQueueMaxThreads controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better.
- ThreadsPerQuery is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better.
- IndexTreeMaps is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial.
  
  A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.
In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.

Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count.
File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.

It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.

in the [Parameters] section of the virtuoso.ini file, set FDsPerFile to be (the number of concurrent threads * 1.5) ÷ the number of distinct database files.

There are no SSD specific settings.
Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?

Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.

Use the built-in bulk load facility, i.e.,

ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');

For example,

SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');

Then do a rdf_loader_run () on enough connections. For example, you can use the shell command

isql rdf_loader_run () &

to start one in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see isql documentation for details).

See the BSBM disclosure report for an example load script.
What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.

Execute

CHECKPOINT;

through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol.
What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.

No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

DefaultIsolation = 4

Transaction isolation cannot be set over the SPARQL protocol.

NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.
What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.

In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.
What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache?

In an isql session, execute

STATUS ( ? ? );

The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format.
What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.

Execute on an isql session:
CHECKPOINT; SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported, gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, RDF_OBJ are for dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for text index.
If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should be use an alternate index scheme? Most of the data will be in a single big graph.

The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute

STATISTICS DB.DBA.RDF_QUAD;
For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index?

The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order.
For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?

In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of 10000 to 50000 or so if this is seen to be useful.

To change this on the fly, the following can be issued through an isql session:

cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );

The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.
```
 STATUS ('cluster')      ;; whole cluster 
 STATUS ('cluster_d')    ;; process-by-process
   
```
Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?
- For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini, with
  
  StopCompilerWhenXOverRun = 1
- When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with
  
  DefaultIsolation = 2

Benchmarks, Redux Series

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

03/04/2011 15:28 GMT-0500

Modified: 03/14/2011 17:56 GMT-0500

Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore

In this post I will summarize the figures for BSBM Load and Explore mixes at 100 Mt, 200 Mt, and 1000 Mt. (1 Mt = 1 Megatriple, or one million triples.) The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs. The exact specifications and configurations are in the raw reports to follow.

The load time in the recent Berlin report was measured with the wrong function, and so far as we can tell, without multiple threads. The intermediate cut of Virtuoso they tested also had broken SPARQL/Update (also known as SPARUL) features. We have fixed this since, and give here the right numbers.

In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:

6 Single is the generally available single server configuration of Virtuoso. Whether this is open source or not does not make a difference.
6 Cluster is the generally available commercial only cluster-capable Virtuoso.
7 Single is the next generation single server Virtuoso, about to be released as a preview.

To understand the numbers, we must explain how these differ from each other in execution:

6 Single has one thread-per-query, and operates on one state of the query at a time.
6 Cluster has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states. Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together.
7 Single has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states. This means, for example, that index lookups get large numbers of parameters which then are sorted to get an ascending search pattern which benefits from locality, so the n * log(n) index access for the batch becomes more like linear if the data accessed has any locality. Furthermore, if there are many operands to an operator, these can be split on multiple threads. Also, scans of consecutive rows can be split before the scan on multiple threads, each doing a range of the scan. These features are called vectored execution and query parallelization. These techniques will also be applied to the cluster variant in due time.

The version 6 and 7 variants discussed here use the same physical storage layout with row-wise key compression. Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space. This column store option is not used here because it still has some problems with random order inserts.

We will first consider loading. Below are the load times and rates for 7 at each scale.

7 Single
Scale	Rate (quads per second)	Load time (seconds)	Checkpoint time (seconds)
100 Mt	261,366	301	82
200 Mt	216,000	802	123
1000 Mt	130,378	6641	1012

In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale.

We also loaded the smallest data set with 6 Single using the same load script.

6 Single
Scale	Rate (quads per second)	Load time (seconds)	Checkpoint time (seconds)
100 Mt	74,713	1192	145

CPU time with 6 Single was 8047 seconds. We compare this to 4453 seconds of CPU for the same load on 7 Single. The CPU% during the run was on either side of 700% for 6 Single and 1300% for 7 Single. Note that high percentages involve core threads, not real cores.

The difference is mostly attributable to vectoring and the introduction of a non-transactional insert. The 6 Single inserts transactionally but makes very frequent commits and writes no log, resulting in de facto non-transactional behavior but still there is a lock and commit cycle. Inserts in RDF load usually exhibit locality on all SPOG. Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go. Contention on page read-write locks is less because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possible transaction locks for each row.

Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed. In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful. Writes are all in-place, and no delta-merge mechanism is involved. For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block. Repeatable and serializable readers would block before an uncommitted insert.

Now for the run (larger numbers indicate more queries executed, and are therefore better):

6 Single Throughput (QMpH, query mixes per hour)
Scale	Single User	16 User
100 Mt	7641	29433
200 Mt	6017	13335
1000 Mt	1770	2487

7 Single Throughput (QMpH, query mixes per hour)
Scale	Single User	16 User
100 Mt	11742	72278
200 Mt	10225	60951
1000 Mt	6262	24672

The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state. Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed. For the memory-only scales, we run 500 mixes twice, and take the timing of the second run.

Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%. This also explains why adding clients gives a larger boost at the smaller scale.

Now let us look at the relative effects of parallelizing and vectoring in 7 Single. We run 50 mixes of Single User Explore: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread. Then we set the vector size to 1, meaning that the query pipeline runs one row at a time. This gets us 1319 QMpH which is a bit worse than 6 Single. This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps.

The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least. The reason for the latter is covered in detail in A Benchmarking Story. We note that while vectoring is primarily geared to better single-thread speed and better cache hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt.

In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring. When moving to more complex workloads, the benefits become more pronounced. For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring. These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later.

The full run details will be supplied at the end of this blog series.

Benchmarks, Redux Series

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

03/02/2011 18:23 GMT-0500

Modified: 03/14/2011 17:16 GMT-0500

VLDB Semdata Workshop - The New Frontier of Semdata

This is a revised version of the talk I will be giving at the Semdata workshop at VLDB 2010.

The paper shows how we store TPC-H data as RDF with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in Virtuoso.

A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases.

The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics RDBMS, point solutions done with map-reduce or the like, and lastly in some cases from key-value stores with relaxed schema but limited querying.

The benefits of RDF are the ever expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability — you can run an RDF database on a cluster — but a question of relative cost as opposed to alternatives.

The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store, and gets all the compression and locality advantages traditionally associated with such. After memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption.

I do not need to talk here about the benefits of linked data and more or less ad hoc integration per se. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso.

But to realize the ultimate promise of RDF/Linked Data/Semdata, however we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer so much running against the RDBMS, but against map-reduce and key-value stores.

Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest.

The key-value store has some of the same appeal, as it is the DBMS laid bare, so to say, made understandable, without the again intractably-complex questions of fancy query planning and distributed ACID transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard code a point solution for the business question, if really must — maybe even in map-reduce.

Besides, for some things that go beyond SQL (for example, with graph structures), there really isn't a good solution.

Now, enter Vertica, Greenplum, VectorWise (a MonetDB project derivative from Ingres) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible.

Here we find the next frontier of Semdata. Take Joe Hellerstein et al's work on declarative logic for the data centric data center.

We have heard it many times — when the data is big, the logic must go to it. We can take declarative, location-conscious rules, à la BOOM and BLOOM, and combine these with the declarative query, well-defined semantics, parallel-database capability of the leading RDF stores. Merge this with locality compression and throughput from the best analytics DBMS.

Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability.

Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, à la MonetDB, are applicable with minimal if any adaptation.

Such a hybrid database-fusion frontier is relevant because it addresses heterogenous, large-scale data, with operations that are not easy to reduce to SQL, still without loss of the advantages of SQL. Apply this to anything from enhancing the business intelligence process by faster integration, including integration with linked open data to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data.

In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power.

Last week I was at the LOD2 kick off and a LarKC meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

09/13/2010 18:09 GMT-0500

Modified: 09/21/2010 10:52 GMT-0500

"The Acquired, The Innate, and the Semantic" or "Teaching Sem Tech"

I was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.

I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech and leave management to those who have time for it.

When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.

In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples, for example my wife's grandmother had decided to transform a regular willow into a weeping one by tying down the branches. Such "magic," needless to say, takes constant maintenance; else the spell breaks.

To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow.

Now, in more specific terms, what can we realistically expect to teach about computer science?

Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.

Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.

Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.

I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.

Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.

LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.

Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.

I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:

The edge is created in the "Wild West" — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."

So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.

But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.

Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.
Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.

The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.
Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.

Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.

So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?

Data integration. Given heterogenous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.
Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.
Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.
Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seem about inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.
Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.

Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.

Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.

The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.

We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.

If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.

The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered.

Related

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

04/05/2010 11:21 GMT-0500

Modified: 05/05/2010 13:49 GMT-0500

Updated hardware improves LUBM 8000 load rate in Virtuoso 6

We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.

The load rate is now 160,739 triples-per-second.

	Virtuoso 6 (previous run)	Virtuoso 6 (new run)	Virtuoso 6 (newest run)
blades	1	1	2
processors	2 x Xeon 5410	2 x Xeon 5520	2 x Xeon 5520 + 2 x Xeon 5410 with 1x1GigE interconnect
memory	16G 667 MHz	72G 1333 MHz	72G 1333 MHz + 16G 667 MHz respectively
reported load rate triples-per-second	110,532	160,739	214,188

Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

08/14/2009 19:01 GMT-0500

Modified: 08/15/2009 15:27 GMT-0500

Beyond Applications - Introducing the Planetary Datasphere (Part 2)

We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.

Application Hosting

For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.

Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.

What of Proprietary Data and its Security?

Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one's own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes.

What of Individual Privacy on the Open Web?

The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user's department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.

I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer's capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.

Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.

It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner's social network. Defining such policies is typically a task for the application developer.

The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.

The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one's disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.

Federation vs. Centralization

On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this.

A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.

The cost for hosting a trillion triples will vary radically in function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies in function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data.

Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world's spectrum of query needs.

What is the Cost of Schema-Last?

The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.

We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.

While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.

Historically, our industry has been driven by two phenomena:

New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.
Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.

To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.

Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It's in the air.

DataSphere Precursors

Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.

We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.

We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.

Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.

Databases and Servers

Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere.

This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards.

The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSD's get a bit cheaper, one could have SSD for the database and disk for backup.

With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.

For operating such systems, scaling-on-demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self administrating since these things are quite complex and easily intractable if one does not have in depth understanding of this rather complex field.

The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. Much cheaper than SAN's and Infiniband fabrics.

Conclusions and Next Steps

The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.

Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.

The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed in Semantic Web conferences. These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic, they are hardcore database, concerning automatic provisioning and such matters.

It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is &mdash go for it. Stay tuned for more announcements, as we near production with our next generation of software.

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

03/25/2009 10:50 GMT-0500

Modified: 03/25/2009 12:31 GMT-0500

Beyond Applications - Introducing the Planetary Datasphere (Part 1)

This is the first in a short series of blog posts about what becomes possible when essentially unlimited linked data can be deployed on the open web and private intranets.

The term DataSphere comes from Dan Simmons' Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.

This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.

I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person's or organization's contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it.

General

There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.

These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.

The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.

In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.

The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network — horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.

The place of the DBMS in the stack changes

The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.

For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can't just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google's web page cache can't be relied on as a permanent repository of web content.

Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.

End User Benefits

For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user's web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user's Data Space is the collection of all these things, now presented in a queriable form. The user's Data Space is the user's statement of presence, referencing the diverse contributions of the user on diverse sites.

The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user's interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes.

Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of "meta-applications," i.e., unified interfaces for managing many related applications/tasks.

Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.

Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of "mashing up" web service interfaces with procedural code, but of "meshing" data through declarative queries across application spaces.

Applications in the DataSphere

The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.

Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.

For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., DBpedia Vs. Freebase identifiers). The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.

The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.

Advertising

The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.

Content Providers, License, and Attribution

For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.

Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.

One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.

Next we will look at the service provider and infrastructure side of this.

Tags: webservices | web2.0 | web20 | rdf | semanticweb | web30 | sparql | socialnetworking | dataspace | .net

bookmark it! submit digg.com

digg it!

reddit!

# PermaLink Comments [0]

03/24/2009 09:38 GMT-0500

Modified: 03/24/2009 10:50 GMT-0500

Linked Data & The Year 2009 (updated)

As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.

Sir Tim said it at WWW08 in Beijing — linked data and the linked data web is the semantic web and the Web done right.

The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.

The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft's Entity Frameworks technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service based business model and should interoperate and integrate with everything else — business needs first; schema last.

Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is none the less important, on the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.

It is against this backdrop that this year will play out.

As concerns research, I will again quote Harry Halpin at ESWC 2008: "Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality" [or words to this effect].

Something like the data web, and even the semantic web, will happen. Harry's question was whether this would be the descendant of what is today called semantic web research.

I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though.

Why should such a thing be RDF-based? If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?

The justification of RDF is in reusing and linking-to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put all inside the RDBMS, lowest level, which you can't do unless you own the RDBMS. Source access is not enough; you also need the knowledge.

Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.

Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction.

In hoc signo vinces.

In this light, we welcome the voiD (VOcabulary of Interlinked Data), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.

For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords.

For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect.

For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.

For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI's Líon project. These will provide platforms for further demonstrating the "web" in data web, as in web-scale smart databasing.

2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one's own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time.

Exciting 2009 to all.