Virtuoso Data Space Bot
Burlington, United States
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
I have in the previous posts generally argued for and demonstrated the usefulness of benchmarks.
Here I will talk about how this could be organized in a way that is tractable and takes vendor and end user interests into account. These are my views on the subject and do not represent a consensus of LOD2 members, but they have been discussed in the consortium.
My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water! But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking. Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating le chef d'oeuvre culinaire ("the culinary masterpiece") create it. Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values. Indeed, an intimate knowledge de la vie secrète du canard ("the secret life of the duck") is required in order to liberate the aroma that it might take flight and soar. In the previous posts, I have shed some light on how we prepare le canard, and if le canard be such then la dinde (turkey) might in some ways be analogous; who is to say?
In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice. In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained. This is the TPC (Transaction Processing Performance Council) model.
Another culture of doing benchmarks is the periodic challenge model used in TREC, the Billion Triples Challenge, the Semantic Search Challenge and others. In this model, vendors prepare the benchmark submission and agree to joint publication.
A third party performing benchmarks by itself is uncommon in databases. Licenses even often explicitly prohibit this, for understandable reasons.
The LOD2 project has an outreach activity called Publink where we offer to help owners of data to publish it as Linked Data. Similarly, since FP7 projects are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing RDF store benchmarks.
One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results. The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison.
Isn't this the very truth? Let the chefs mix their own spices.
This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import.
In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question. Increasing the scale remains a stated objective. LOD2 even promised to run things with a trillion triples in another 3 years.
Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off? Or would this on the contrary combine strict Justice with edifying Charity? Who indeed is in greater need of the light of objective evaluation than the vendor, whose very nature makes him a being of bias and prejudice?
Even better, CWI, with its stellar database pedigree, agreed in principle to audit RDF benchmarks in LOD2.
In this way one could get a stamp of approval for one's results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs. On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here. I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes.
We could even do this unilaterally -- just publish Virtuoso results according to a predefined reporting and verification format. If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings. This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble. It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason.
Then there is the matter of the BSBM Business Intelligence (BI) mix. At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer. This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions. Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around. The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well. There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it. If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner.
(I will talk about the BI mix in more detail in part 13 and part 14 of this series.)
Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit.
Of course, this could be done even before then, but the content of the mix might not be settled. We likely need to check it on a few implementations first.
For equipment, people can use their own, or LOD2 partners might on a case-by-case basis make some equipment available for running on the same hardware on which, say, the Virtuoso results were obtained. For example, FU Berlin could give people a login to get their recently published results fixed. Now this might or might not happen, so I will not hold my breath waiting for this but will instead close with a proposal.
As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment. They can set up and tune their systems, and perform the runs. We will just watch. As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data. Like this, both parties get to see the others' technology with proper tuning and installation. What, if anything, is reported about this activity is up to the owner of the technology being tested. We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these. This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user. If you wish to take advantage of this offer, you may contact Hugh Williams at OpenLink Software, and we will see how this can be arranged in practice.
The next post will talk about the actual content of benchmarks. The milestone after this will be when we publish the measurement and reporting protocols.
Benchmarks, Redux Series
- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process (this post)
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
03/10/2011 18:29 GMT-0500 | Modified: 03/14/2011 19:37 GMT-0500
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
We will here analyze what the BSBM Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales.
We will here look at database-running statistics for BSBM at different scales. Finally, we look at CPU profiles.
But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have:
SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;
KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                        114105938  3302150         2        0    3171275
DB.DBA.RDF_QUAD    RDF_QUAD                       977426773  2041156         0        0    1970712
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        8250414   509239         6       15     491631
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                 3677233812   183860         0        0     175386
DB.DBA.RDF_IRI     RDF_IRI                               32    99710    302151        5      95353
DB.DBA.RDF_QUAD    RDF_QUAD_OP                        30597    51593       168        0      48941
DB.DBA.RDF_QUAD    RDF_QUAD_SP                       265474    47210        17        0      46078
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        6020      212         3        0        212
DB.DBA.RDF_PREFIX  RDF_PREFIX                             0      167     16700        0        157
The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table data structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the PSOG index for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. The index from object value to subject is used the most but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison.
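The READ_PCT column is not explained above. As a minimal sketch, assuming it is simply disk reads per 100 touches, truncated to an integer and with a zero touch count treated as 1, the published values can be reproduced from the TOUCHES and READS columns. The function name `read_pct` and the formula are assumptions inferred from the listing, not taken from Virtuoso documentation. The RDF_IRI row is left out because its touch count is described later in this post as not properly recorded.

```python
# Rows copied from the first sys_d_stat listing: (index name, touches, reads).
ROWS = [
    ("RDF_OBJ", 114105938, 3302150),
    ("RDF_QUAD", 977426773, 2041156),
    ("DB_DBA_RDF_IRI_UNQC_RI_ID", 8250414, 509239),
    ("RDF_QUAD_POGS", 3677233812, 183860),
    ("RDF_QUAD_OP", 30597, 51593),
    ("RDF_QUAD_SP", 265474, 47210),
    ("DB_DBA_RDF_PREFIX_UNQC_RP_ID", 6020, 212),
    ("RDF_PREFIX", 0, 167),
]


def read_pct(touches, reads):
    """Disk reads per 100 touches, truncated to an integer (assumed formula)."""
    # Assumption: a zero touch count is treated as 1 to avoid dividing by zero.
    return reads * 100 // max(touches, 1)


for name, touches, reads in ROWS:
    print(f"{name:28s} {read_pct(touches, reads):>6d}")
```

A value above 100, as for RDF_QUAD_OP, means more pages were read than rows were found, which fits an index that is touched rarely and so seldom finds its pages in cache.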
Now let us reset the counts and see what the steady state I/O profile is.
SELECT key_stat (key_table, name_part (key_name, 2), 'reset') FROM sys_keys WHERE key_migrate_to IS NULL;
SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;
KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                         30155789    79659         0        0    3191391
DB.DBA.RDF_QUAD    RDF_QUAD                       259008064     8904         0        0    1948707
DB.DBA.RDF_QUAD    RDF_QUAD_SP                        68002     7730        11        0      53360
DB.DBA.RDF_IRI     RDF_IRI                               12     5415     41653        6      98804
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                  975147136     1597         0        0     173459
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        2213525     1286         0       17     485093
DB.DBA.RDF_QUAD    RDF_QUAD_OP                         7999      904        11        0      48568
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        1494        1         0        0        213
Literal strings dominate. The SP index is used only for situations where the P is not specified, i.e., the DESCRIBE query. Based on this, I/O seems to be attributable mostly to this. The first RDF_IRI represents translations from string to IRI id; the second represents translations from IRI id to string. The touch count for the first RDF_IRI is not properly recorded, hence the miss % is out of line. We see SP missing the cache the most since its use is infrequent in the mix.
We will next look at query processing statistics. For this we introduce a new meter.
The db_activity SQL function provides a session-by-session cumulative statistic of activity. The fields are:
- rnd: Count of random index lookups. Each first row of a select or insert counts as one, regardless of whether something was found.
- seq: Count of sequential rows. Every move to next row on a cursor counts as 1, regardless of whether conditions match.
- same seg: For column store only; counts how many times the next row in a vectored join using an index falls in the same segment as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection.
- same pg: Counts how many times a vectored index join finds the next match on the same page as the previous one.
- same par: Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the same parent.
- disk: Counts how many disk reads were made, including any speculative reads initiated.
- spec disk: Counts speculative disk reads.
- messages: Counts cluster interconnect messages.
- B (KB, MB, GB): The total length of the cluster interconnect messages.
- fork: Counts how many times a thread was forked (started) for query parallelization.
The numbers are given with 4 significant digits and a scale suffix. G is 10^9 (1,000,000,000); M is 10^6 (1,000,000), K is 10^3 (1,000).
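To compare readings numerically, the scale suffixes have to be expanded. The following is a minimal sketch of such a conversion; `parse_scaled` is a hypothetical helper written for this post, not part of Virtuoso, and it handles only the K, M, and G suffixes described above.

```python
# Decimal scale suffixes as described in the text above.
SCALE = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_scaled(text):
    """Convert a reading such as '1.674G' or '76K' to a float."""
    text = text.strip()
    if text and text[-1] in SCALE:
        return float(text[:-1]) * SCALE[text[-1]]
    # No suffix: the reading is already a plain number, e.g. '0'.
    return float(text)


for reading in ("1.674G", "314.8M", "298.6K", "0"):
    print(reading, "->", parse_scaled(reading))
```

Since the readings carry only 4 significant digits, any ratio computed from them is approximate to about the same precision.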
We run 2000 query mixes with 16 Users. The special http account keeps a cumulative account of all activity on web server threads.
SELECT db_activity (2, 'http');
1.674G rnd 3.223G seq 0 same seg 1.286G same pg 314.8M same par 6.186M disk 6.461M spec disk 0B / 0 messages 298.6K fork
We see that random access dominates. The seq number is about twice the rnd number, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, the same seg is 0; the same pg indicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one.
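The ratios quoted in the previous paragraph can be checked directly from the reading. This is a small sketch using the counter values from the db_activity output above, with the scale suffixes expanded by hand; the variable names are chosen for illustration.

```python
# Counter values from the 1000 Mt db_activity reading, suffixes expanded.
rnd = 1.674e9       # random index lookups
seq = 3.223e9       # sequential rows
same_pg = 1.286e9   # next match on the same page
same_par = 314.8e6  # next match under the same parent

rows_per_lookup = seq / rnd
same_page_share = same_pg / rnd
same_parent_share = same_par / rnd

print(f"sequential rows per random lookup: {rows_per_lookup:.2f}")
print(f"random lookups landing on the same page: {same_page_share:.0%}")
print(f"random lookups landing under the same parent: {same_parent_share:.0%}")
```

Same page and same parent together account for roughly 96% of the random lookups, so only a small remainder requires a descent from higher up in the tree.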
There are more speculative reads than disk reads, which is an artifact of counting some concurrently speculated reads twice. This does indicate that speculative reads dominate. This is because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes.
Now let us look at the same reading after 2000 mixes, 16 users, at 100 Mt.
234.3M rnd 420.5M seq 0 same seg 188.8M same pg 29.09M same par 808.9K disk 919.9K spec disk 0B / 0 messages 76K fork
We note that the ratios between the random and sequential and same page/parent counts are about the same. The sequential number looks to be even a bit smaller in proportion. The count of random accesses for the 100Mt run is 14% of the count for the 1000Mt run. The count of query parallelization threads is also much lower since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that according to the cost model guess, the thread must have at least 5ms worth of work.
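As a rough check of the comparison above, the two readings can be normalized by their random lookup counts. This is a sketch over the figures copied from the two db_activity outputs; the dictionary keys and the `per_lookup` helper are names chosen for illustration.

```python
# Readings copied from the two db_activity outputs (2000 mixes, 16 users each).
RUN_1000MT = {"rnd": 1.674e9, "seq": 3.223e9, "same_pg": 1.286e9,
              "same_par": 314.8e6, "fork": 298.6e3}
RUN_100MT = {"rnd": 234.3e6, "seq": 420.5e6, "same_pg": 188.8e6,
             "same_par": 29.09e6, "fork": 76e3}


def per_lookup(run):
    """Express each counter relative to the count of random lookups."""
    return {name: value / run["rnd"] for name, value in run.items() if name != "rnd"}


small, large = per_lookup(RUN_100MT), per_lookup(RUN_1000MT)
for name in small:
    print(f"{name:9s} 100 Mt: {small[name]:.4f}   1000 Mt: {large[name]:.4f}")

rnd_share = RUN_100MT["rnd"] / RUN_1000MT["rnd"]
print(f"100 Mt random lookups as a share of 1000 Mt: {rnd_share:.0%}")
```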
We note that the 100 Mt throughput is a little over three times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven times faster instead, given that it does this much less work.
We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor per se. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there.
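The 27 vs 30 figure is the base-2 logarithm of the row count, which is the number of comparisons a binary search over that many sorted keys needs. A minimal sketch, assuming an idealized binary search rather than the actual page layout of the index:

```python
import math


def comparisons(rows):
    """Comparisons needed by an idealized binary search over `rows` sorted keys."""
    return math.ceil(math.log2(rows))


for rows in (100_000_000, 1_000_000_000):
    print(f"{rows:>13,d} rows -> about {comparisons(rows)} comparisons")
```

Ten times the rows adds only about 3.3 comparisons per lookup, which is why tree depth alone cannot explain the throughput difference.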
To elucidate this last question, we look at the CPU profiles. We take an oprofile of 100 Single User mixes at both scales.
For 100 Mt:
 61161  10.1723  cmpf_iri64n_iri64n_anyn_gt_lt
 31321   5.2093  box_equal
 19027   3.1646  sqlo_parse_tree_has_node
 15905   2.6453  dk_alloc
 15647   2.6024  itc_next_set_neq
 12702   2.1126  itc_vec_split_search
 12487   2.0768  itc_dive_transit
 11450   1.9044  itc_bm_vec_row_check
 10646   1.7706  itc_page_rcf_search
  9223   1.5340  id_hash_get
  9215   1.5326  gen_qsort
  8867   1.4748  sqlo_key_part_best
  8807   1.4648  itc_param_cmp
  8062   1.3409  cmpf_iri64n_iri64n
  6820   1.1343  sqlo_in_list
  6005   0.9987  dc_iri_id_cmp
  5905   0.9821  dk_free_tree
  5801   0.9648  box_hash
  5509   0.9163  dks_esc_write
  5444   0.9054  sql_tree_hash_1
For 1000 Mt:
754331  31.4149  cmpf_iri64n_iri64n_anyn_gt_lt
146165   6.0872  itc_vec_split_search
144795   6.0301  itc_next_set_neq
131671   5.4836  itc_dive_transit
110870   4.6173  itc_page_rcf_search
 66780   2.7811  gen_qsort
 66434   2.7667  itc_param_cmp
 58450   2.4342  itc_bm_vec_row_check
 55213   2.2994  dk_alloc
 47793   1.9904  cmpf_iri64n_iri64n
 44277   1.8440  dc_iri_id_cmp
 39489   1.6446  cmpf_int64n
 36880   1.5359  dc_append_bytes
 36601   1.5243  dv_compare
 31286   1.3029  dc_any_value_prefetch
 25457   1.0602  itc_next_set
 20852   0.8684  box_equal
 19895   0.8285  dk_free_tree
 19698   0.8203  itc_page_insert_search
 19367   0.8066  dc_copy
The top function in both is the compare for an equality of two leading IRIs and a range for the trailing any. This corresponds to the range check in Q5. At the larger scale this is three times more important. At the smaller scale, the share of query optimization is about 6.5 times greater. The top function in this category is box_equal with 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile.
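The per-function comparison can be sketched from the percentages in the two listings. Only two functions are picked out here; the dictionaries hold the percentage column copied from the profiles, and `share_ratio` is a name chosen for illustration.

```python
# Percentage of samples per function, copied from the two profiles above.
PROFILE_100MT = {"cmpf_iri64n_iri64n_anyn_gt_lt": 10.1723, "box_equal": 5.2093}
PROFILE_1000MT = {"cmpf_iri64n_iri64n_anyn_gt_lt": 31.4149, "box_equal": 0.8684}


def share_ratio(function, numerator, denominator):
    """How many times larger a function's share is in one profile than in the other."""
    return numerator[function] / denominator[function]


range_check = share_ratio("cmpf_iri64n_iri64n_anyn_gt_lt", PROFILE_1000MT, PROFILE_100MT)
compiler = share_ratio("box_equal", PROFILE_100MT, PROFILE_1000MT)

print(f"range compare share, 1000 Mt vs 100 Mt: {range_check:.1f}x")
print(f"box_equal share, 100 Mt vs 1000 Mt: {compiler:.1f}x")
```

Note that box_equal by itself comes out near 6; the figure for query optimization as a whole quoted above covers the full set of compiler functions, not all of which appear in the 1000 Mt top 20.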
From this sample it appears that ten times the scale means about seven times as many database operations per query mix. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. From these numbers, we could say that Virtuoso is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt.
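The "about two times" estimate follows from dividing the work ratio by the throughput ratio. The sketch below uses the random lookup counts from the db_activity readings as a proxy for database operations; the throughput ratio of 3.0 is an assumption standing in for "a little over three times", since the exact figure is not restated in this post.

```python
# Random lookup counts from the two db_activity readings (2000 mixes each).
rnd_1000mt = 1.674e9
rnd_100mt = 234.3e6

# Assumption: the 100 Mt run completes query mixes about 3 times faster.
ASSUMED_THROUGHPUT_RATIO = 3.0

# Work per query mix grows by this factor from 100 Mt to 1000 Mt.
work_ratio = rnd_1000mt / rnd_100mt

# Operations per unit of time at 1000 Mt relative to 100 Mt.
efficiency_ratio = work_ratio / ASSUMED_THROUGHPUT_RATIO

print(f"work ratio: {work_ratio:.2f}")
print(f"database operation throughput, 1000 Mt vs 100 Mt: {efficiency_ratio:.2f}x")
```

A slightly higher throughput ratio brings the result closer to 2, in line with the estimate in the text.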
We may conclude that different BSBM scales measure different things. The TPC workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales.
This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix.
Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks.
03/07/2011 18:39 GMT-0500 | Modified: 03/14/2011 17:57 GMT-0500
Short Recap of Virtuoso Basics (#3 of 5)
(Third of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.
Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is have all the RDF specifics, like IRIs and typed literals, as native SQL types, and have a cost-based optimizer that knows about all of this.
If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application-specific data structures as RDF applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can federate tables from heterogeneous DBMS.
On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like.
Installation is simple, just one exe and one config file. There is a huge amount of code in installers — application code and test suites and such — but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.
Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL.
To condense further:
- Scalable Delivery of Linked Data
- SPARQL and SQL
- Arbitrary RDF Data + Relational
- Also From 3rd Party RDBMS
- Easy Deployment
- Standard Interfaces
04/30/2009 11:49 GMT-0500 | Modified: 04/30/2009 12:11 GMT-0500
Search at WWW 2009 (#2 of 5)
(Second of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
There was a workshop on semantic search plus a number of papers and of course keynotes from Google and Yahoo.
A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft) or should they be made more generally available? This is a privacy question. Use of query logs and click through of search results for improved ranking was mentioned many times throughout the conference.
The semantic search workshop was largely about benchmarks for keyword search in information retrieval. For linked data, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. But there is a divide in approach.
Giovanni Tummarello presented Sig.ma, a service using Sindice's RDF index for collecting all RDF statements about entities matching some set of keywords. One could then choose which sources and which entities were the right ones. One could further store such a query and embed it on a page. The point was that the filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Further speculating, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources into our browser too. Fresnel lenses are another thing to look at.
There was a paper by Josep M. Pujol and Pablo Rodriguez, of Telefonica Research, about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they were retrieved. For users with similar interests, this can give a community based ranking model but has issues of privacy. Another point was that with local processing and personal scale data volumes various kinds of brute force processing were feasible that would cost a lot for the web scale. Much can be done web scale but it must be done cleverly, not with a shell script and not so ad hoc.
As a counterpoint to this, there was a talk about Hadoop and Hive, a map-reduce-based SQL-like framework. One could do an SQL GROUP BY on text files with record parsing at run time, all spread over a Hadoop cluster. The issue is, if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing becomes important. This cannot be done without indices and complex query optimization, and needs a DBMS. Stonebraker and company are fully justified in their critique of map reduce. It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.
Some of our future plans were confirmed by what we saw, for example as concerns:
- Interactively selecting sources for search, showing the graphs, then interactively refining
- More social networks, more network analysis, and more work on social recommendation
- Real time indexing of new pings, filling the store by forwarding queries to search engines, and harvesting micro-formats from results
- Using entity extraction
These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts, we will partner with others; details will be worked out while we work on the items we implement by ourselves.
04/30/2009 11:18 GMT-0500 | Modified: 04/30/2009 12:51 GMT-0500
Linked Data at WWW 2009 (#1 of 5)
(First of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.
The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF.
To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.
People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.
Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.
We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code and special parallel programming models can now be done in SQL and SPARQL.
To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider.
Entry costs into relatively high end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today's retail prices and fits under a desk. For intermittent use, the rent for 1TB RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. Like Giovanni Tummarello put it for airplanes, the physics are as they were for da Vinci but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.
A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.
I answered as follows, which apparently cannot be repeated too much:
First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo, for databases, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogeneous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent.
Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random accesses per second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.
There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated "on the fly" as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0's built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: Use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entire LOD Cloud. Virtuoso 6.0 (Open Source Edition) "TP1" is now publicly available as a Technology Preview (beta).
We heard that there is an effort for porting Freebase's Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.
As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is "virtualized" into the database cloud or the local secure server, as the use case may require.
For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG.
Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.
For future work, there is nothing radically new. We continue testing and productizing cluster databases; the task is simply to deliver what is in the pipeline. In essence, this means adding more and more cases of better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third-party viewers operate with our platform. We cannot leave data browsing and UI entirely to third parties, since we must from time to time introduce unique functionality of our own. Most interaction should, however, go via third-party applications.
|
04/27/2009 17:28 GMT-0500
|
Modified:
04/28/2009 11:27 GMT-0500
|
Beyond Applications - Introducing the Planetary Datasphere (Part 1)
This is the first in a short series of blog posts about what becomes possible when essentially unlimited linked data can be deployed on the open web and private intranets.
The term DataSphere comes from Dan Simmons' Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.
This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.
I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person's or organization's contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it.
General
There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.
These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.
The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.
In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.
The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network — horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.
The place of the DBMS in the stack changes
The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.
For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can't just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google's web page cache can't be relied on as a permanent repository of web content.
Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.
End User Benefits
For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user's web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user's Data Space is the collection of all these things, now presented in a queriable form. The user's Data Space is the user's statement of presence, referencing the diverse contributions of the user on diverse sites.
The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user's interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes.
Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of "meta-applications," i.e., unified interfaces for managing many related applications/tasks.
Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.
Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of "mashing up" web service interfaces with procedural code, but of "meshing" data through declarative queries across application spaces.
Applications in the DataSphere
The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.
Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.
For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, given that pretty much anything will in practice have many names (e.g., DBpedia vs. Freebase identifiers)? The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.
The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.
Advertising
The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.
Content Providers, License, and Attribution
For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.
Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.
One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.
Next we will look at the service provider and infrastructure side of this.
|
03/24/2009 09:38 GMT-0500
|
Modified:
03/24/2009 10:50 GMT-0500
|
Linked Data & The Year 2009 (updated)
As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.
Sir Tim said it at WWW08 in Beijing — linked data and the linked data web is the semantic web and the Web done right.
The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.
The benefits of a data model more abstract than the relational are increasingly appreciated outside data web circles as well. Microsoft's Entity Framework technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service-based business model and should interoperate and integrate with everything else: business needs first; schema last.
Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is no less important for that; quite the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.
It is against this backdrop that this year will play out.
As concerns research, I will again quote Harry Halpin at ESWC 2008: "Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality" [or words to this effect].
Something like the data web, and even the semantic web, will happen. Harry's question was whether this would be the descendant of what is today called semantic web research.
I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though.
Why should such a thing be RDF-based? If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?
The justification of RDF is in reusing and linking to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put everything inside the RDBMS, at the lowest level, which you can't do unless you own the RDBMS. Source access is not enough; you also need the knowledge.
Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.
Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction.
In hoc signo vinces.
In this light, we welcome voiD (Vocabulary of Interlinked Datasets), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.
For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords.
For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect.
For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.
For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, and we are making Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There is still some new technology to be developed this year; first among these items are strong SPARQL federation and on-the-fly resizing of server clusters. On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI's Líon project. These will provide platforms for further demonstrating the "web" in data web, as in web-scale smart databasing.
2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one's own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time.
Exciting 2009 to all.
|
01/02/2009 16:17 GMT-0500
|
Modified:
01/02/2009 13:26 GMT-0500
|
Virtuoso RDF: A Getting Started Guide for the Developer
It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.
The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.
You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (M-x shell) works best, since it makes it convenient to cut and paste between the shell and files, but the open source isql utility should also have command line editing.
To get started, cd into binsrc/tests/lubm.
To verify that this works, you can do
./test_server.sh virtuoso-t
This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.
Loading the Data
The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.
The data files themselves are in lubm_8000, 15 files in RDF/XML.
There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.
So now let's go through this procedure.
Start the server:
$ virtuoso-t -f &
This starts the server in foreground mode, and puts it in the background of the shell.
Now we connect to it with the isql utility.
$ isql 1111 dba dba
This gives a SQL> prompt. The default username and password are both dba.
When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.
Next it clears the <lubm> and <inf> graphs and calls this procedure for the lubm_8000 directory under the server's working directory.
sparql
CLEAR GRAPH <lubm>;
sparql
CLEAR GRAPH <inf>;
load_lubm ( server_root() || '/lubm_8000/' );
Then it verifies that the right number of triples is found in the <lubm> graph.
sparql
SELECT COUNT(*)
FROM <lubm>
WHERE { ?x ?y ?z } ;
The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.
Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.
sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
INSERT
INTO GRAPH <lubm>
{ ?x ub:subOrganizationOf ?z }
FROM <lubm>
WHERE { ?x ub:subOrganizationOf ?y .
?y ub:subOrganizationOf ?z .
};
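To make the effect of this INSERT concrete, here is a minimal Python sketch, not part of the LUBM package, that applies the same pattern to a small in-memory set of pairs. The organization names and the function name are made up for illustration. Note that the statement composes the relation with itself once, against the data as it stood before the insert; it is not a full transitive closure, although one pass suffices when no chain is longer than two links.

```python
def add_implied_sub_organizations(pairs):
    """Return pairs plus every (x, z) such that (x, y) and (y, z) are in pairs.

    Mirrors the pattern { ?x p ?y . ?y p ?z } => { ?x p ?z }, evaluated once
    against the original data, as a single INSERT ... WHERE would be.
    """
    parents = {}
    for child, parent in pairs:
        parents.setdefault(child, set()).add(parent)
    implied = {
        (x, z)
        for x, y in pairs
        for z in parents.get(y, ())
    }
    return set(pairs) | implied


# Hypothetical sample data: group -> department -> university.
sample = {
    ("Group0.Dept0.Univ0", "Dept0.Univ0"),
    ("Dept0.Univ0", "Univ0"),
}
result = add_implied_sub_organizations(sample)
print(sorted(result))
```

The two original pairs are kept and one implied pair, from the group directly to the university, is added.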
Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the base URI against which relative IRIs are resolved, and the URI of the target graph.
DB.DBA.TTLP ( file_to_string ( 'inf.nt' ),
'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl',
'inf'
) ;
sparql
SELECT COUNT(*)
FROM <inf>
WHERE { ?x ?y ?z } ;
Then we declare that the triples in the <inf> graph can be used for inference at run time. The declaration by itself has no effect on queries; to use it, a SPARQL query must declare that it uses the 'inft' rule set.
rdfs_rule_set ('inft', 'inf');
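As a rough illustration of what run-time inference means here, the Python sketch below answers a type query by consulting a separate set of subclass statements at query time, leaving the instance data untouched. The class names, instance names, and helper functions are hypothetical, and this is only a model of the idea, not of how Virtuoso implements it.

```python
def subclasses_of(cls, subclass_of):
    """All classes that are cls or a direct or indirect subclass of it."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in subclass_of:
            if sup == current and sub not in found:
                found.add(sub)
                frontier.append(sub)
    return found


def instances_of(cls, type_triples, subclass_of=()):
    """Instances typed as cls; with subclass_of given, also via subclasses."""
    wanted = subclasses_of(cls, subclass_of)
    return {s for s, c in type_triples if c in wanted}


# Hypothetical data: the ontology lives apart from the instance data.
ontology = {("AssistantProfessor", "Professor"), ("Professor", "Faculty")}
data = {("prof0", "AssistantProfessor"), ("stud0", "GraduateStudent")}

print(instances_of("Faculty", data))            # no rule set: nothing matches
print(instances_of("Faculty", data, ontology))  # with inference: prof0
```

The instance data is the same in both calls; only the query decides whether the ontology is consulted, which is the sense in which the rule set "has no effect" unless a query asks for it.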
Finally, a checkpoint finalizes the work and truncates the transaction log. The server would also eventually do this in its own time.
checkpoint;
Now we are ready for querying.
Querying the Data
The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.
If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.
For example:
SQL> sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT *
FROM <lubm>
WHERE { ?x a ub:Publication .
?x ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>
};
VARCHAR
_______________________________________________________________________
http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5
6 Rows. -- 4 msec.
To stop the server, simply type shutdown; at the SQL> prompt.
If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like the following to the end of the virtuoso.ini file in the lubm directory:
[HTTPServer]
ServerPort = 8421
ServerRoot = .
ServerThreads = 2
Then shut down and restart the server (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).
Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
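For instance, the SPARQL protocol passes the query text in a query parameter of the request. The Python sketch below builds such a URL for the end point on port 8421, using only the standard library. The helper name is made up, and the commented-out fetch assumes the server is running with the HTTP listener enabled.

```python
from urllib.parse import urlencode

ENDPOINT = "http://localhost:8421/sparql"  # port from the [HTTPServer] stanza


def sparql_get_url(query, endpoint=ENDPOINT):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


query = "SELECT COUNT(*) FROM <lubm> WHERE { ?x ?y ?z }"
url = sparql_get_url(query)
print(url)

# With the server running, the result could be fetched like this:
# from urllib.request import urlopen
# print(urlopen(url).read().decode("utf-8"))
```

The same URL can be pasted into a browser; urlencode takes care of escaping the angle brackets, braces, and question marks in the query text.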
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.
To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF.
To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.
|
12/17/2008 12:31 GMT-0500
|
Modified:
12/17/2008 12:41 GMT-0500
|
See the Lite: Embeddable/Background Virtuoso starts at 25MB
We have received many requests for an embeddable-scale Virtuoso. In response to this, we have added a Lite mode, where the initial size of a server process is a tiny fraction of what the initial size would be with default settings. With 2MB of disk cache buffers (ini file setting, NumberOfBuffers = 256), the process size stays under 30MB on 32-bit Linux.
The value of this is that one can now have RDF and full text indexing on the desktop without running a Java VM or any other memory-intensive software. And of course, all of SQL (transactions, stored procedures, etc.) is in the same embeddably-sized container.
The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file. The executable size is about 10MB for 32-bit Linux. A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting "off"; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc.
Lite mode offers full SQL and SPARQL/SPARUL (via SPASQL), but disables all HTTP-based services (WebDAV, application hosting, etc.). Clients can still use all typical database access mechanisms (i.e., ODBC, JDBC, OLE-DB, ADO.NET, and XMLA) to connect, including the Jena and Sesame frameworks for RDF. ODBC now offers full support of RDF data types for C-based clients. A Redland-compatible API also exists, for use with Redland v1.0.8 and later.
Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost.
Shipping an embedded Virtuoso is easy. It just takes one executable and one configuration file. Performance is generally comparable to "normal" mode, except that Lite will be somewhat less scalable on multicore systems.
The Lite mode will be included in the next Virtuoso 5 Open Source release.
|
12/17/2008 09:34 GMT-0500
|
Modified:
12/17/2008 12:03 GMT-0500
|
"E Pluribus Unum", or "Inversely Functional Identity", or "Smooshing Without the Stickiness" (re-updated)
What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable.
This is complex, so I will begin with the point, and the interested may read on for the details and implications. Starting with the soon-to-be-released version 6, Virtuoso allows you to say that two things are the same if they share a uniquely identifying property. Examples of uniquely identifying properties would be a book's ISBN, or a person's social security number plus full name. In relational language this is a unique key, and in RDF parlance, an inverse functional property.
In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut.
In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set.
On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems.
To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as "same enough," but a query about criminal prosecution would obviously need to be more precise about sameness.
Our ontology is defined like this:
-- Populate a named graph with the triples you want to use in query time inferencing
ttlp ( '
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix owl: <http://www.w3.org/2002/07/owl#> .
  foaf:mbox_sha1sum a owl:InverseFunctionalProperty .
  foaf:name a owl:InverseFunctionalProperty .
  ',
  'xx',
  'b3sifp'
);
-- Declare that the graph contains an ontology for use in query time inferencing
rdfs_rule_set ( 'http://example.com/rules/b3sifp#',
'b3sifp'
);
Then use it:
sparql
DEFINE input:inference "http://example.com/rules/b3sifp#"
SELECT DISTINCT ?k ?f1 ?f2
WHERE { ?k foaf:name ?n .
?n bif:contains "'Kjetil Kjernsmo'" .
?k foaf:knows ?f1 .
?f1 foaf:knows ?f2
};
VARCHAR VARCHAR VARCHAR
______________________________________ _______________________________________________ ______________________________
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin
....
Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person's FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same.
Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both.
If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT.
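The difference can be made concrete with a small Python sketch. It is only a model of the semantics described above, not of Virtuoso's implementation; the node names, sample values, and helper functions are invented for the example. Subjects that share a value of a property declared inverse functional are merged into one equivalence class, each class carries the union of its members' properties, and DISTINCT counts classes rather than node identifiers.

```python
from collections import defaultdict


def same_as_classes(triples, ifps):
    """Group subjects sharing a value of any inverse functional property."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    first_seen = {}  # (property, value) -> first subject seen with it
    for s, p, o in triples:
        find(s)
        if p in ifps:
            other = first_seen.setdefault((p, o), s)
            parent[find(s)] = find(other)

    classes = defaultdict(set)
    for node in list(parent):
        classes[find(node)].add(node)
    return list(classes.values())


def merged_properties(members, triples):
    """Union of the (property, value) pairs of all members of one class."""
    return {(p, o) for s, p, o in triples if s in members}


# Hypothetical data: two blank nodes from two FOAF files, plus one other person.
triples = [
    ("_:a", "foaf:name", "Alice"),
    ("_:a", "foaf:knows", "_:c"),
    ("_:b", "foaf:name", "Alice"),
    ("_:b", "foaf:interest", "RDF"),
    ("_:c", "foaf:name", "Carol"),
]
classes = same_as_classes(triples, ifps={"foaf:name"})
print(len(classes))  # counts individuals (2), not node identifiers (3)
```

A naive smoosh would instead copy the properties onto both _:a and _:b and still report three distinct subjects.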
If we have complex application rules for determining whether individuals are the same, then we can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances.
The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others.
The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at.
The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty.
Enough talk. Time for an experiment. We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:
- Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person.
- Give every duplicate URI of a person all the properties of all the duplicates. If these are the same value, the data should not get much bigger, or so we think.
For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This produces some false matches, since distinct people can share a name, but these should not be statistically significant.
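Before running this on the real data, the two strategies can be tried on a handful of invented triples. The following Python sketch is a toy under stated assumptions: the subject IDs and values are made up, every subject is taken to be a foaf:Person with exactly one name, and the smallest ID is picked as the canonical one. It only illustrates why collapsing can shrink the data while copying grows it.

```python
from collections import defaultdict

# Invented sample: (subject, predicate, object).
triples = [
    ("p1", "foaf:name", "Alice"), ("p1", "foaf:knows", "p3"),
    ("p2", "foaf:name", "Alice"), ("p2", "foaf:knows", "p4"),
    ("p3", "foaf:name", "Bob"),
    ("p4", "foaf:name", "Bob"),   ("p4", "foaf:nick", "bobby"),
]

name_of = {s: o for s, p, o in triples if p == "foaf:name"}

# Gather, per name, the properties of all subjects with that name.
by_name = defaultdict(set)
for s, p, o in triples:
    by_name[name_of[s]].add((p, o))

# Pick a canonical subject per name, then map every alias to it.
canon = {}
for s, n in name_of.items():
    canon[n] = min(canon.get(n, s), s)
pref = {s: canon[n] for s, n in name_of.items()}

# Strategy 1: one subject per person, person-to-person references normalized.
collapsed = {(canon[n], p, pref.get(o, o))
             for n, pos in by_name.items() for p, o in pos}

# Strategy 2: every alias gets all properties of all its duplicates.
copied = {(s, p, o) for s, n in name_of.items() for p, o in by_name[n]}

print(len(triples), len(collapsed), len(copied))
```

On this sample the collapsed set is smaller than the input, because the two foaf:knows references become the same triple once normalized, while the copied set is larger.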
The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse.
We count the triples that enter into the smoosh:
-- The name test is an existence check; otherwise each triple would be counted
-- several times, because the same name occurs in many graphs
sparql
SELECT COUNT(*)
WHERE { { SELECT DISTINCT ?person
WHERE { ?person a foaf:Person }
} .
FILTER ( bif:exists ( SELECT (1)
WHERE { ?person foaf:name ?nn }
)
) .
?person ?p ?o
};
-- We get 3284674
We make a few tables for intermediate results.
-- For each distinct name, gather the properties and objects from
-- all subjects with this name
CREATE TABLE name_prop
( np_name ANY,
np_p IRI_ID_8,
np_o ANY,
PRIMARY KEY ( np_name,
np_p,
np_o
)
);
ALTER INDEX name_prop
ON name_prop
PARTITION ( np_name VARCHAR (-1, 0hexffff) );
-- Map from name to canonical IRI used for the name
CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY,
ni_s IRI_ID_8
);
ALTER INDEX name_iri
ON name_iri
PARTITION ( ni_name VARCHAR (-1, 0hexffff) );
-- Map from person IRI to canonical person IRI
CREATE TABLE pref_iri
( i IRI_ID_8,
pref IRI_ID_8,
PRIMARY KEY ( i )
);
ALTER INDEX pref_iri
ON pref_iri
PARTITION ( i INT (0hexffff00) );
-- a table for the materialization where all aliases get all properties of every other
CREATE TABLE smoosh_ct
( s IRI_ID_8,
p IRI_ID_8,
o ANY,
PRIMARY KEY ( s,
p,
o
)
);
ALTER INDEX smoosh_ct
ON smoosh_ct
PARTITION ( s INT (0hexffff00) );
-- disable transaction log and enable row auto-commit. This is necessary, otherwise
-- bulk operations are done transactionally and they will run out of rollback space.
LOG_ENABLE (2);
-- Gather all the properties of all persons with a name under that name.
-- INSERT SOFT means that duplicates are ignored
INSERT SOFT name_prop
SELECT "n", "p", "o"
FROM ( sparql
DEFINE output:valmode "LONG"
SELECT ?n ?p ?o
WHERE { ?x a foaf:Person .
?x foaf:name ?n .
?x ?p ?o
}
) xx ;
-- Now choose for each name the canonical IRI
INSERT INTO name_iri
SELECT np_name,
( SELECT MIN (s)
FROM rdf_quad
WHERE o = np_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name')
) AS mini
FROM name_prop
WHERE np_p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- For each person IRI, map to the canonical IRI of that person
INSERT SOFT pref_iri (i, pref)
SELECT s,
ni_s
FROM name_iri,
rdf_quad
WHERE o = ni_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- Make a graph where all persons have one iri with all the properties of all aliases
-- and where person-to-person refs are canonicalized
INSERT SOFT rdf_quad (g,s,p,o)
SELECT IRI_TO_ID ('psmoosh'),
ni_s,
np_p,
COALESCE ( ( SELECT pref
FROM pref_iri
WHERE i = np_o
),
np_o
)
FROM name_prop,
name_iri
WHERE ni_name = np_name
OPTION ( loop, quietcast ) ;
-- A little explanation: The properties of names are copied into rdf_quad with the name
-- replaced with its canonical IRI. If the object has a canonical IRI, this is used as
-- the object, else the object is unmodified. This is the COALESCE with the sub-query.
-- This takes a little time. To check on the progress, take another connection to the
-- server and do
STATUS ('cluster');
-- It will return something like
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i
-- buffers 549481 253929 d 8 w 0 pfs
-- Now finalize the state; this makes it permanent. Else the work will be lost on server
-- failure, since there was no transaction log
CL_EXEC ('checkpoint');
-- See what we got
sparql
SELECT COUNT (*)
FROM <psmoosh>
WHERE {?s ?p ?o};
-- This is 2253102
-- Now make the copy where all have the properties of all synonyms. This takes so much
-- space we do not insert it as RDF quads, but make a special table for it so that we can
-- run some statistics. This saves time.
INSERT SOFT smoosh_ct (s, p, o)
SELECT s, np_p, np_o
FROM name_prop,
rdf_quad
WHERE o = np_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- as above, INSERT SOFT so as to ignore duplicates
SELECT COUNT (*)
FROM smoosh_ct;
-- This is 167360324
-- Find out where the bloat comes from
SELECT TOP 20 COUNT (*),
ID_TO_IRI (p)
FROM smoosh_ct
GROUP BY p
ORDER BY 1 DESC;
The results are:
54728777 http://www.w3.org/2002/07/owl#sameAs
48543153 http://xmlns.com/foaf/0.1/knows
13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512 http://xmlns.com/foaf/0.1/interest
11415867 http://xmlns.com/foaf/0.1/nick
6683963 http://xmlns.com/foaf/0.1/weblog
6650093 http://xmlns.com/foaf/0.1/depiction
4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629 http://xmlns.com/foaf/0.1/homepage
1776555 http://xmlns.com/foaf/0.1/holdsAccount
1219525 http://xmlns.com/foaf/0.1/based_near
305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965 http://xmlns.com/foaf/0.1/name
155131 http://xmlns.com/foaf/0.1/dateOfBirth
153001 http://xmlns.com/foaf/0.1/img
111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930 http://xmlns.com/foaf/0.1/gender
48517 http://www.w3.org/2004/02/skos/core#subject
45697 http://www.w3.org/2000/01/rdf-schema#label
44860 http://purl.org/vocab/bio/0.1/olb
Now compare with the predicate distribution of the smoosh with identities canonicalized:
sparql
SELECT COUNT (*) ?p
FROM <psmoosh>
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY 1 DESC
LIMIT 20;
Results are:
748311 http://xmlns.com/foaf/0.1/knows
548391 http://xmlns.com/foaf/0.1/interest
140531 http://www.w3.org/2000/01/rdf-schema#seeAlso
105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497 http://xmlns.com/foaf/0.1/name
48099 http://www.w3.org/2004/02/skos/core#subject
45179 http://xmlns.com/foaf/0.1/depiction
40229 http://www.w3.org/2000/01/rdf-schema#comment
38272 http://www.w3.org/2000/01/rdf-schema#label
37378 http://xmlns.com/foaf/0.1/nick
37186 http://dbpedia.org/property/abstract
34003 http://xmlns.com/foaf/0.1/img
26182 http://xmlns.com/foaf/0.1/homepage
23795 http://www.w3.org/2002/07/owl#sameAs
17651 http://xmlns.com/foaf/0.1/mbox_sha1sum
17430 http://xmlns.com/foaf/0.1/dateOfBirth
15586 http://xmlns.com/foaf/0.1/page
12869 http://dbpedia.org/property/reference
12497 http://xmlns.com/foaf/0.1/weblog
12329 http://blogs.yandex.ru/schema/foaf/school
We can drop the owl:sameAs triples from the count, which reduces the bloat somewhat, but the result is still tens of times larger than the canonicalized copy or the initial state.
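To see which predicates contribute most to the difference, the two listings can be divided row by row. This Python sketch only uses counts copied from the results above, for a few predicates that appear in both listings; the prefixed names are shorthand for the full IRIs, and the ratio is a rough indicator, not a measurement of anything beyond these two tables.

```python
# (count when every alias gets everything, count in the canonicalized graph)
counts = {
    "foaf:knows":    (48543153, 748311),
    "rdfs:seeAlso":  (13930234, 140531),
    "foaf:interest": (12268512, 548391),
    "foaf:nick":     (11415867, 37378),
    "rdf:type":      (305522, 105273),
    "foaf:name":     (274965, 78497),
}

ratios = {p: big / small for p, (big, small) in counts.items()}

# Print predicates from the largest ratio to the smallest.
for p, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{p:14s} {r:8.1f}")
```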
Now, when we try using the psmoosh graph, we still get results that differ from those on the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. These exist:
sparql
SELECT COUNT (*)
WHERE { ?s foaf:knows ?thing .
FILTER ( !bif:exists ( SELECT (1)
WHERE { ?thing foaf:name ?nn }
)
)
};
-- 1393940
So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here.
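A small Python sketch on invented triples shows the kind of data involved. The node IDs are hypothetical. The first count mirrors the query above; the second lists triples whose subject has no foaf:name, which a gathering step keyed on names never picks up.

```python
# Invented sample: (subject, predicate, object).
triples = [
    ("p1", "foaf:name", "Alice"),
    ("p1", "foaf:knows", "p2"),
    ("p1", "foaf:knows", "_:anon"),   # target has no foaf:name anywhere
    ("p2", "foaf:name", "Bob"),
    ("_:anon", "foaf:knows", "p2"),   # subject has no foaf:name
]

named = {s for s, p, o in triples if p == "foaf:name"}

# Same shape as the count query: foaf:knows edges whose target has no name.
knows_to_unnamed = [(s, o) for s, p, o in triples
                    if p == "foaf:knows" and o not in named]

# Triples that a name-keyed gathering step skips entirely.
skipped = [t for t in triples if t[0] not in named]

print(len(knows_to_unnamed), len(skipped))
```

In this sample, any path that has to pass through the unnamed node is lost once only named subjects are carried over.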
Finally, we calculate the smoosh blow-up factors. We do not include owl:sameAs triples in the counts, so 2229307 below is 2253102 minus the 23795 owl:sameAs triples in the psmoosh graph.
SELECT (167360324 - 54728777) / 3284674.0;
-- 34.290022997716059
SELECT 2229307 / 3284674.0;
-- 0.678699621332284
So, to get a smoosh that is not really the equivalent of the original, either multiply the original triple count by 34 or 0.68, depending on whether synonyms are collapsed or not.
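For anyone who wants to recheck the arithmetic, the same calculation in Python, using only the counts reported above (the variable names are mine):

```python
original = 3_284_674         # triples entering the smoosh
copied_total = 167_360_324   # rows in smoosh_ct
copied_sameas = 54_728_777   # owl:sameAs rows in smoosh_ct
collapsed_total = 2_253_102  # triples in the psmoosh graph
collapsed_sameas = 23_795    # owl:sameAs triples in the psmoosh graph

# Blow-up factors with owl:sameAs excluded from the numerators.
copied_factor = (copied_total - copied_sameas) / original
collapsed_factor = (collapsed_total - collapsed_sameas) / original

print(f"{copied_factor:.2f} {collapsed_factor:.2f}")
```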
Making the smooshes does not take very long: a few minutes for the small one. Inserting the big one as quads would take longer, maybe a couple of hours; filling the smoosh_ct table took 33 minutes. The measurements were not made with optimal tuning, so the performance numbers just serve to show that smooshing takes time, probably more time than is allowable in an interactive situation, no matter how the process is optimized.
12/16/2008 14:14 GMT-0500
Modified: 12/16/2008 15:01 GMT-0500