A quick look at the SP2B SPARQL Performance Benchmark
I finally got around to running the SP2B SPARQL Performance Benchmark on the current Virtuoso Open Source Edition, v5.0.8.
I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.
I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.
I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.
The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit FROM clause added; the client was the command line Interactive SQL (iSQL).
If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the SPARQL protocol is not practical.
I will say something more about SP2B when I get to have a closer look.
08/27/2008 16:00 GMT | Modified: 08/28/2008 16:54 GMT
Configuring Virtuoso for Benchmarking
Here I summarize what one should know about running benchmarks with Virtuoso.
Physical Memory
For 8G RAM, in the [Parameters] stanza of virtuoso.ini, set —
[Parameters]
...
NumberOfBuffers = 550000
For 16G RAM, double this—
[Parameters]
...
NumberOfBuffers = 1100000
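The two settings above follow a simple proportion. The sketch below scales the 8G figure linearly with installed RAM. Only the 8G and 16G values come from this post; any other size is an extrapolation, and `number_of_buffers` is a hypothetical helper name, not a Virtuoso tool.

```python
# Scale the post's 8G figure linearly with installed RAM.
# Only the 8G and 16G values are given in the post; other sizes are extrapolated.
BUFFERS_PER_8G = 550000


def number_of_buffers(ram_gb):
    """Return a NumberOfBuffers value proportional to RAM size in gigabytes."""
    return int(BUFFERS_PER_8G * ram_gb / 8)


for ram in (8, 16):
    print(f"{ram}G RAM -> NumberOfBuffers = {number_of_buffers(ram)}")
```

Treat the result as a starting point and check it against the memory actually available to the server process.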
Transaction Isolation
For most cases, certainly all RDF cases, Read Committed should be the default transaction isolation. In the [Parameters] stanza of virtuoso.ini, set —
[Parameters]
...
DefaultIsolation = 2
Multiuser Workload
If ODBC, JDBC, or similarly connected client applications are used, there must be more ServerThreads available than there will be client connections. In the [Parameters] stanza of virtuoso.ini, set —
[Parameters]
...
ServerThreads = 100
With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer ServerThreads than there are concurrent clients. The MaxKeepAlives should be the maximum number of expected web clients. This can be more than the ServerThreads count. In the [HTTPServer] stanza of virtuoso.ini, set —
[HTTPServer]
...
ServerThreads = 100
MaxKeepAlives = 1000
KeepAliveTimeout = 10
Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, should not exceed the licensed thread count.
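The thread-count rules above can be checked mechanically before a run. This is a minimal sketch under the rules as stated in this section. The function name, its arguments, and the message texts are all hypothetical, and it does not read virtuoso.ini itself.

```python
def check_thread_settings(param_threads, http_threads, db_clients,
                          licensed_threads=None):
    """Return a list of problems found in a thread configuration.

    param_threads    -- [Parameters] ServerThreads
    http_threads     -- [HTTPServer] ServerThreads
    db_clients       -- expected ODBC/JDBC (or similar) client connections
    licensed_threads -- licensed thread count, if a commercial license applies
    """
    problems = []
    if param_threads <= db_clients:
        problems.append("[Parameters] ServerThreads must exceed the number "
                        "of ODBC/JDBC client connections")
    if param_threads < http_threads:
        problems.append("[Parameters] ServerThreads should be at least "
                        "[HTTPServer] ServerThreads")
    if licensed_threads is not None and param_threads > licensed_threads:
        problems.append("[Parameters] ServerThreads exceeds the licensed "
                        "thread count")
    return problems


# The values used in this section, with an assumed 50 database clients.
print(check_thread_settings(param_threads=100, http_threads=100, db_clients=50))
```

MaxKeepAlives is deliberately not checked against ServerThreads, since the text allows it to be larger.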
Disk Use
The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed).
For the above described example, in the [Database] stanza of virtuoso.ini, set —
[Database]
...
Striping = 1
MaxCheckpointRemap = 2000000
— and in the [Striping] stanza, on one line per SegmentName, set —
[Striping]
...
Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6
As can be seen here, each file gets a background IO thread (the = qxxx clause). It should be noted that all files on the same physical device should have the same qxxx value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.
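Writing a long segment line by hand is error-prone, so it can help to generate it. The sketch below rebuilds the example line from a list of file paths, giving each file its own IO queue. It assumes one file per physical device, as in the example; files sharing a device would need the same queue name, which this hypothetical helper does not handle.

```python
def segment_line(name, pages, files):
    """Build one [Striping] line, assigning queues q1, q2, ... in file order."""
    parts = [str(pages)]
    for i, path in enumerate(files, start=1):
        parts.append(f"{path} = q{i}")
    return f"{name} = " + " , ".join(parts)


# The six stripe files from the example above, one per physical disk.
files = [
    "/virtdev/db/virt-seg1.db",
    "/data1/db/virt-seg1-str2.db",
    "/data2/db/virt-seg1-str3.db",
    "/data3/db/virt-seg1-str4.db",
    "/data4/db/virt-seg1-str5.db",
    "/data5/db/virt-seg1-str6.db",
]

print(segment_line("Segment1", 60000, files))
```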
SQL Optimization
If queries have lots of joins but access little data, as with the Berlin SPARQL Benchmark, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. Thus, in the [Parameters] stanza of virtuoso.ini, set —
[Parameters]
...
StopCompilerWhenXOverRunTime = 1
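The idea behind this cut-off can be illustrated with a toy plan search. This is only a sketch of the rule as described here, not Virtuoso's compiler. The candidate plans, their costs in milliseconds, and the function name are invented for illustration.

```python
def choose_plan(candidates, stop_when_over_run_time=True):
    """Pick a plan from (name, estimated_run_ms, compile_cost_ms) tuples.

    Candidates are examined in order; compile_cost_ms is the time spent
    producing that candidate. With the cut-off on, the search stops once the
    best plan so far would run faster than the time already spent compiling.
    """
    if not candidates:
        raise ValueError("need at least one candidate plan")
    best_name, best_run_ms = None, None
    compile_spent_ms = 0
    for name, run_ms, compile_ms in candidates:
        compile_spent_ms += compile_ms
        if best_run_ms is None or run_ms < best_run_ms:
            best_name, best_run_ms = name, run_ms
        if stop_when_over_run_time and best_run_ms < compile_spent_ms:
            break
    return best_name, best_run_ms, compile_spent_ms


# Invented candidates: each takes 4 ms to produce and runs a little faster.
plans = [("p1", 10, 4), ("p2", 8, 4), ("p3", 6, 4), ("p4", 5, 4)]

for cutoff in (True, False):
    name, run_ms, compile_ms = choose_plan(plans, cutoff)
    print(f"cutoff={cutoff}: plan {name}, total {run_ms + compile_ms} ms")
```

With these made-up numbers the cut-off settles for a slightly slower plan but spends less time overall, which is the trade-off that pays off when queries touch little data.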
08/25/2008 14:05 GMT | Modified: 08/25/2008 15:29 GMT
BSBM With Triples and Mapped Relational Data
The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF.
Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts.
The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008.
But why an OLTP workload with RDF to begin with?
We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case.
Warehousing all the world's publishable data as RDF is not our first preference, nor would it be the publisher's. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we'll do here.
What We Got
First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit.
But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. And when there is a clear mapping, there is no reason the SPARQL could not be directly translated.
If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!
We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in "query mixes per hour". (An update or follow-on to this post will provide elapsed times for each test run.)
With the unmodified benchmark we got:
| Physical Triples: | 1297 qmph |
| Mapped Triples: | 3144 qmph |
In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)
The following were measured on the second run of a 100 query mix series, single test driver, warm cache.
| Physical Triples: | 5746 qmph |
| Mapped Triples: | 7525 qmph |
We then ran the same with 4 concurrent instances of the test driver. The qmph here is the total of 400 query mixes (4 drivers, 100 mixes each) divided by the longest run time in hours.
| Physical Triples: | 19459 qmph |
| Mapped Triples: | 24531 qmph |
The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization.
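The arithmetic behind the concurrent figures can be spelled out as follows. The throughput numbers are the ones quoted above; the per-driver run times passed to `concurrent_qmph` are made up purely to show the calculation, and the function name is an assumption.

```python
def concurrent_qmph(run_times_s, mixes_per_driver=100):
    """Query mixes per hour for concurrent drivers.

    All mixes are counted, and the elapsed time is that of the slowest driver.
    """
    total_mixes = len(run_times_s) * mixes_per_driver
    return total_mixes * 3600.0 / max(run_times_s)


# Throughputs quoted in this post (runs with the text index for Q6).
single = {"physical": 5746, "mapped": 7525}
four_drivers = {"physical": 19459, "mapped": 24531}

for store in single:
    speedup = four_drivers[store] / single[store]
    print(f"{store}: {speedup:.2f}x the single-driver throughput")

# Hypothetical run times in seconds for four drivers, for illustration only.
print(f"{concurrent_qmph([70.0, 72.0, 73.0, 74.0]):.0f} qmph")
```

Both ratios come out between 3 and 4, in line with the remark that the concurrent throughput is somewhat under 4 times the single-driver figure.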
The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough.
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.
Suggestions for BSBM
- Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC.
- Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.
- Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let's use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like "customers who bought this also bought xxx."
- For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC.
- Add transactions. Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server.
Comments on Query Mix
The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.
Next
We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.
08/06/2008 19:35 GMT | Modified: 08/06/2008 16:29 GMT
Virtuoso Optimizations for the Berlin SPARQL Benchmark
We had a look at Chris Bizer's initial results with the Berlin SPARQL Benchmark (BSBM) on Virtuoso. The first results were rather bad, as nearly all of the run time was spent optimizing the SPARQL statements and under 10% actually running them.
So I spent a couple of days on the SPARQL/SQL compiler, making it produce a better initial guess of the execution plan and streamlining some operations. In fact, many of the queries in BSBM are not particularly sensitive to execution plan, as they access a very small portion of the database. So to close the matter, I put in a flag that makes the SQL compiler give up on devising new plans if the estimated run time of the best plan so far is less than the time spent compiling so far.
With these changes, available now as a diff on top of 5.0.7, we run quite well, several times better than initially. With the compiler time cut-off in place (ini parameter StopCompilerWhenXOverRunTime = 1), we get the following times, output from the BSBM test driver:
Starting test...
0: 1031.22 ms, total: 1151 ms
1: 982.89 ms, total: 1040 ms
2: 923.27 ms, total: 968 ms
3: 898.37 ms, total: 932 ms
4: 855.70 ms, total: 865 ms
Scale factor: 10000
Number of query mix runs: 5 times
min/max Query mix runtime: 0.8557 s / 1.0312 s
Total runtime: 4.691 seconds
QMpH: 3836.77 query mixes per hour
CQET: 0.93829 seconds average runtime of query mix
CQET (geom.): 0.93625 seconds geometric mean runtime of query mix
Metrics for Query 1:
Count: 5 times executed in whole run
AQET: 0.012212 seconds (arithmetic mean)
AQET(geom.): 0.009934 seconds (geometric mean)
QPS: 81.89 Queries per second
minQET/maxQET: 0.00684000s / 0.03115700s
Average result count: 7.0
min/max result count: 3 / 10
Metrics for Query 2:
Count: 35 times executed in whole run
AQET: 0.030490 seconds (arithmetic mean)
AQET(geom.): 0.029776 seconds (geometric mean)
QPS: 32.80 Queries per second
minQET/maxQET: 0.02467300s / 0.06753000s
Average result count: 22.5
min/max result count: 15 / 30
Metrics for Query 3:
Count: 5 times executed in whole run
AQET: 0.006947 seconds (arithmetic mean)
AQET(geom.): 0.006905 seconds (geometric mean)
QPS: 143.95 Queries per second
minQET/maxQET: 0.00580000s / 0.00795100s
Average result count: 4.0
min/max result count: 0 / 10
Metrics for Query 4:
Count: 5 times executed in whole run
AQET: 0.008858 seconds (arithmetic mean)
AQET(geom.): 0.008829 seconds (geometric mean)
QPS: 112.89 Queries per second
minQET/maxQET: 0.00804400s / 0.01019500s
Average result count: 3.4
min/max result count: 0 / 10
Metrics for Query 5:
Count: 5 times executed in whole run
AQET: 0.087542 seconds (arithmetic mean)
AQET(geom.): 0.087327 seconds (geometric mean)
QPS: 11.42 Queries per second
minQET/maxQET: 0.08165600s / 0.09889200s
Average result count: 5.0
min/max result count: 5 / 5
Metrics for Query 6:
Count: 5 times executed in whole run
AQET: 0.131222 seconds (arithmetic mean)
AQET(geom.): 0.131216 seconds (geometric mean)
QPS: 7.62 Queries per second
minQET/maxQET: 0.12924200s / 0.13298200s
Average result count: 3.6
min/max result count: 3 / 5
Metrics for Query 7:
Count: 20 times executed in whole run
AQET: 0.043601 seconds (arithmetic mean)
AQET(geom.): 0.040890 seconds (geometric mean)
QPS: 22.94 Queries per second
minQET/maxQET: 0.01984400s / 0.06012600s
Average result count: 26.4
min/max result count: 5 / 96
Metrics for Query 8:
Count: 10 times executed in whole run
AQET: 0.018168 seconds (arithmetic mean)
AQET(geom.): 0.016205 seconds (geometric mean)
QPS: 55.04 Queries per second
minQET/maxQET: 0.01097600s / 0.05066900s
Average result count: 12.8
min/max result count: 6 / 20
Metrics for Query 9:
Count: 20 times executed in whole run
AQET: 0.043813 seconds (arithmetic mean)
AQET(geom.): 0.043807 seconds (geometric mean)
QPS: 22.82 Queries per second
minQET/maxQET: 0.04274900s / 0.04504100s
Average result count: 0.0
min/max result count: 0 / 0
Metrics for Query 10:
Count: 15 times executed in whole run
AQET: 0.030697 seconds (arithmetic mean)
AQET(geom.): 0.029651 seconds (geometric mean)
QPS: 32.58 Queries per second
minQET/maxQET: 0.02072000s / 0.03975700s
Average result count: 1.1
min/max result count: 0 / 4
real 0m5.485s
user 0m2.233s
sys 0m0.170s
Of the approximately 5.5 seconds of running five query mixes, the test driver spends 2.2 s. The server side processing time is 3.1 s, of which SQL compilation is 1.35 s. The rest is miscellaneous system time. The measurement is on 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.
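As a cross-check, the summary figures in the driver output can be re-derived from the five per-mix runtimes it prints. The sketch below does only that, plus the compilation share quoted in the paragraph above. Variable names are mine, and nothing here is part of the BSBM driver.

```python
import math

# Per-mix runtimes in milliseconds, copied from the driver output above.
mix_ms = [1031.22, 982.89, 923.27, 898.37, 855.70]

total_s = sum(mix_ms) / 1000.0
cqet = total_s / len(mix_ms)  # arithmetic mean runtime of a query mix
cqet_geom = math.exp(sum(math.log(ms / 1000.0) for ms in mix_ms) / len(mix_ms))
qmph = len(mix_ms) * 3600.0 / total_s  # query mixes per hour

print(f"Total runtime: {total_s:.3f} s")
print(f"CQET:          {cqet:.5f} s")
print(f"CQET (geom.):  {cqet_geom:.5f} s")
print(f"QMpH:          {qmph:.2f}")

# Share of server-side time spent compiling, from the figures quoted above.
compile_share = 1.35 / 3.1
print(f"Compilation:   {compile_share:.0%} of server-side time")
```

The recomputed values agree with the driver's own summary to the printed precision, and compilation still accounts for well over a third of the server-side time in this run.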
We note that this type of workload would be done with stored procedures or prepared, parameterized queries in the SQL world.
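For readers less familiar with the SQL-side practice, here is what a parameterized query looks like from a client. SQLite from the Python standard library stands in for the database purely for illustration. The table and data are invented, and this is not the BSBM schema nor Virtuoso's client API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, label TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO product (id, label, price) VALUES (?, ?, ?)",
    [(1, "widget", 9.5), (2, "gadget", 24.0), (3, "gizmo", 3.25)],
)

# One query text with a parameter marker. Only the bound value changes between
# calls, so the prepared statement can be reused instead of being compiled anew.
LOOKUP = "SELECT label, price FROM product WHERE id = ?"


def lookup(product_id):
    """Run the same statement text with a different bound parameter."""
    return conn.execute(LOOKUP, (product_id,)).fetchone()


for pid in (1, 2, 3):
    print(lookup(pid))
```

The BSBM driver, by contrast, sends each query as fresh text with literals substituted in, which is why compilation weighs so heavily in the timings above.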
There will be some further tuning still but this addresses the bulk of the matter. There will be a separate message about the patch containing these improvements.
07/30/2008 18:17 GMT | Modified: 08/06/2008 16:29 GMT
Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs
Improvements
- Full operation with Jena and Sesame RDF Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites.
- Better support for alternate RDF indexing schemes
- Parallel operation of the RDF Sponger, importing multiple sources concurrently.
- New data formats supported for on-demand RDF-ization in the Sponger
- More efficient support for inference of subclass and sub-property; now capable of efficiently handling taxonomies of tens of thousands of classes
- OWL equivalentClass and equivalentProperty support.
- Dynamic IRI host part support for mapped data and for metadata of local resources. Renaming the host or using multiple virtual hosts will accept URIs with the right host part and refer to the same thing, no duplicate storage required.
- SPARQL optimizations for LIMIT and OFFSET
Documentation
Bug Fixes
- Generally improved safety of built-in functions, better argument checking.
- Verified UTF8 international character support in all RDF use cases, SQL client/SPARQL protocol/all data formats.
07/17/2008 17:16 GMT | Modified: 07/17/2008 15:28 GMT
De Paradigmata and The Foundational Issues
I thought that we had talked ourselves to exhaustion and beyond over the issue of the semantic web layer cake. Apparently not. There was a paper called Functional Architecture for the Semantic Web by Aurona Gerber et al at ESWC2008.
The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case.
Any diagram containing the boxes of the layer cake (i.e., URI, XML, SPARQL, OWL, RIF, Crypto, etc., etc.) in whatever order or arrangement can at best be a sort of overview of how these standards reference each other.
Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixes with the tension and compression resistance properties of metals and composites for producing motion and secondly links to Newton's laws of motion and to aerodynamics.
Not false. But it does not say that a car is good for economical commute or showing off at the strip or any number of niches that a mature industry has grown to serve.
Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface.
Such a thing cannot possibly be inferred from the standards' inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low data volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts.
We hear it over and over. Scale and speed are critical. Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure.
The even-more foundational issue is the promotion of the general concept of a Web of Data.
The core idea that the Web would be a query-able collection of data with meaningful reference between data of different provenance cannot be inferred from the picture, even though this should be its primary message. Or it is better to say that the first picture shown should stress this idea and then one could leave the layer cake, in whatever version, for explaining the standards' order of evolution or inter-reference.
So, the value proposition:
Why? Explosion of data volume, increased need of keeping up-to-date, increasing opportunity cost of not keeping in real time.
What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale.
How? URI everything and everything is cool, or, give things global names. Use RDF. Reuse names or ontologies where you can. (An ontology is a set of classes and property names plus some more.) Map relational data on the fly or store as RDF, whichever works. Query with SPARQL, easier than SQL.
So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.
06/09/2008 14:00 GMT | Modified: 06/11/2008 15:54 GMT
voiD, or Will the LOD Cloud Bring Rain?
At ESWC2008, we saw the Linked Open Data Cloud condense its first drops of precipitation.
voiD, Vocabulary of Interlinked Datasets, is an idea whose time has clearly come. By the end of the conference, many speakers had already adopted the meme.
The point is to describe what is inside the data sets. People may know this from having worked with the sets or from putting them together but to an outsider this is not evident.
The Semantic Sitemap says where there are files or end points for access. But it does not say what is inside these. Also for federation, it is important to be able to determine whether it makes sense to send a particular query to a particular end point.
If we play this right, this is what voiD will provide. I have to think of Dan Simmons' flamboyant Hyperion sci-fi series where the "void which binds" was a sort of hyperspace containing the thoughts of entities, past and present and even provided teleportation.
So what does the voiD hold, aside from infinite potentialities?
The obvious part is DC-like provenance, version, authorship, license and such data set wide information. Also the subject matter could be classified by reference to UMBEL or the Yago classification of DBpedia.
More is needed, though. The simple part is listing the ontologies, if any. Also a set of namespaces would be an idea but this could be very large.
So let us look at what we'd like to be able to answer with the voiD set.
The following could be a sample of voiD questions:
- What subjects are in the LOD cloud?
- Given this URI, what set in the LOD cloud can tell me more? This is divided into asking a text index like Sindice for the location, getting the namespace or data set and then querying voiD.
- What need I federate/load in order to combine all that is reachable from a given vocabulary? There could be for example a graph showing the data sets and edges between them, edges being qualified by a set of same as assertions, itself a voiD described set, if translations were needed.
- What sets are from the same or equally trusted publisher as this one?
These things are roughly divided into description of the set and then some details on how it is stored on a given end point.
- Given this set, in which other sets will I find use of the same URIs? For example, if I have language version x, I wish to know that language version y will have the same URIs insofar as the things meant are the same.
- Given this set, which sets of same as assertions will I have for mapping to which other sets? For example, if I have Geonames, I wish to know that set x will map at least some of the URIs in Geonames to DBpedia URIs.
Let me further point out that it is increasingly clear to the community that universal sameAs is dubious, hence sameAs assertions ought to be kept separate and included or excluded depending on the usage context.
- Given this set, what are the interesting queries I can do? This is a sort of advertisement for human consumption. This is not a list of queries for crashing the end point. Denial of service can be done in SPARQL without knowing the end point content anyhow, so this is not an added risk exposure.
- Vocabularies used. This is a reference to the OWL or RDFS resources giving the applicable ontologies, if present. Also, a complete list of classes whose direct instances actually occur in the set is useful.
- Ballpark cardinality. Something like a DARQ optimization profile would be a good idea. I would say that there should be a possibility of just including a DARQ description file as is. This is a sort of baseline, and since it already exists, we are spared the committee trouble of figuring out what it ought to contain and what not. If we start defining this from scratch, it will take long. Further, let this be optional. Quite independently of this, query processors may make optimization-related queries to remote end points insofar as the specific end point supports these. This will come in time. For now, just the basics.
Along with this, LOD SPARQL end points could adopt a couple of basic conventions. The simplest would be to agree that each would host a graph with a given URI that would contain the voiD descriptions of the data sets contained, along with the graph URI used for each set, if different from the publisher's URI for the graph. There is a point to this since an end point may load multiple data sets into one graph.
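To make the federation use case concrete: once each end point publishes which vocabularies its data sets use, a query processor can decide where a query is worth sending. The sketch below assumes such descriptions have already been fetched into a dictionary. The end point URLs, the vocabulary labels, and the function name are all hypothetical, and no voiD syntax is implied.

```python
# Hypothetical voiD-like descriptions: end point URL -> vocabularies in use.
DESCRIPTIONS = {
    "http://example.org/sparql-a": {"foaf", "geonames"},
    "http://example.org/sparql-b": {"dc", "skos"},
    "http://example.org/sparql-c": {"foaf", "dc"},
}


def relevant_endpoints(query_vocabularies, descriptions=DESCRIPTIONS):
    """Return the end points whose described vocabularies overlap the query's."""
    wanted = set(query_vocabularies)
    return sorted(url for url, vocabs in descriptions.items() if vocabs & wanted)


# A query using only FOAF terms need not be sent to end point b.
print(relevant_endpoints({"foaf"}))
```

A real selection step would also weigh cardinality hints, such as the DARQ-style profile mentioned above, but vocabulary overlap alone already prunes end points that cannot contribute.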
We hope to have a good idea of the matter in a couple of weeks, and certainly a general statement of direction to publish at Linked Data Planet by then.
06/09/2008 13:58 GMT | Modified: 06/11/2008 15:15 GMT
The DARQ Matter of Federation
Astronomers propose that the universe is held together, so to speak, by the gravity of invisible "dark matter" spread in interstellar and intergalactic space.
For the data web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in cyberspace.
To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.
DARQ is Bastian Quilitz's federated extension of the Jena ARQ SPARQL processor. It has existed for a while and was also presented at ESWC2008. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.
Bastian had split DBpedia among five Virtuoso servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.
Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.
Since we are convinced of the cause, let's talk about the means.
For DARQ as it now stands, there's probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one HTTP/1.1 message. So, if the query is "get me my friends living in cities of over a million people," there will be the fragment "get city where x lives" and later "ask if population of x greater than 1000000". If I have 100 friends, I send the 100 requests in a batch to each eligible server.
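The saving from batching is easy to quantify. The sketch below builds one message per end point carrying all 100 requests, and compares the message count with sending each request separately. The query template, predicate URI, end point URLs, and function names are made up for illustration; this is not DARQ's interface.

```python
# Hypothetical query fragment; {x} is replaced by each city URI.
TEMPLATE = ("ASK {{ <{x}> <http://example.org/population> ?n . "
            "FILTER (?n > 1000000) }}")


def unbatched_messages(bindings, endpoints):
    """One HTTP round trip per binding per end point."""
    return len(bindings) * len(endpoints)


def batch_requests(template, bindings, endpoints):
    """Build one message per end point containing a request for every binding."""
    body = [template.format(x=b) for b in bindings]
    return {endpoint: list(body) for endpoint in endpoints}


cities = [f"http://example.org/city/{i}" for i in range(100)]
endpoints = ["http://example.org/sparql-a", "http://example.org/sparql-b"]

batches = batch_requests(TEMPLATE, cities, endpoints)
print(f"unbatched: {unbatched_messages(cities, endpoints)} messages")
print(f"batched:   {len(batches)} messages")
```

The number of requests evaluated at each server is unchanged; what drops is the number of network round trips, which is where the latency goes.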
Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over JDBC so Bastian can try this if interested.
These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.
When federating SQL, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:
If a foaf:Person is found on a given server, this does not mean that the Person's geek code or email hash will be on the same server. Thus {?p name "Johnny" . ?p geekCode ?g . ?p emailHash ?h } does not necessarily denote a colocated join if many servers serve items of the vocabulary.
However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there were a server that gave geek codes but not names.
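The try-colocated-first strategy can be sketched as follows. Here `run_fragments` is a hypothetical callable standing in for whatever sends fragments to remote servers and joins the results. The pattern strings follow the example above, and none of this reflects an existing federation engine's API.

```python
def fragments(patterns, assume_colocated):
    """Split triple patterns into the fragments sent to remote servers."""
    if assume_colocated:
        return [list(patterns)]      # one fragment, joined at the remote end
    return [[p] for p in patterns]   # one fragment per join step


def answer(patterns, run_fragments):
    """Try the cheap colocated plan first; fall back to per-pattern partitioning."""
    rows = run_fragments(fragments(patterns, assume_colocated=True))
    if rows:
        return rows
    return run_fragments(fragments(patterns, assume_colocated=False))


patterns = ['?p name "Johnny"', "?p geekCode ?g", "?p emailHash ?h"]

calls = []


def fake_run(frags):
    """Stand-in executor: pretends data is only found when steps are split."""
    calls.append(len(frags))
    return [{"?p": "ex:johnny"}] if len(frags) == len(patterns) else []


print(answer(patterns, fake_run), calls)
```

In the common case the first attempt returns rows and the second, more expensive partitioning is never run.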
For Virtuoso, we find that a federated query's asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.
For description, we would take DARQ's description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.
We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We'll see.
Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso Sponger. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.
The emergence of voiD (Vocabulary of Interlinked Datasets) is a step in the direction of making federation a reality. There is a separate post about this.
06/09/2008 13:57 GMT | Modified: 06/11/2008 15:15 GMT
Aspects of RDB to RDF Mapping
The W3C has recently launched an incubator group about mapping relational data to RDF.
From participating in the group for the few initial sessions, I get the following impressions.
There is a segment of users, for example from the biomedical community, who do heavy duty data integration and look to RDF for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest.
There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as Triplify and semantic web search such as Sindice. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all.
(Some cleanup is taking place with Yago and Zitgist, but this is a matter for a different post.)
Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web.
As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.
Where do we stand?
We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source RDBMS(s) on demand. Projects in this direction are ongoing.
We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.
These things are somewhat different.
At present, both the research warehousers and the web scalers predominantly go for ETL.
This is fine by us as we definitely are in the large RDF store race.
Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in SPARQL, once the obviously necessary extensions of sub-query, expressions and aggregation are in place.
To cite an example, the Ordnance Survey of the UK has a GIS system running on Oracle with an entry pretty much fo |