VLDB Semdata Workshop
I will begin by extending my thanks to the organizers, in particular Reto Krummenacher of STI and Atanas Kiryakov of Ontotext, for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontiffs (pontifices) amongst us, who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities, to the advantage of all concerned.
Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.
I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.
One question that came up was whether RDF could approach relational performance in some things; but what about string literals being stored in a separate table, or URI strings being stored in a separate table?
The answer is that if one accesses a lot of these literals, the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there do in fact exist reports in bioinformatics that return 100,000 strings. Now take the worst abuse of SPARQL: a regexp over all literals of a property of a given class. With a column store this is a scan of the column; with RDF, a three-table join. The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.
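To make the "worst abuse" case concrete, here is a minimal sketch of such a query; the class and property names (ex:Protein, ex:description) are hypothetical placeholders, not taken from any real dataset:

PREFIX ex: <http://example.org/>

# A regexp over all literal values of a property of a given class.
# A relational column store scans one column; a triple store joins
# the rdf:type triples, the ex:description triples, and the literal table.
SELECT ?s ?o
WHERE
  {
    ?s  a               ex:Protein  .
    ?s  ex:description  ?o          .
    FILTER ( regex ( str ( ?o ), "kinase", "i" ) )
  }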
Another question was about whether the attitude was one of confrontation between RDF and relational, and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander, and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing, and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to an RDB are not so large, even though they do require changes to source code.
Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values; OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso, which has an XPath/XSLT engine built in. The same goes for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogeneous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.
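As a hedged sketch of the SPARQL-plus-XPath combination, assuming Virtuoso's xpath_eval and xtree_doc SQL functions are callable through the bif: prefix, and using a hypothetical ex:hasRecord property whose values are XML literals:

PREFIX ex: <http://example.org/>

# Select XML-typed literals with SPARQL, then evaluate an XPath
# expression over each value within the same query.
SELECT
  ?s
  (
    bif:xpath_eval
      (
        '//title/text()',
        bif:xtree_doc ( str ( ?xml ) )
      )
  ) AS ?title
WHERE
  {
    ?s  ex:hasRecord  ?xml .
  }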
Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.
With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.
Of the academic papers, SpiderStore (the paper is not yet available at the time of writing, but should be soon) and Webpie should be specially noted.
Let us talk about SpiderStore first.
SpiderStore
The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.
According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. Since the 5 pointers of an IRI record are amortized over the triples that reference it, this works out to about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.
This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will bring the space consumption closer to 40 bytes per triple. Further, should one add a graph to the mix, one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as each distinct value has a chunk of memory to it with a type tag.
We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.
But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.
SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since it executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize it by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform, which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, especially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.
We too could adopt some ideas from SpiderStore. Namely, when running vectored, one generates, just in passing and without extra overhead, an array of links to the next IRI, a bit like the array that SpiderStore keeps per predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.
If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.
Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.
Webpie
Webpie from VU Amsterdam and the LarKC EU FP7 project is, as it were, the opposite of SpiderStore. It is a map-reduce-based RDFS and OWL Horst inference engine, all about breadth-first passes over the data with intermediate disk-based storage.
Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.
The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT … SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.
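As a minimal sketch, assuming SPARQL 1.1 Update syntax, one such step for the rdfs:subClassOf rule could look as follows; it would be re-run until it reports zero new inserts:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# One materialization step: propagate rdf:type along rdfs:subClassOf.
# Repeat until no new triples are inserted.
INSERT
  {
    ?s  a  ?super .
  }
WHERE
  {
    ?s    a                ?sub   .
    ?sub  rdfs:subClassOf  ?super .
    FILTER NOT EXISTS { ?s  a  ?super }
  }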
We have suggested such an experiment to the LarKC people. We will see.
09/21/2010 17:14 GMT-0500 | Modified: 09/21/2010 16:22 GMT-0500
More On Parallel RDF/Text Query Evaluation
We have received some more questions about Virtuoso's parallel query evaluation model.
In answer, we will here explain how we do search engine style processing by writing SPARQL. There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce.
The point is that what used to require programming can often be done in a generic query language. The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit. But by combining these two things, we are a step closer to the web being the database.
I will here show how we do some joins combining full text, RDF conditions, and aggregates and ORDER BY. The sample task is finding the top 20 entities with New York in some attribute value. Then we specify the search further by only taking actors associated with New York. The results are returned in the order of a composite of entity rank and text match score.
The basic query is:
SELECT
(
sql:s_sum_page
(
<sql:vector_agg>
(
<bif:vector> ( ?c1 , ?sm )
),
bif:vector
( 'new', 'york' )
)
) AS ?res
WHERE
{
{
SELECT
(
<SHORT_OR_LONG::>(?s1)
) AS ?c1
(
<sql:S_SUM>
(
<SHORT_OR_LONG::IRI_RANK> ( ?s1 ) ,
<SHORT_OR_LONG::> ( ?s1textp ) ,
<SHORT_OR_LONG::> ( ?o1 ) ,
?sc
)
) AS ?sm
WHERE
{
?s1 ?s1textp ?o1 .
?o1 bif:contains "new AND york"
OPTION ( SCORE ?sc )
}
ORDER BY
DESC
(
<sql:sum_rank>
((
<sql:S_SUM>
(
<SHORT_OR_LONG::IRI_RANK> ( ?s1 ) ,
<SHORT_OR_LONG::> ( ?s1textp ) ,
<SHORT_OR_LONG::> ( ?o1 ) ,
?sc
)
))
)
LIMIT 20
}
}
This takes some explaining. The basic part is
{
?s1 ?s1textp ?o1 .
?o1 bif:contains "new AND york"
OPTION ( SCORE ?sc )
}
This just makes tuples where ?s1 is the subject (the entity), ?s1textp the property, and ?o1 the literal which contains "New York". For a single ?s1, there can of course be many properties which all contain "New York".
The rest of the query gathers all the "New York" containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities.
After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between "New York" and the strings containing "New York". The text hit score is higher if the words repeat often and in close proximity.
The s_sum function is a user-defined aggregate which takes 4 arguments: the rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score.
These are grouped by the subject of the triple. After this, they are sorted by the sum_rank of the aggregate constructed with s_sum. The sum_rank is a SQL function combining the entity rank with the text scores of the different literals.
This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple. The text index entries of an object are stored in the same partition as the object. But the entity rank is a property of the subject and is partitioned by the subject. Also the GROUP BY is by the subject. Thus the data is produced from all partitions, then streamed into the receiving partitions, determined by the subject. This partition can then get the score and group the matches by the subject. Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top k sort can be done for each partition separately. Finally, the top 20 of each partition are merged into the global top 20. This is then passed to a final function s_sum_page that turns this all into an XML fragment that can be processed with XSLT for inclusion on a web page.
This differs from a text search engine in that the query pipeline can contain arbitrary cross-partition joins. Also, the string "New York" is a common label that occurs in many distinct entities. Thus one text match, to one document containing only the string "New York", will get many entities, likely all from different partitions.
So, if we only want actors with a mention of "New York", we need to get the inner part of the query as:
{
?s1 ?s1textp ?o1 .
?o1 bif:contains "new AND york"
OPTION ( SCORE ?sc ) .
?s1 a <http://umbel.org/umbel/sc/Actor>
}
Whether an entity is an actor can be checked in the same partition as the rank of the entity. Thus the query plan gets this check right before getting the rank. This is natural since there is no point in getting the rank of something that is not an actor.
The <short_or_long::sql:func> notation means that we call func, a SQL stored procedure, with the arguments in their internal form. Thus, if a variable bound to an IRI is passed, the short_or_long specifies that it is passed as its internal ID and is not converted into its text form. This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end.
Now, when we run this on a collection of 4.5 billion triples of linked data, once we have the working set, we can get the top 20 "New York" occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy. (The hardware is two boxes with two quad-core Xeon 5345 each.)
If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy. This gets about 200K "New York" strings, which becomes about 400K entities with New York somewhere, for which a rank then has to be retrieved. After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups. These are then sorted. This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort.
By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions. There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously. In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous.
Doing JOINs between partitions and partitioned GROUP BY/ORDER BY is pretty regular database stuff. Applying this to RDF is a most natural thing.
If we do not parallelize the user-defined aggregate for grouping all the "New York" occurrences, the query takes 8s instead of 1.1s. If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we would have to either bring all the data to a central point before the top k, which would destroy performance, or write procedures with explicit parallel procedure calls, which is hard to write, surely too hard for ad hoc queries.
Results of live execution may not be complete on initial load, as this link includes a "Virtuoso Anytime" timeout of 10 seconds. Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above.
Engineering matters. If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints. If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.
08/19/2009 13:28 GMT-0500 | Modified: 08/19/2009 14:00 GMT-0500
ISWC 2008: RDB2RDF Face-to-Face
The W3C's RDB-to-RDF mapping incubator group (RDB2RDF XG) met in Karlsruhe after ISWC 2008.
The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from SPARQL to SQL. There was a lot of agreement and the meeting even finished ahead of the allotted time.
Whose Identifiers?
There was discussion concerning using the Entity Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM's reference cannot depend on any registration. Also, even when there is a stable URI for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.
Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with "same as" assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.
XML or RDF?
We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as OWL ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has XPath in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the information schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.
Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of data, such a description of the exposed data is essential.
Interoperability
We agreed with Sören Auer that we could make Virtuoso's mapping language compatible with Triplify. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.
There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.
Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go.
11/04/2008 13:26 GMT-0500 | Modified: 11/04/2008 17:20 GMT-0500
State of the Semantic Web, Part 2 - The Technical Questions (updated)
Here I will talk about some more technical questions that came up. This is mostly general; Virtuoso specific questions and answers are separate.
"How to Bootstrap? Where will the triples come from?"
There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern.
The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources.
"How will one develop applications?"
I'd say one will develop them much the same way as thus far. In PHP, for example. Whether one's query language is SPARQL or SQL does not make a large difference in how basic web UI is made.
A SPARQL end-point is no more an end-user item than a SQL command-line is.
A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.
The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly. Even I, at OpenLink, after supposedly having been around the block a couple of times, made some experiments around the topic. What does make sense is putting a thin wrapper of HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good.
The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? "Don't mode me in" was the slogan, as I recall.
Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.
For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties, though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open data, so one's data can be joinable with open data by using the same identifiers, even while maintaining arbitrary levels of security and compartmentalization.
For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have a real edge must be more capable of ad hoc viewing and navigation than regular real-time or line-of-business user interfaces.
The OpenLink Data Explorer idea of a "data behind the web page" view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this.
"What of the proliferation of standards? Does this not look too tangled, no clear identity? How would one know where to begin?"
When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurposable data, and rich vocabularies and structure: just the right amount of structure for the application, less rigid and easier to change than relational.
Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure — HTTP, URIs, and XML where desired.
It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired.
"Can I make a social network application in RDF only, with no RDBMS?"
Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students, and anyway they do not hang around online browsing the course catalogue.
On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS.
If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently.
In the near term we will see the birth of an application paradigm for the data web. The data will be open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.
10/26/2008 12:02 GMT-0500 | Modified: 10/27/2008 11:28 GMT-0500