In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well-defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues.
INSERT and DELETE as defined in SPARQL are queries which generate a result set that is then used for instantiating triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the expression DELETE FROM table WHERE key IN ( SELECT key1 FROM other_table ), supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table.
The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.
Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject 1 or would have the three just inserted.
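The two admissible end states can be checked with a small simulation; the triples and operations mirror the example above, while the helper function names are ours:

```python
# Simulate serial execution of the DELETE and the INSERT in both orders.
# Triples are (s, p, o) tuples, with numbers standing in for IRIs.

def delete_matching(store, subject):
    """DELETE { 1 ?x ?y }: drop every triple with the given subject."""
    return {t for t in store if t[0] != subject}

def insert(store, triples):
    return store | set(triples)

initial = {(1, 0, 0), (1, 5, 6), (1, 5, 7)}
inserted = {(1, 2, 4), (1, 2, 3), (1, 3, 5)}

# Serialization order 1: INSERT first, then DELETE -> no subject-1 triples.
after_insert_first = delete_matching(insert(initial, inserted), 1)

# Serialization order 2: DELETE first, then INSERT -> only the new triples.
after_delete_first = insert(delete_matching(initial, 1), inserted)

print(after_insert_first)  # set()
print(after_delete_first)  # the three inserted triples
```

Any serializable execution must produce one of these two states; anything else means the concurrency control let the two transactions interleave.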
Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE.
We would not get the desired end state (either all the inserted triples or no triples with subject 1) if the read of the DELETE were not serializable.
Furthermore, if a DELETE template produced a triple that did not exist in the pre-image, the DELETE semantics still imply that this triple also does not exist in the after-image, which implies serializability.
Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.
The initial state is:

a balance 10
b balance 10

We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9.
A relational database needs REPEATABLE READ isolation for this.
With RDF, txn1 reads that a has a balance of 10. At the same time, txn2 attempts to read the balance of a. txn2 waits because the read of txn1 is exclusive. txn1 proceeds and reads the balance of b. It then updates the balances of a and b.
All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done, and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF's schema-less model.
We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in an RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could argue here that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts, but not the more general case.
Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }.
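One way to picture this key-range (next-key) locking: the serializable reader locks the gap just before the first possible match of { a balance ?x }, so a concurrent insert into that gap must wait. A minimal sketch, where the lock-table representation is ours and not any particular engine's:

```python
class KeyRangeLocks:
    """Toy next-key lock table over an ordered (S, P, O) key space."""
    def __init__(self):
        self.range_locks = []  # (low, high) half-open intervals

    def lock_range(self, low, high):
        # A serializable reader locks [low, high): the existing rows AND
        # the gaps between them, so no new key can appear in the range
        # until the lock is released.
        self.range_locks.append((low, high))

    def can_insert(self, key):
        return not any(lo <= key < hi for lo, hi in self.range_locks)

# Keys use numbers for IRIs as in the text; say a=1, balance=2.
NEG, POS = float('-inf'), float('inf')
locks = KeyRangeLocks()
# txn1's serializable read of { a balance ?x } locks the whole (1, 2, *)
# range, including the gap before the first match.
locks.lock_range((1, 2, NEG), (1, 2, POS))
print(locks.can_insert((1, 2, 9)))   # False: the insert must wait
print(locks.can_insert((2, 2, 10)))  # True: another subject's balance
```

Under plain REPEATABLE READ only the rows actually read would be locked, and the insert of { a balance 9 } would slip into the unlocked gap.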
To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood for contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, lock acquisition is not guaranteed to be serial, so deadlocks do not entirely go away, but they do become fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why an OLTP application must in any case be built to deal with the possibility of deadlock.
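The no-deadlock property of ordered acquisition can be seen from a wait-for argument: two transactions can deadlock only if some pair of locks is requested in opposite orders. A sketch, with illustrative names of our own:

```python
from itertools import combinations

def deadlock_possible(seq1, seq2):
    """Two transactions can deadlock iff they acquire some pair of
    shared locks in opposite orders (each can then hold one lock and
    wait forever on the other)."""
    pos1 = {lock: i for i, lock in enumerate(seq1)}
    pos2 = {lock: i for i, lock in enumerate(seq2)}
    shared = set(pos1) & set(pos2)
    for x, y in combinations(shared, 2):
        if (pos1[x] < pos1[y]) != (pos2[x] < pos2[y]):
            return True
    return False

# Unordered acquisition: the classic account-transfer deadlock.
print(deadlock_possible(['a', 'b'], ['b', 'a']))                  # True

# Both transactions lock in key order: no opposite pair remains.
print(deadlock_possible(sorted(['a', 'b']), sorted(['b', 'a'])))  # False
```

Sorting the lock requests by key is exactly the "lowest key value first" rule above: it makes opposite-order pairs impossible within one database.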
This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.
RDBMSs usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system, and applying row-level locking to this does not give the semantics one expects of a relational row.
I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.
It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.
We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.
To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one also locks the row that precedes the set of selected rows, and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.
Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.
The issue of optimistic vs. pessimistic concurrency does not seem to be affected by the differences between the RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency, since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.
In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization.
Web scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.
Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.
Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed quads (i.e., DELETEd and INSERTed, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that should be single-valued for the application. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.
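Comparing time-stamped quad logs across two sites might look like the following; the log format and the resolution rule (latest timestamp wins, ties with differing operations escalated to the application) are our assumptions for illustration:

```python
def reconcile(log_a, log_b):
    """Each log maps quad -> (timestamp, op), op in {'INSERT', 'DELETE'}.
    Per quad, the latest-timestamped operation wins; equal timestamps
    with differing ops are reported as conflicts for application logic."""
    merged, conflicts = {}, []
    for quad in set(log_a) | set(log_b):
        a, b = log_a.get(quad), log_b.get(quad)
        if a and b and a[0] == b[0] and a[1] != b[1]:
            conflicts.append(quad)
        else:
            merged[quad] = max(e for e in (a, b) if e)
    # Post-reconciliation state: every quad whose winning op is INSERT.
    state = {q for q, (ts, op) in merged.items() if op == 'INSERT'}
    return state, conflicts

site1 = {('g', 's', 'p', 'o1'): (5, 'INSERT')}
site2 = {('g', 's', 'p', 'o1'): (7, 'DELETE'),
         ('g', 's', 'p', 'o2'): (6, 'INSERT')}
state, conflicts = reconcile(site1, site2)
print(state)      # {('g', 's', 'p', 'o2')}
print(conflicts)  # []
```

A real reconciler would consult related quads before deciding, as the text suggests, rather than trust timestamps alone.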
As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.
We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: what used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.
Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.
Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?
There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.
At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the NoSQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.
If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.
There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).
If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.
There is of course the OWL side, where consistency is important but is defined in such complex ways that it again is not a good fit for RDBMS. RDF could be seen as split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.
Transactions are relevant to RDF in two cases: first, if data is trickle-loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; second, if the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF, but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.
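The seat example is the classic read-then-write pattern; without isolation, the check and the reserve of two clients can interleave and double-book. A minimal sketch, where the in-memory "database" and names are ours:

```python
import threading

seats = {'12A': None}          # seat -> holder, None while free
seats_lock = threading.Lock()  # stands in for transactional isolation

def reserve(seat, customer):
    """Check-and-reserve as one atomic unit: no other client can
    observe the seat between our read and our write."""
    with seats_lock:
        if seats[seat] is None:
            seats[seat] = customer
            return True
        return False

print(reserve('12A', 'alice'))  # True: the seat was free
print(reserve('12A', 'bob'))    # False: taken, no double booking
```

A transactional store gives the same guarantee across client-server round trips, which a single process-local lock of course cannot.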
Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.
Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.
As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.
As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.
This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.
As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.
Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.
With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.
If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, if these are under a certain size (normally 10,000 quads), is guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there was always a possibility of having an INSERT or DELETE fail because of deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock-free, since there the primary transaction is the INSERT or DELETE of a small graph.
Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.
In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.
I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.
At the start of 2010, we had internally demonstrated 4x space-efficiency gains from column-wise compression and 3x loop-join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogeneous data type and possible sort order of the column. Vectored execution means passing large numbers of query-variable bindings between query operators, and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale-out.
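The intent of vectored execution can be sketched as follows: a join operator receives a whole array of bindings rather than one at a time, and may sort the batch for locality; the function names here are illustrative only, not Virtuoso internals:

```python
def vectored_hash_probe(build_side, probe_keys):
    """Probe a hash table with a whole vector of keys at once.
    Sorting the batch improves locality of reference on ordered
    structures and groups duplicate keys together; the per-operator
    call overhead is paid once per batch, not once per row."""
    results = []
    for key in sorted(probe_keys):
        if key in build_side:
            results.append((key, build_side[key]))
    return results

build = {k: k * 10 for k in range(0, 100, 2)}  # even keys only
probes = [7, 4, 2, 4, 99, 0]                   # one batch of bindings
print(vectored_hash_probe(build, probes))
# [(0, 0), (2, 20), (4, 40), (4, 40)]
```

The real gains come from keeping the batch in CPU cache and amortizing interpretation overhead, which a scalar per-row interface cannot do.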
So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso's relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented HASH JOIN and GROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.
At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.
A lot of this work is not specifically RDF-oriented, but all of it is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands, so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time typing turns out to be small, since data in practice tends to be of homogeneous type and with locality of reference in values. Virtuoso's column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogeneous types.
The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side.
The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.
LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the "RDF tax," by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.
So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.
For now, our priority is to release the substantial gains that have already been accomplished.
After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper.
Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.
The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.
Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed.
The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?
Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market.
These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of Datalog, is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start.
Stay tuned for more news later this month!
Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.
I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.
One question that came up was that, even if RDF could approach relational performance in some things, what about string literals being stored in a separate table? Or URI strings being stored in a separate table?
The answer is that if one accesses a lot of these literals, the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics with 100,000 strings. Now take the worst abuse of SPARQL: a regexp over all literals of a property of a given class. With a column store this is a scan of the column; with RDF, a three-table join. The join is about 10x slower than the column scan. Quite OK, considering that a full-text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.
Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code.
Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full-text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso, which has an XPath/XSLT engine built in. The same goes for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogeneous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.
Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.
With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.
Of the academic papers, SpiderStore (the paper is not yet available at time of writing, but should be soon) and Webpie deserve special note.
Let us talk about SpiderStore first.
The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.
According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.
This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple, plus should one add a graph to the mix one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag.
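The pointer arithmetic above can be sketched as a back-of-envelope model. The per-record costs (5 pointers per distinct IRI, 3 per triple) come from the authors; the ratio of distinct IRIs to triples used here is an assumption chosen to reproduce the ~4 pointers/triple figure quoted in the text.

```python
# Back-of-envelope model of SpiderStore's memory cost per triple.
# 5 pointers per distinct IRI record, 3 pointers per triple (per the authors);
# the IRI-to-triple ratio (0.2) is an illustrative assumption.
POINTER_BYTES = 8  # 64-bit pointers


def bytes_per_triple(iri_to_triple_ratio: float) -> float:
    pointers = 3 + 5 * iri_to_triple_ratio
    return pointers * POINTER_BYTES


print(bytes_per_triple(0.2))  # 32.0 bytes, matching the figure in the text
```

With fragmentation and growth slack on top, this moves toward the ~40 bytes per triple mentioned above.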
We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.
But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.
SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, especially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.
We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.
If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.
Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.
Webpie from VU Amsterdam and the LarKC EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.
Webpie is not, however, a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.
The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT ... SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.
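The fixed-point loop described above can be sketched in a few lines. This is a minimal in-memory stand-in: a single RDFS-style rule (subClassOf transitivity) plays the role of the SPARQL INSERT ... SELECT, and the loop terminates exactly when a pass reports no new inserts.

```python
# Minimal sketch of materialization to a fixed point. One RDFS-style rule
# (subClassOf transitivity) stands in for an INSERT ... SELECT statement;
# the loop stops when a pass inserts nothing new, which is the signal the
# text asks the INSERT statement to report.
SUB = "rdfs:subClassOf"


def materialize(triples):
    triples = set(triples)
    while True:
        inferred = {(a, SUB, c)
                    for (a, p1, b) in triples if p1 == SUB
                    for (b2, p2, c) in triples if p2 == SUB and b2 == b}
        new = inferred - triples
        if not new:  # no new inserts -> fixed point reached
            return triples
        triples |= new


closure = materialize({("A", SUB, "B"), ("B", SUB, "C"), ("C", SUB, "D")})
print(("A", SUB, "D") in closure)  # True
```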
We have suggested such an experiment to the LarKC people. We will see.
The queries touch very little data, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of RDF.
Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.
An update stream would make the workload more realistic.
We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics.
So I am publishing the below as a starting point for discussion.
Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation is between linear and n * log(n) in the data size. The TPC-H rules can be used for a power metric (single user) and a throughput metric (multi-user, where each user submits queries from the mix with different parameters and in different order). The TPC-H score formula and executive summary formats are directly applicable.
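The TPC-H-style power metric mentioned above is essentially an inverted geometric mean of elapsed times, scaled by the size factor. A minimal sketch, assuming a uniform treatment of all queries in the analytics mix (the 3600 constant, queries per hour, follows TPC-H):

```python
# Sketch of a TPC-H-style power metric: scale factor times queries-per-hour,
# divided by the geometric mean of per-query elapsed times (seconds).
# Treating every query in the mix uniformly is an assumption.
import math


def power_at_size(times_sec, scale_factor):
    geo_mean = math.exp(sum(math.log(t) for t in times_sec) / len(times_sec))
    return 3600.0 * scale_factor / geo_mean


print(power_at_size([1.0] * 14, 1))  # 3600.0: 1-second queries at scale 1
```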
This can be a separate metric from the "restricted" BSBM score. Restricted means "without a full scan with regexp" which will dominate the whole metric at larger scales.
Vendor-specific variations in syntax will occur; hence these are allowed, but disclosure of the specific query text should accompany results. Hints for JOIN order and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible.
The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after backup post initial load is allowed, resulting in semi-warm cache. Steady-state rules will be defined later, seeing the characteristics of the actual workload.
For each country, list the top 10 product categories, ordered by the count of reviews from the country.
Product with the most reviews during its first month on the market
10 products most similar to X, with similarity score based on the count of features in common
Top 10 reviewers of category X
Product with largest increase in reviews in month X compared to month X-minus-1.
Product of category X with largest change in mean price in the last month
Most active American reviewer of Japanese cameras last year
Correlation of price and average review
Features with greatest impact on price: for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature
Country with greatest popularity of products in category X: reviews of category X from country Y divided by total reviews
Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers
Fans of manufacturer: find top reviewers who score manufacturer above their mean score
Products sold only in country X
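One of the questions above, "10 products most similar to X" with similarity as the count of shared features, can be sketched directly. The feature map here is invented sample data; a real implementation would express this as a self-join in SPARQL or SQL.

```python
# Toy sketch of the product-similarity question: score each candidate by the
# number of features shared with product X, return the top k. The feature
# sets are illustrative sample data, not the BSBM generator's output.
features = {
    "p1": {"mp3", "mp4", "ogg"},
    "p2": {"mp3", "mp4"},
    "p3": {"mp3"},
    "p4": {"radio"},
}


def most_similar(x, k=10):
    scores = [(len(features[x] & feats), p)
              for p, feats in features.items() if p != x]
    scores.sort(reverse=True)
    return [(p, s) for s, p in scores[:k]]


print(most_similar("p1"))  # [('p2', 2), ('p3', 1), ('p4', 0)]
```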
Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries.
For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.
Q6 from the original mix, now allowing use of text index.
Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg.
Ibid., but now specifying the review author. The intent is that the structured criteria are here more selective than the text.
Difference in the frequency of use of "awesome", "super", and "suck(s)" by American vs. European vs. Asian review authors.
For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this.
The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.
The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.
For supporting the IR mix, reviews should, in addition to random text, contain the following:
For each feature in the product concerned, add the label of said feature to 60% of the reviews.
Add the names of review author, product, product category, and manufacturer.
The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms.
Skew the review scores so that comparatively expensive products have a smaller chance for a bad review.
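The generation rules above can be sketched as a small review-text generator. The percentages follow the text; the word lists, filler vocabulary, and sampling details are illustrative assumptions, not the benchmark's actual generator.

```python
# Hedged sketch of the review-text generation rules. Percentages (60% feature
# mention, 80% score-correlated adjective) follow the text; the adjective
# lists and filler words are invented for illustration.
import random

GOOD = ["awesome", "super", "good"]
BAD = ["dismal", "bad", "sucky"]


def make_review(rng, author, product, features, score):
    words = [author, product]          # review names author and product
    for f in features:
        if rng.random() < 0.60:        # each feature label in 60% of reviews
            words.append(f)
    adjectives = GOOD if score >= 3 else BAD
    for i in range(40):
        # every 20th word is an adjective, correlated with the score 80% of
        # the time and random otherwise
        if i % 20 == 19:
            pool = adjectives if rng.random() < 0.80 else GOOD + BAD
            words.append(rng.choice(pool))
        else:
            words.append("filler")
    return " ".join(words)


rng = random.Random(42)
text = make_review(rng, "alice", "player-x", ["mp4", "ogg"], score=4)
print("alice" in text and "player-x" in text)  # True
```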
During the benchmark run:
1% of products are added;
3% of initial offers are deleted and 3% are added; and
5% of reviews are added.
Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.
The initial bulk load does not have to be transactional in any way.
Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in READ COMMITTED isolation, so that half-inserted products or offers are not seen.
Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.
The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.
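A load program along these lines only needs to split each file on the boundary comment. A minimal sketch, where the exact marker text ("# xact") is an assumption; the specification only requires a comment distinguishable from the data.

```python
# Sketch of splitting an update-stream file into transactions on a boundary
# comment. The marker string "# xact" is a hypothetical choice; the spec only
# says boundaries are marked with a distinguishable comment.
def split_transactions(turtle_text, marker="# xact"):
    xacts, current = [], []
    for line in turtle_text.splitlines():
        if line.strip() == marker:
            if current:
                xacts.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        xacts.append("\n".join(current))
    return xacts


sample = "<a> <p> <b> .\n# xact\n<c> <p> <d> .\n<c> <q> <e> ."
print(len(split_transactions(sample)))  # 2
```

Each resulting chunk can then be committed as one transaction; independent files can go to parallel threads as the text allows.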
The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.
The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.
As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and RDF benchmarks in February.
The LOD2 repository is a fusion of the OpenLink LOD Cloud Cache (which includes data from URIBurner and PingTheSemanticWeb) and Sindice, both hosted at DERI. The value-add compared to Sindice or the Virtuoso-based LOD Cloud Cache alone is the merger of the timeliness and ping-driven crawling of Sindice with the SPARQL of Virtuoso.
Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to much better working set, as data is many times more compact than with the present row-wise key compression.
Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage.
As for benchmarks, I just compiled a draft of suggested extensions to BSBM (the Berlin SPARQL Benchmark). I talked about this with Peter Boncz and Chris Bizer, to the effect that some extensions of BSBM could be done but that the time was a bit short for making an RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational schema and that RDF offers no fundamental edge for the workload.
There was a graph benchmark talk at the TPC workshop at VLDB 2010. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of this is not yet set but LOD2 will in time publish results from such.
We did informally talk about a process for publishing with our colleagues from Franz and Ontotext at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware.
Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-H in SQL and SPARQL. The SQL will be Virtuoso, MonetDB, and possibly VectorWise and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end.
In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it.
LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at CWI to RDF use cases.
This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, in specific queries with Datalog-like rules and recursion.
LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all-end-all of this. Of course, a parallel database like Virtuoso, Greenplum, or Vertica can do map-reduce style operations under control of the SQL engine. After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the Berkeley Orders Of Magnitude (BOOM) distributed Datalog implementation (Overlog, Dedalus, Bloom) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed.
From our viewpoint, the project's gains include:
Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility.
Ready to use toolbox for data integration, including schema alignment and resolution of coreference.
Data discovery, summarization, and visualization.
Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and linked data. In this respect the integration of results may be stronger than often seen in European large scale integrating projects.
The use cases fit the development profile well:
Wolters Kluwer will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology.
Exalead will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources.
The Open Knowledge Foundation will create a portal of all government published data for easy access by citizens.
In all these cases, the integration requirements of schema alignment, resolution of identity, information extraction, and efficient storage and retrieval play a significant role. The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.
I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech, and leave management to those who have time for it.
When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.
In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples, for example my wife's grandmother had decided to transform a regular willow into a weeping one by tying down the branches. Such "magic," needless to say, takes constant maintenance; else the spell breaks.
To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow.
Now, in more specific terms, what can we realistically expect to teach about computer science?
Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.
Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.
Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.
I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.
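The array-parameter point is easy to see with any database API. A minimal sketch with sqlite3; since sqlite3 is in-process it only demonstrates the interface shape (one call carrying many parameter rows), not the network round-trip savings that motivate the precept.

```python
# One batched call carries many parameter rows through the interface,
# instead of one interface crossing per row. sqlite3 stands in for any
# client/server database API, where the savings are far larger.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s INTEGER, p INTEGER, o INTEGER)")

rows = [(1, 2, i) for i in range(1000)]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)  # array parameters

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 1000
```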
Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.
LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.
Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.
I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:
The edge is created in the "Wild West": there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter with the cowboys on the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."
So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.
But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.
Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.
Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.
The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.
Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.
Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.
So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?
Data integration. Given heterogeneous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.
Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.
Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.
Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seem about inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.
Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.
Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.
Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.
The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.
We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.
If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.
The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered.
Related
In 2009, RDF scalability questions were solved in their broad outline and the corresponding Virtuoso release was built and used in production internally. Its general availability is now imminent while it has been available on a case-by-case basis thus far.
In 2010, we take on a new challenge: To bring RDF closer to parity with equivalent relational solutions. This will also entail some significant improvements to our relational technology.
Storage density is a key ingredient of performance. Some of the advances will be in this area; other advances will be in increased parallelism of execution. Right now we run things in vectored batches in cluster situations where message latency forces operations to be shipped in large chunks. Next we will do this across the board, also in single servers. The advantages of this for cache behavior and other factors are known in the literature.
Looking at environmental factors, we have a new SPARQL at a Working Draft stage. We have basic parity with SQL expressivity, which is a prerequisite for RDF to become a data model that can be an alternative to relational outside of very specialized contexts.
As the standards process makes SPARQL closer to being an alternative to SQL for data integration, we will make the database engine technology such that RDF's inherent penalty in terms of storage overhead and processing time substantially decreases. This will make RDF a workable integration medium also in places where it was not such before. Of course, an application-specific schema will retain some advantage over a generic one, but then one can have a purely relational application on Virtuoso as well. Just think of the possibility of an application-specific schema emerging by itself in a workload-driven fashion.
As background data for an increasing number of fields becomes available as linked data, using this together with proprietary data for analytics and discovery becomes increasingly interesting. This is the initial line of RDF data warehousing. The biomedical field has many examples. The technologies we will release during 2010 will be geared towards enabling a second line of RDF applications, where ad hoc agile integration with RDF as a lingua franca becomes a real alternative to relational solutions with ETL point solutions for harvesting information from diverse systems. One may see how RDF's flexibility and expressivity may add to agility in any number of situations where data from heterogeneous sources needs to be integrated. Which of today's business scenarios does not face this issue?
References:
The geometry support is for both SQL and SPARQL. On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with RDF, a geometry can occur as the object of a quad. If the object is a typed literal of the virtrdf:Geometry type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed. After this, SQL MM predicates and functions can be used with SPARQL, like this:
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?class COUNT (*)
WHERE
  {
    ?m geo:geometry ?geo .
    ?m a ?class .
    FILTER ( <bif:st_intersects> ( ?geo, <bif:st_point> (0, 52), 100 ) )
  }
GROUP BY ?class
ORDER BY DESC 2
This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London.
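For intuition, the within-100-km test that st_intersects performs here amounts to a great-circle distance check against the reference point. Below is a toy Python sketch of that check, assuming a spherical Earth and haversine distance; the actual geometry index may differ in detail.

```python
# Toy sketch (not Virtuoso code) of the "within N km of a point" test,
# assuming great-circle (haversine) distance on a spherical Earth.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lng1, lat1, lng2, lat2):
    """Great-circle distance in km between two (long, lat) points."""
    dlng = radians(lng2 - lng1)
    dlat = radians(lat2 - lat1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_km(point, ref, km):
    return haversine_km(point[0], point[1], ref[0], ref[1]) <= km

# (0, 52) is the reference point near London used in the query above.
print(within_km((-0.1, 51.5), (0, 52), 100))   # central London -> True
```

A point in central Paris, by contrast, is some 350 km away and fails the test.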
For any data set with WGS 84 geo:long
and geo:lat
values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the geo:geometry
property of the subject with the long/lat. This then enables fast spatial access to arbitrary location data in RDF.
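The effect of that SQL function can be sketched in Python over plain (s, p, o) tuples; the ex:London subject and the tuple representation of a point are made up for illustration, only the geo: namespace comes from the data.

```python
# Toy sketch: for every subject with WGS 84 geo:long and geo:lat values,
# derive a point geometry and add it as a geo:geometry property.
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"

def add_point_geometries(triples):
    longs = {s: o for (s, p, o) in triples if p == GEO + "long"}
    lats  = {s: o for (s, p, o) in triples if p == GEO + "lat"}
    return triples + [
        (s, GEO + "geometry", ("POINT", float(longs[s]), float(lats[s])))
        for s in longs if s in lats
    ]

data = [
    ("ex:London", GEO + "long", "-0.1"),
    ("ex:London", GEO + "lat", "51.5"),
]
print(add_point_geometries(data)[-1][2])   # ('POINT', -0.1, 51.5)
```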
Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities. As these get adopted we will support them.
For scalability, we tried the implementation with OpenStreetMap's 350 million or so points. The geometry implementation partitions well over a cluster, similarly to a full-text index; i.e., every server has its slice of the geometries, partitioned by the geometry object's key, thus not by range of coordinates or the like. In this way, the items are evenly spread even though the coordinate distribution is highly uneven.
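The partitioning scheme can be sketched as follows; this is a Python toy, not the actual cluster code, and md5 merely stands in for whatever hash function the engine uses on the key.

```python
# Hash-partition geometries by the geometry object's key (like a
# full-text index slice), not by coordinate range, so items spread
# evenly across servers even when coordinates are highly skewed.
from hashlib import md5

N_SERVERS = 8

def partition(key, n_servers=N_SERVERS):
    return int(md5(key.encode()).hexdigest(), 16) % n_servers

# Even if all points are in one dense city, their keys scatter:
slices = {partition(f"geom-{i}") for i in range(1000)}
print(sorted(slices))   # all 8 slices receive data
```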
We can do spatial joins like this:
SELECT ?s ( <sql:num_or_null> (?p) ) COUNT (*)
WHERE
  {
    ?s <http://dbpedia.org/ontology/populationTotal> ?p .
    FILTER ( <sql:num_or_null> (?p) > 1000000 ) .
    ?s geo:geometry ?geo .
    FILTER ( <bif:st_intersects> ( ?pt, ?geo, 5 ) ) .
    ?xx geo:geometry ?pt
  }
GROUP BY ?s ( <sql:num_or_null> (?p) )
ORDER BY DESC 3
LIMIT 20
This takes the DBpedia subjects that have a population over 1 million and a geometry. We then count all the geometries within 5 km of the point location of the first geometry. With DBpedia (about 5 million points), GeoNames (7 million points), and OpenStreetMap (350 million points), we get the result:
http://dbpedia.org/resource/Munich                       1356594   117280
http://dbpedia.org/resource/London                       7355400    81486
http://dbpedia.org/resource/Davao_City                   1363337    58640
http://dbpedia.org/resource/Belo_Horizonte               2412937    58640
http://dbpedia.org/resource/Chengde                      3610000    58640
http://dbpedia.org/resource/Hamburg                      1769117    51664
http://dbpedia.org/resource/San_Diego%2C_California      1266731    47685
http://dbpedia.org/resource/Bursa                        1562828    47685
http://dbpedia.org/resource/Port-au-Prince               1082800    47685
http://dbpedia.org/resource/Oakland_County%2C_Michigan   1194156    45636
http://dbpedia.org/resource/Sana%27a                     1747627    40923
http://dbpedia.org/resource/Milan                        1303437    40923
http://dbpedia.org/resource/Campinas                     1059420    40923
http://dbpedia.org/resource/Hohhot                       2580000    40923
http://dbpedia.org/resource/Brussels                     1031215    40923
http://dbpedia.org/resource/Bogra_District               2988567    40923
http://dbpedia.org/resource/Cort%C3%A9s_Department       1202510    40923
http://dbpedia.org/resource/Berlin                       3416300    35668
http://dbpedia.org/resource/New_York_City                8274527    30810
http://dbpedia.org/resource/Los_Angeles%2C_California    3849378    25614
20 Rows. -- 1733 msec.
Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s 664% cpu 2% read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs
This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm cache. Fair enough for a first crack; this can obviously be optimized further. Still, the geo part of the processing is already as good as instantaneous.
We will shortly have the geography features installed on DBpedia and the other data sets we host. As these come online we will show more demo queries.
For more about SQL/MM, you can look to a couple of PDFs:
Since the questionnaire is public, I am publishing my answers below.
Data and data types
What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?
Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.
This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.
Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules, and the like.
The volume of the well-known linked data sets is around 10 billion statements. With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is especially so if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.
Relevant sections of this mass of data are a potential addition to any present or future analytics application.
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.
Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.
By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.
What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?
All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.
Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
Interleaving of all database functions and types becomes increasingly important.
Industries, communities
Who is producing these data and why? Could they do it better? How?
Right now, projects such as Bio2RDF, Neurocommons, and DBpedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBpedia URIs).
Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
Who is consuming these data and why? Could they do it better? How?
Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird's eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.
Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.
What industrial sectors in Europe could become more competitive if they became much better at managing data?
Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.
Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?
The regulation landscape drives database demand through data retention requirements and the like.
With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.
For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
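As a rough illustration of what an on-the-fly anonymizing view could do, here is a Python sketch; the column names, the salt, and the whole setup are hypothetical, not a description of any shipping feature.

```python
# Sketch of an "on-the-fly anonymizing view": identifiers are replaced
# by salted one-way hashes as rows are read, so joins on the pseudonym
# still work but the original identifier is not exposed.
import hashlib

SALT = b"per-deployment-secret"   # illustrative; would be kept private

def pseudonym(patient_id):
    return hashlib.sha256(SALT + patient_id.encode()).hexdigest()[:12]

def anonymized_view(rows):
    return [{**r, "patient_id": pseudonym(r["patient_id"])} for r in rows]

rows = [{"patient_id": "P123", "diagnosis": "flu"}]
view = anonymized_view(rows)
print(view[0]["diagnosis"])   # the diagnosis survives; the ID is a pseudonym
```

Because the pseudonym is deterministic per deployment, records about the same patient still join to each other inside the view.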
More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.
What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.
We have come across the following:
Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.
Services, software stacks, protocols, standards, benchmarks
What combinations of components are needed to deal with these problems?
Recent times have seen a proliferation of special-purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is a need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in the integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are examples of DBMSs working in this direction.
Interoperability and at least de facto standards in ways of doing this will emerge.
What data exchange and processing mechanisms will be needed to work across platforms and programming languages?
HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.
There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.
For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.
What data environments are today so wastefully messy that they would benefit from the development of standards?
RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.
The community process for developing application-domain-specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.
Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS-* standards, for example.
What kind of performance is expected or required of these systems? Who will measure it reliably? How?
Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.
The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.
These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.
We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.
The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.
Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. But the TPC-H rules do not allow official reporting of such.
Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.
If benchmarks exist and are neither too easy, nor inaccessibly difficult, nor too expensive to run (think of the high-end TPC-C results), then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: everybody runs the TPC workloads internally but few publish.
Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.
Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.
Usability and training
How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?
In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.
Beyond these, programming with MPI, for example, with good utilization of a cluster platform for an arbitrary algorithm is quite difficult. The casual amateur is hereby warned.
There is no single solution. Since explicit, programmatic parallelization (with MPI, for example) scales very poorly in terms of required skill, we should favor declarative and/or functional approaches that allow automatic parallelization.
Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.
For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.
For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.
This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.
One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.
This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.
How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?
For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.
RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, where possible, be deployed in the usual technology stack, LAMP or Java. Of course they do not have to be LAMP or Java or HTML or HTTP themselves, but they must manifest through these.
A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.
For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.
The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.
As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.
If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.
The problem is creating the pull.
Challenges
What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?
The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better: reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles to using these application domains as subject matter for open challenges.
Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.
The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.
The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.
What should one do to set up such a challenge, administer, and monitor it?
The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.
The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.
There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.
Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.
RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.
For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level.
If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph.
On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.
One could write queries like this:
SELECT ?pub
WHERE
{
GRAPH ?g
{
?person foaf:knows ?contact
}
?contact foaf:name "Alice" .
?g xx:has_publisher ?pub
}
This would return the publishers of graphs that assert that somebody knows Alice.
Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.
If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary remains applicable as long as reification is a rarity.
Virtuoso's RDF_QUAD
table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL UPDATE statement can be used to set values for these additional columns if one knows the G, S, P, O of the quad.
Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD
table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL
and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O
are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.
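The key-part versus dependent-part choice can be illustrated with a toy model; this is plain Python sketching the idea, not Virtuoso internals, and the names are made up.

```python
# Toy quad store where the annotation columns (user, timestamp) are
# dependent parts: the key is (G, S, P, O), so re-asserting the same
# quad overwrites the metadata instead of duplicating the quad.
quads = {}  # (g, s, p, o) -> {"user": ..., "ts": ...}

def insert_quad(g, s, p, o, user, ts):
    quads[(g, s, p, o)] = {"user": user, "ts": ts}

insert_quad("g1", 1, 5, 6, "alice", "2009-08-01")
insert_quad("g1", 1, 5, 6, "bob", "2009-08-09")  # same quad, newer metadata

print(len(quads))                      # 1 -- still a single quad
print(quads[("g1", 1, 5, 6)]["user"])  # bob

# Had user/ts been key parts, the key would be (g, s, p, o, user, ts)
# and both insertions would coexist as distinct index entries.
```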
In SPARQL, one could use an extension syntax like this:
SELECT *
WHERE
{ ?person foaf:knows ?connection
OPTION ( time ?ts ) .
?connection foaf:name "Alice" .
FILTER ( ?ts > "2009-08-08"^^xsd:dateTime )
}
This would return everybody who has known Alice since a date more recent than 2009-08-08. It presupposes that the quad table has been extended with a datetime column.
The OPTION (time ?ts)
syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD
via a column ?variable
syntax in the OPTION
clause.
If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO
and a dependent part of R
, where R
would be the reification URI of the quad. Reification statements would then be made with R
as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD
table. The syntax for referring to this could be something like this:
SELECT *
WHERE
{ ?person foaf:knows ?contact
OPTION ( reify ?r ) .
?r xx:assertion_time ?ts .
?contact foaf:name "Alice" .
FILTER ( ?ts > "2008-08-08"^^xsd:dateTime )
}
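A toy sketch of this side-table design follows; it is plain Python, not Virtuoso schema, and the reif: and xx: names are illustrative.

```python
# Side table mapping a quad's key (G, S, P, O) to a reification URI R.
# Statements about the quad are then ordinary triples with R as subject,
# which is far more compact than the RDF reification vocabulary.
reify = {}        # (g, s, p, o) -> R
annotations = []  # plain triples about R

def reify_quad(g, s, p, o):
    key = (g, s, p, o)
    if key not in reify:                 # one R per distinct quad
        reify[key] = f"reif:{len(reify)}"
    return reify[key]

r = reify_quad("g1", "ex:john", "foaf:knows", "ex:alice")
annotations.append((r, "xx:assertion_time", "2009-08-10"))

print(r)                 # reif:0
print(annotations[0])    # the assertion-time annotation on that quad
```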
We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don't think there would be huge demand. Who knows? You tell us.
There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.
Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.
If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application-specific data structures as RDF applies equally to relational data in non-Virtuoso databases, because Virtuoso SQL can federate tables from heterogeneous DBMSs.
On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like.
Installation is simple: just one exe and one config file. There is a huge amount of code in the installers (application code, test suites, and such) but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.
Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL.
To condense further:
We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.
The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: put together, ontologies go over a million classes/properties. Which ones are relevant depends on the task at hand, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF.
To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.
People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.
Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.
We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code with special parallel programming models, can now be done in SQL and SPARQL.
To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider.
Entry costs into relatively high-end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today's retail prices and fits under a desk. For intermittent use, the rent for 1TB RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. As Giovanni Tummarello put it for airplanes: the physics are as they were for da Vinci, but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.
A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not pure SPARQL and linked data principles. Further, by throwing all this together, one got something overcomplicated and too heavy.
I answered as follows, which apparently cannot be repeated too much:
First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo: for databases, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then to justify SPARQL by saying that it is better for heterogeneous data, schema-last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will remain implementation dependent.
Another remark was that joining is slow. It depends. Anything involving more complex disk access than the linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random accesses per second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing is divided over more partitions.
There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated "on the fly" as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0's built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that, by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people, but our message to them is the same: use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entire LOD Cloud. Virtuoso 6.0 (Open Source Edition) "TP1" is now publicly available as a Technology Preview (beta).
We heard that there is an effort for porting Freebase's Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.
As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is "virtualized" into the database cloud or the local secure server, as the use case may require.
For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG.
Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves build software products and spend very little time on papers, so some of the bad reviews we have received do make a legitimate point. By rights we should go for database venues, but we cannot let this take too much time. So we are open to partnering to split the opportunity cost of multiple submissions.
For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.
The term DataSphere comes from Dan Simmons' Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.
This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.
I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person's or organization's contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it.
There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.
These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.
The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.
In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.
The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network: horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.
The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.
For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can't just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google's web page cache can't be relied on as a permanent repository of web content.
Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.
For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user's web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user's Data Space is the collection of all these things, now presented in a queriable form. The user's Data Space is the user's statement of presence, referencing the diverse contributions of the user on diverse sites.
The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain-specific silos. The end user's interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self-describing data, represented in domain-specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it wherever they wish.
Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of "meta-applications," i.e., unified interfaces for managing many related applications/tasks.
Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real-world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data still being at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.
Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of "mashing up" web service interfaces with procedural code, but of "meshing" data through declarative queries across application spaces.
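A sketch of such a "mesh": joining a social graph with review data through shared identifiers, as one declarative SPARQL query. The foaf: and rev: (RDF Review) vocabularies are real; the starting URI and the assumption that both data sets share identifiers are illustrative.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rev:  <http://purl.org/stuff/rev#>

# "What do people in my network rate highly?" -- social data from one
# application space meshed with review data from another.
SELECT ?friend ?item ?rating
WHERE {
  <http://example.org/people/me> foaf:knows ?friend .
  ?item   rev:hasReview ?review .
  ?review rev:reviewer  ?friend ;
          rev:rating    ?rating .
  FILTER ( ?rating >= 4 )
}
```

No procedural glue is involved; the join happens wherever the two vocabularies use the same identifiers.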
The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.
Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.
For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, given that pretty much anything will in practice have many names (e.g., DBpedia vs. Freebase identifiers)? The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. The benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.
The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.
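As a sketch, the "which graphs mention this identifier" question is directly expressible in SPARQL; the identifier here is illustrative, and entity ranking itself is a separate, implementation-specific function.

```sparql
# Find every graph that mentions a given identifier,
# whether as subject or as object.
SELECT DISTINCT ?g
WHERE {
  GRAPH ?g {
    { <http://dbpedia.org/resource/Paris> ?p ?o }
    UNION
    { ?s ?p <http://dbpedia.org/resource/Paris> }
  }
}
```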
The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.
For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.
Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.
One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.
Next we will look at the service provider and infrastructure side of this.
The old problem has been that it is not really practical to pre-compute counts of everything for all possible combinations of search conditions and counting/grouping/sorting, while computing the actual matches at query time also takes time.
Well, neither is in fact necessary. When there are large numbers of items matching the conditions, counting them can take time but then this is the beginning of the search, and the user is not even likely to look very closely at the counts. It is enough to see that there are many of one and few of another. If the user already knows the precise predicate or class to look for, then the top-level faceted view is not even needed. The faceted view for guiding search and precise analytics are two different problems.
There are client-side faceted views like Exhibit or our own ODE. The problem with these is that there are a few orders of magnitude difference between the actual database size and what fits on the user agent. This is compounded by the fact that one does not know what to cache on the user agent because of the open nature of the data web. If this were about a fixed workflow, then a good guess would be possible, but we are talking about the data web, the very soul of serendipity and unexpected discovery.
So we made a web service that will do faceted search on arbitrary RDF. If it does not get complete results within a timeout, it will return what it has counted so far, using Virtuoso's Anytime feature. Looking for subjects with some specific combination of properties is however a bit limited, so this will also do JOINs. Many features are one or two JOINs away; take geographical locations or social networks, for example.
Yet a faceted search should be point-and-click, and should not involve a full query construction. We put the compromise at starting with full text or property or class, then navigating down properties or classes, to arbitrary depth, tree-wise. At each step, one can see the matching instances or their classes or properties, all with counts, faceted-style.
This is good enough for queries like 'what do Harry Potter fans also like' or 'who are the authors of articles tagged semantic web and machine learning and published in 2008'. For complex grouping, sub-queries, arithmetic or such, one must write the actual query.
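Written out by hand, the second example query might look like the following sketch. The dc: terms and the date representation are illustrative; the actual vocabulary depends on the data being browsed.

```sparql
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Authors of articles tagged "semantic web" and "machine learning",
# published in 2008.
SELECT DISTINCT ?author
WHERE {
  ?article dc:creator ?author ;
           dc:subject "semantic web" ;
           dc:subject "machine learning" ;
           dc:date    ?date .
  FILTER ( ?date >= "2008-01-01"^^xsd:date &&
           ?date <= "2008-12-31"^^xsd:date )
}
```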
But one can begin with facets, and then continue refining the query by hand since the service also returns SPARQL text. We made a small web interface on top of the service with all logic server side. This proves that the web service is usable and that an interface with no AJAX, and no problems with browser interoperability or such, is possible and easy. Also, the problem of syncing between a user-agent-based store and a database is entirely gone.
If we are working with a known data structure, the user interface should choose the display by the data type and offer links to related reports. This is all easy to build as web pages or AJAX. We show how the generic interface is done in Virtuoso PL, and you can adapt that or rewrite it in PHP, Java, JavaScript, or anything else, to accommodate use-case specific navigation needs such as data format.
The web service takes an XML representation of the search, which is more restricted and easier to process by machine than the SPARQL syntax. The web service returns the results, the SPARQL query it generated, whether the results are complete or not, and some resource use statistics.
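For illustration only — the element names below are made up, not the service's actual schema — such a request could look roughly like this: a full-text starting point, one property facet, and a counted view of the matching values.

```xml
<!-- Hypothetical request shape; the real schema is defined by the service. -->
<query>
  <text>Harry Potter</text>
  <property iri="http://xmlns.com/foaf/0.1/interest">
    <view type="values-with-counts" limit="20"/>
  </property>
</query>
```

The point of the XML form is that it is restricted to the tree-shaped queries the facet navigation produces, so it is far easier to generate and validate than arbitrary SPARQL text.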
The source of the PL functions, web service, and Virtuoso Server Page (HTML UI) will be available as part of Virtuoso 6.0 and higher. A Programmer's Guide will be available as part of the standard Virtuoso Documentation collection, including on the Virtuoso Open Source Edition website.
Sir Tim said it at WWW08 in Beijing: linked data and the linked data web are the semantic web and the Web done right.
The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.
The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft's Entity Framework technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service-based business model and should interoperate and integrate with everything else: business needs first, schema last.
Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, asset that it is, is no less important; on the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.
It is against this backdrop that this year will play out.
As concerns research, I will again quote Harry Halpin at ESWC 2008: "Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality" [or words to this effect].
Something like the data web, and even the semantic web, will happen. Harry's question was whether this would be the descendant of what is today called semantic web research.
I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though.
Why should such a thing be RDF-based? If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?
The justification for RDF is in reusing and linking to the data and terminology already out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put it all inside the RDBMS, at the lowest level, which you cannot do unless you own the RDBMS. Source access is not enough; you also need the knowledge.
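The shape of those compromises is easy to see in a minimal EAV layout on a generic RDBMS; the schema and id values below are illustrative.

```sql
-- One row per triple; s, p, o are ids into a separate dictionary table.
CREATE TABLE triple (
  s BIGINT NOT NULL,
  p BIGINT NOT NULL,
  o BIGINT NOT NULL,
  PRIMARY KEY (s, p, o)
);

-- A two-pattern SPARQL query is already a self-join; every further
-- triple pattern adds another, and the optimizer has no statistics
-- about per-predicate selectivity unless you build them yourself.
SELECT t1.s
  FROM triple t1
  JOIN triple t2 ON t2.s = t1.s
 WHERE t1.p = 42 AND t1.o = 1001   -- ?x rdf:type :Publication
   AND t2.p = 43 AND t2.o = 2002;  -- ?x :author   <...>
```

Add typed literals, collation order, and the need for several index orders (SPO, POS, and so on), and the case for doing this inside the engine rather than on top of it becomes clear.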
Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.
Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction.
In this light, we welcome voiD (the Vocabulary of Interlinked Datasets), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.
For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords.
For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect.
For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.
For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion-triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI's Líon project. These will provide platforms for further demonstrating the "web" in data web, as in web-scale smart databasing.
2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one's own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time.
An exciting 2009 to all.
Q: What is the storage cost per triple?
Q: What is the cost to insert a triple?
Q: What is the cost to delete a triple? (For the deletion itself, as well as for updating any indices)
Q: What is the cost to search on a given property?
Q: What data types are supported?
Q: What inferencing is supported?
Q: Is the inferencing dynamic, or is an extra step required before inferencing can be used?
Q: Do you support full text search?
Q: What programming interfaces are supported? Do you support the standard SPARQL protocol?
Q: How can data be partitioned across multiple servers?
Q: How many triples can a single server handle?
Q: What is the performance impact of going from a billion to a trillion triples?
Q: Do you support additional metadata for triples, such as timestamps, security tags, etc.?
Q: Should we use RDF for our large metadata store? What are the alternatives?
Q: How multithreaded is Virtuoso?
Q: Can multiple servers run off a single shared disk database?
Q: Can Virtuoso run on a SAN?
Q: How does Virtuoso join across partitions?
Q: Does Virtuoso support federated triple stores? If there are multiple SPARQL end points, can Virtuoso be used to do queries joining between these?
Q: How many servers can a cluster contain?
Q: How do I reconfigure a cluster, adding and removing machines, etc.?
Q: How will Virtuoso handle regional clusters?
Q: Is there a mechanism for terminating long-running queries?
Q: Can the user be asynchronously notified when a long-running query terminates?
Q: How many concurrent queries can Virtuoso handle?
Q: What is the relative performance of SPARQL queries vs. native relational queries?
Q: Does Virtuoso support property tables?
Q: What performance metrics does Virtuoso offer?
Q: What support do you provide for concurrent/multithreaded operation? Is your interface thread-safe?
Q: What level of ACID properties is supported?
Q: Do you provide the ability to atomically add a set of triples, where either all are added or none are added?
Q: Do you provide the ability to add a set of triples respecting the isolation property (so concurrent accessors see either none of the triple values or all of them)?
Q: What is the time to start a database, or to create/open a graph?
Q: What sort of security features are built into Virtuoso?
The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.
You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (M-x shell) is best, but the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files.
To get started, cd into binsrc/tests/lubm.
To verify that this works, you can do
./test_server.sh virtuoso-t
This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.
The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.
The data files themselves are in lubm_8000: 15 files in RDF/XML.
There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.
So now let's go through this procedure.
Start the server:
$ virtuoso-t -f &
This starts the server in foreground mode, and puts it in the background of the shell.
Now we connect to it with the isql utility.
$ isql 1111 dba dba
This gives a SQL> prompt. The default username and password are both dba.
When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
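For example, any ODBC or JDBC client can send the following as an ordinary SQL statement string; the sparql keyword routes it to the SPARQL processor.

```sql
sparql SELECT * WHERE { ?s ?p ?o } LIMIT 10;
```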
The lubm-load.sql file is quite self-explanatory. It begins by defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory. Next it calls this procedure for the lubm_8000 directory under the server's working directory.
sparql CLEAR GRAPH <lubm>;
sparql CLEAR GRAPH <inf>;
load_lubm ( server_root() || '/lubm_8000/' );
Then it verifies that the right number of triples is found in the <lubm> graph.
sparql SELECT COUNT(*) FROM <lubm> WHERE { ?x ?y ?z } ;
The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.
Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.
sparql PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
INSERT INTO GRAPH <lubm> { ?x ub:subOrganizationOf ?z }
FROM <lubm>
WHERE { ?x ub:subOrganizationOf ?y . ?y ub:subOrganizationOf ?z . };
Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.
DB.DBA.TTLP ( file_to_string ( 'inf.nt' ), 'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl', 'inf' ) ;
sparql SELECT COUNT(*) FROM <inf> WHERE { ?x ?y ?z } ;
Then we declare that the triples in the <inf> graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the 'inft' rule set; otherwise this declaration has no effect.
rdfs_rule_set ('inft', 'inf');
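A query then opts in with a pragma. As a sketch against the data loaded above (assuming, as the LUBM ontology does, that ub:Professor has subclasses such as ub:AssistantProfessor declared in inf.nt), the count below includes instances of the subclasses:

```sql
sparql DEFINE input:inference 'inft'
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT COUNT(*) FROM <lubm> WHERE { ?x a ub:Professor } ;
```

Without the DEFINE line, the same query counts only triples whose type is literally ub:Professor.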
This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.
checkpoint;
Now we are ready for querying.
The queries are given in 3 different versions. The first file, lubm.sql, has the queries with most inference open-coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.
If you wish to run all the commands in a SQL file, you can type load <filename>;
(e.g., load lubm-cp.sql;
) at the SQL>
prompt. If you wish to try individual statements, you can paste them to the command line.
For example:
SQL> sparql PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> SELECT * FROM <lubm> WHERE { ?x a ub:Publication . ?x ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0> };
VARCHAR
_______________________________________________________________________
http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
To stop the server, simply type shutdown;
at the SQL>
prompt.
If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like

[HTTPServer]
ServerPort = 8421
ServerRoot = .
ServerThreads = 2

to the end of the virtuoso.ini
file in the lubm
directory. Then shutdown and restart (type shutdown;
at the SQL>
prompt and then virtuoso-t -f &
at the shell prompt).
Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql
. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
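As a sketch of what a parameterized call looks like, the following hypothetical Python snippet builds a request URL for this endpoint. The port matches the ini stanza above; `query` is the standard SPARQL protocol parameter name.

```python
from urllib.parse import urlencode

endpoint = "http://localhost:8421/sparql"
query = "SELECT COUNT(*) FROM <lubm> WHERE { ?x ?y ?z }"
url = endpoint + "?" + urlencode({"query": query})
print(url)
# With the server running, urllib.request.urlopen(url) would execute the query.
```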
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.
To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF.
To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2);
before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.
With a minimal buffer configuration (NumberOfBuffers = 256), the process size stays under 30MB on 32-bit Linux.
The value of this is that one can now have RDF and full text indexing on the desktop without running a Java VM or any other memory-intensive software. And of course, all of SQL (transactions, stored procedures, etc.) is in the same embeddably-sized container.
The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file. The executable size is about 10MB for 32-bit Linux. A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting "off"; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc.
Lite mode offers full SQL and SPARQL/SPARUL (via SPASQL), but disables all HTTP-based services (WebDAV, application hosting, etc.). Clients can still use all typical database access mechanisms (i.e., ODBC, JDBC, OLE-DB, ADO.NET, and XMLA) to connect, including the Jena and Sesame frameworks for RDF. ODBC now offers full support of RDF data types for C-based clients. A Redland-compatible API also exists, for use with Redland v1.0.8 and later.
Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost.
Shipping an embedded Virtuoso is easy. It just takes one executable and one configuration file. Performance is generally comparable to "normal" mode, except that Lite will be somewhat less scalable on multicore systems.
The Lite mode will be included in the next Virtuoso 5 Open Source release.
This is complex, so I will begin with the point; the interested may read on for the details and implications. Starting with the soon-to-be-released version 6, Virtuoso allows you to say that two things, if they share a uniquely identifying property, are the same. Examples of uniquely identifying properties would be a book's ISBN, or a person's social security number plus full name. In relational language this is a unique key; in RDF parlance, an inverse functional property.
In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut.
In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set.
On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems.
To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name
will do for an example. This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name
in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as "same enough," but a query about criminal prosecution would obviously need to be more precise about sameness.
Our ontology is defined like this:
-- Populate a named graph with the triples you want to use in query time inferencing
ttlp ( '
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
foaf:mbox_sha1sum a owl:InverseFunctionalProperty .
foaf:name a owl:InverseFunctionalProperty .
', 'xx', 'b3sifp' );
-- Declare that the graph contains an ontology for use in query time inferencing
rdfs_rule_set ( 'http://example.com/rules/b3sifp#', 'b3sifp' );
Then use it:
sparql DEFINE input:inference "http://example.com/rules/b3sifp#" SELECT DISTINCT ?k ?f1 ?f2 WHERE { ?k foaf:name ?n . ?n bif:contains "'Kjetil Kjernsmo'" . ?k foaf:knows ?f1 . ?f1 foaf:knows ?f2 };
VARCHAR                                 VARCHAR                                          VARCHAR
______________________________________  _______________________________________________  ______________________________
http://www.kjetil.kjernsmo.net/foaf#me  http://norman.walsh.name/knows/who/robin-berjon  http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me  http://norman.walsh.name/knows/who/robin-berjon  http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me  http://norman.walsh.name/knows/who/robin-berjon  http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me  http://norman.walsh.name/knows/who/robin-berjon  http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me  http://norman.walsh.name/knows/who/robin-berjon  http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me  http://norman.walsh.name/knows/who/robin-berjon  http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me  http://norman.walsh.name/knows/who/robin-berjon  http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me  http://norman.walsh.name/knows/who/robin-berjon  http://twitter.com/perigrin
....
Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person's FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp
just means that all things with a matching foaf:name
or foaf:mbox_sha1sum
are the same.
Sameness means that two are the same for purposes of DISTINCT
or GROUP BY
, and if two are the same, then both have the UNION
of all of the properties of both.
If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT
.
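The distinction can be made concrete with a toy Python sketch (identifiers invented): two blank nodes share a foaf:name; a naive smoosh copies the union of properties onto both but leaves two subjects, while canonicalization collapses them into one.

```python
# Two blank nodes with the same name, one distinct property each.
props = {"_:a": {("name", "Jo"), ("nick", "jo1")},
         "_:b": {("name", "Jo"), ("nick", "jo2")}}

union = props["_:a"] | props["_:b"]

# Naive smoosh: both aliases get all properties, but DISTINCT still sees two subjects.
naive = {s: union for s in props}
print(len(naive))

# Canonicalized: one alias is chosen as the canonical subject for the shared name.
canonical = {"_:a": union}
print(len(canonical))
```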
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs
triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable: nothing like the blow-up of duplicating properties across instances.
The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others.
The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at.
The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty.
Enough talk. Time for an experiment. We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:
For the experiment, we will consider two people the same if they have the same foaf:name
and are both instances of foaf:Person
. This gets some extra hits but should not be statistically significant.
The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id
converts from the qualified name of an IRI to its ID and id_to_iri
does the reverse.
We count the triples that enter into the smoosh:
-- The name condition is an existence test; otherwise we would count several times
-- more, due to the names occurring in many graphs.
sparql SELECT COUNT(*) WHERE { { SELECT DISTINCT ?person WHERE { ?person a foaf:Person } } . FILTER ( bif:exists ( SELECT (1) WHERE { ?person foaf:name ?nn } ) ) . ?person ?p ?o };
-- We get 3284674
We make a few tables for intermediate results.
-- For each distinct name, gather the properties and objects from
-- all subjects with this name
CREATE TABLE name_prop (
    np_name ANY,
    np_p    IRI_ID_8,
    np_o    ANY,
    PRIMARY KEY (np_name, np_p, np_o)
);
ALTER INDEX name_prop ON name_prop PARTITION (np_name VARCHAR (-1, 0hexffff));

-- Map from name to canonical IRI used for the name
CREATE TABLE name_iri (
    ni_name ANY PRIMARY KEY,
    ni_s    IRI_ID_8
);
ALTER INDEX name_iri ON name_iri PARTITION (ni_name VARCHAR (-1, 0hexffff));

-- Map from person IRI to canonical person IRI
CREATE TABLE pref_iri (
    i    IRI_ID_8,
    pref IRI_ID_8,
    PRIMARY KEY (i)
);
ALTER INDEX pref_iri ON pref_iri PARTITION (i INT (0hexffff00));

-- A table for the materialization where all aliases get all properties of every other
CREATE TABLE smoosh_ct (
    s IRI_ID_8,
    p IRI_ID_8,
    o ANY,
    PRIMARY KEY (s, p, o)
);
ALTER INDEX smoosh_ct ON smoosh_ct PARTITION (s INT (0hexffff00));
-- Disable transaction log and enable row auto-commit. This is necessary; otherwise
-- bulk operations are done transactionally and will run out of rollback space.
LOG_ENABLE (2);
-- Gather all the properties of all persons with a name, under that name.
-- INSERT SOFT means that duplicates are ignored.
INSERT SOFT name_prop SELECT "n", "p", "o" FROM ( sparql DEFINE output:valmode "LONG" SELECT ?n ?p ?o WHERE { ?x a foaf:Person . ?x foaf:name ?n . ?x ?p ?o } ) xx ;
-- Now choose for each name the canonical IRI
INSERT INTO name_iri SELECT np_name, ( SELECT MIN (s) FROM rdf_quad WHERE o = np_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ) AS mini FROM name_prop WHERE np_p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- For each person IRI, map to the canonical IRI of that person
INSERT SOFT pref_iri (i, pref) SELECT s, ni_s FROM name_iri, rdf_quad WHERE o = ni_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- Make a graph where each person has one IRI with all the properties of all aliases,
-- and where person-to-person refs are canonicalized
INSERT SOFT rdf_quad (g, s, p, o)
SELECT IRI_TO_ID ('psmoosh'), ni_s, np_p,
       COALESCE ( ( SELECT pref FROM pref_iri WHERE i = np_o ), np_o )
FROM name_prop, name_iri
WHERE ni_name = np_name
OPTION ( loop, quietcast ) ;
-- A little explanation: The properties of names are copied into rdf_quad with the name
-- replaced by its canonical IRI. If the object has a canonical IRI, this is used as
-- the object; else the object is unmodified. This is the COALESCE with the sub-query.
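The same COALESCE logic as a small Python sketch, with invented IRIs: each property of a name is re-attached to the name's canonical IRI, and the object is rewritten to its canonical IRI when one exists.

```python
name_iri = {"Alice Example": "iri:alice1"}        # name -> canonical subject IRI
pref_iri = {"iri:alice2": "iri:alice1"}           # alias IRI -> canonical IRI
name_prop = [("Alice Example", "knows", "iri:alice2"),
             ("Alice Example", "nick", "ali")]

# COALESCE ((SELECT pref ...), np_o) becomes dict.get(o, o):
psmoosh = [(name_iri[n], p, pref_iri.get(o, o)) for n, p, o in name_prop]
print(psmoosh)
```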
-- This takes a little time. To check on the progress, take another connection to the
-- server and do
STATUS ('cluster');
-- It will return something like
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i
-- buffers 549481 253929 d 8 w 0 pfs

-- Now finalize the state; this makes it permanent. Otherwise the work will be lost on
-- server failure, since there was no transaction log
CL_EXEC ('checkpoint');
-- See what we got
sparql SELECT COUNT (*) FROM <psmoosh> WHERE {?s ?p ?o};
-- This is 2253102
-- Now make the copy where all have the properties of all synonyms. This takes so much
-- space that we do not insert it as RDF quads, but make a special table for it so that
-- we can run some statistics. This saves time.
INSERT SOFT smoosh_ct (s, p, o) SELECT s, np_p, np_o FROM name_prop, rdf_quad WHERE o = np_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- as above, INSERT SOFT so as to ignore duplicates
SELECT COUNT (*) FROM smoosh_ct;
-- This is 167360324
-- Find out where the bloat comes from
SELECT TOP 20 COUNT (*), ID_TO_IRI (p) FROM smoosh_ct GROUP BY p ORDER BY 1 DESC;
The results are:
54728777  http://www.w3.org/2002/07/owl#sameAs
48543153  http://xmlns.com/foaf/0.1/knows
13930234  http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512  http://xmlns.com/foaf/0.1/interest
11415867  http://xmlns.com/foaf/0.1/nick
 6683963  http://xmlns.com/foaf/0.1/weblog
 6650093  http://xmlns.com/foaf/0.1/depiction
 4231946  http://xmlns.com/foaf/0.1/mbox_sha1sum
 4129629  http://xmlns.com/foaf/0.1/homepage
 1776555  http://xmlns.com/foaf/0.1/holdsAccount
 1219525  http://xmlns.com/foaf/0.1/based_near
  305522  http://www.w3.org/1999/02/22-rdf-syntax-ns#type
  274965  http://xmlns.com/foaf/0.1/name
  155131  http://xmlns.com/foaf/0.1/dateOfBirth
  153001  http://xmlns.com/foaf/0.1/img
  111130  http://www.w3.org/2001/vcard-rdf/3.0#ADR
   52930  http://xmlns.com/foaf/0.1/gender
   48517  http://www.w3.org/2004/02/skos/core#subject
   45697  http://www.w3.org/2000/01/rdf-schema#label
   44860  http://purl.org/vocab/bio/0.1/olb
Now compare with the predicate distribution of the smoosh with identities canonicalized
sparql SELECT COUNT (*) ?p FROM <psmoosh> WHERE { ?s ?p ?o } GROUP BY ?p ORDER BY 1 DESC LIMIT 20;
Results are:
748311  http://xmlns.com/foaf/0.1/knows
548391  http://xmlns.com/foaf/0.1/interest
140531  http://www.w3.org/2000/01/rdf-schema#seeAlso
105273  http://www.w3.org/1999/02/22-rdf-syntax-ns#type
 78497  http://xmlns.com/foaf/0.1/name
 48099  http://www.w3.org/2004/02/skos/core#subject
 45179  http://xmlns.com/foaf/0.1/depiction
 40229  http://www.w3.org/2000/01/rdf-schema#comment
 38272  http://www.w3.org/2000/01/rdf-schema#label
 37378  http://xmlns.com/foaf/0.1/nick
 37186  http://dbpedia.org/property/abstract
 34003  http://xmlns.com/foaf/0.1/img
 26182  http://xmlns.com/foaf/0.1/homepage
 23795  http://www.w3.org/2002/07/owl#sameAs
 17651  http://xmlns.com/foaf/0.1/mbox_sha1sum
 17430  http://xmlns.com/foaf/0.1/dateOfBirth
 15586  http://xmlns.com/foaf/0.1/page
 12869  http://dbpedia.org/property/reference
 12497  http://xmlns.com/foaf/0.1/weblog
 12329  http://blogs.yandex.ru/schema/foaf/school
We can drop the owl:sameAs
triples from the count, making the bloat somewhat smaller, but it is still tens of times larger than the canonicalized copy or the initial state.
Now, when we try using the psmoosh graph, we still get different results from the results with the original data. This is because foaf:knows
relations to things with no foaf:name
are not represented in the smoosh. Such triples do exist:
sparql SELECT COUNT (*) WHERE { ?s foaf:knows ?thing . FILTER ( !bif:exists ( SELECT (1) WHERE { ?thing foaf:name ?nn } ) ) };
-- 1393940
So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here.
Finally, we calculate the smoosh blow up factors. We do not include owl:sameAs
triples in the counts.
select (167360324 - 54728777) / 3284674.0;
-- 34.290022997716059

select 2229307 / 3284674.0;
-- 0.678699621332284
So, a smoosh that is not even a true equivalent of the original is either about 34 times or about 0.68 times the original triple count, depending on whether synonyms are collapsed or not.
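The factors can be re-derived from the counts reported above; the 2229307 is the psmoosh graph minus its own owl:sameAs rows.

```python
original  = 3_284_674           # triples entering the smoosh
smoosh_ct = 167_360_324         # rows in the naive smoosh table
same_as   = 54_728_777          # owl:sameAs rows in the naive smoosh
psmoosh   = 2_253_102 - 23_795  # canonical graph minus its owl:sameAs rows

print(round((smoosh_ct - same_as) / original, 2))  # 34.29
print(round(psmoosh / original, 2))                # 0.68
```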
Making the smooshes does not take very long: some minutes for the small one. Inserting the big one as quads would take longer, a couple of hours maybe; filling the smoosh_ct table took 33 minutes. The runs were not made with optimal tuning, so the performance numbers serve only to show that smooshing takes time: probably more time than allowable in an interactive situation, no matter how the process is optimized.
As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we'll do if, for example, somebody decides to count all the triples in the set.
How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?
Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.
Here we are looking for the top 10 people whom people claim to know without being known in return, like this:
SQL> sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10;
celeb                                     callret-1
VARCHAR                                   VARCHAR
________________________________________  _________
http://twitter.com/BarackObama            252
http://twitter.com/brianshaler            183
http://twitter.com/newmediajim            101
http://twitter.com/HenryRollins           95
http://twitter.com/wilw                   81
http://twitter.com/stevegarfield          78
http://twitter.com/cote                   66
mailto:adam.westerski@deri.org            66
mailto:michal.zaremba@deri.org            66
http://twitter.com/dsifry                 65
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages
SQL> sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10;
celeb                                     callret-1
VARCHAR                                   VARCHAR
________________________________________  _________
http://twitter.com/JasonCalacanis         496
http://twitter.com/Twitterrific           466
http://twitter.com/ev                     442
http://twitter.com/BarackObama            356
http://twitter.com/laughingsquid          317
http://twitter.com/gruber                 294
http://twitter.com/chrispirillo           259
http://twitter.com/ambermacarthur         224
http://twitter.com/t                      219
http://twitter.com/johnedwards            188
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages
The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the more complete counts. But the response time was the same.
If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY
to the ORDER BY
. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation.
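A toy Python sketch of the layering (function names and the time budget are invented): the inner GROUP BY stops scanning at its deadline and hands its partial groups onward, so whatever was accumulated still flows through the ORDER BY / LIMIT layer and produces a ranked answer.

```python
import time

def grouped_count(rows, deadline):
    groups = {}
    for key in rows:
        if time.monotonic() > deadline:
            break                     # inner timeout: keep the partial groups
        groups[key] = groups.get(key, 0) + 1
    return groups

def top_k(rows, k, budget_seconds):
    deadline = time.monotonic() + budget_seconds
    partial = grouped_count(rows, deadline)
    # Accumulated groups are still sorted and limited: the (possibly partial)
    # GROUP BY result feeds the ORDER BY layer instead of being discarded.
    return sorted(partial.items(), key=lambda kv: -kv[1])[:k]

print(top_k(["a", "a", "b", "a", "c"], 1, 5.0))
```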
One can also use this as a basis for paid services. The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.
This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.
The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task.
Below we show how we do 3.3 million random triple lookups per second on two 8 core commodity servers producing complete results, joining across partitions. On a single 4 core server, the figure is about 1 million lookups per second. With a single thread, it is about 250K lookups per second. This is the good case. But even our worse case is quite decent.
We took a simple SPARQL query, counting how many people say they reciprocally know each other. In the Billion Triples Challenge data set, there are 25M foaf:knows
quads of which 92K are reciprocal. Reciprocal here means that when x knows y in some graph, y knows x in the same or any other graph.
SELECT COUNT (*) WHERE { ?p1 foaf:knows ?p2 . ?p2 foaf:knows ?p1 }
There is no guarantee that the triple of x knows y
is in the same partition as the triple y knows x. Thus the join is randomly distributed, n partitions to n partitions.
We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking. Since then, we have corrected this.
If run on a single thread, this query would be a loop over all the quads with a predicate of foaf:knows
, and an inner loop looking for a quad with 3 of 4 fields given (SPO
). If we have a partitioned situation, we have a loop over all the foaf:knows
quads in each partition, and an inner lookup looking for the reciprocal foaf:knows
quad in whatever partition it may be found.
We have implemented this with two different message patterns:
Centralized: One process reads all the foaf:knows
quads from all processes. Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad. Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end.
Symmetrical: Each process reads the foaf:knows
quads in its partition, and sends a batch of checks to each process that could have the reciprocal foaf:knows
quad every 50K quads. At the end, the counts are gathered from all partitions. There is some additional control traffic but we do not go into its details here.
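A small Python model of the symmetrical pattern (partition count, edges, and the hash function are invented): each partition scans its own knows-edges and batches reciprocal probes to the partition that could hold the reverse edge.

```python
N = 4
def part(node):                       # partition by the subject of the quad
    return hash(node) % N

edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c")]
partitions = [set() for _ in range(N)]
for x, y in edges:
    partitions[part(x)].add((x, y))

# Each partition batches probes asking "does (y, x) exist?" to the owner of y.
probes = [[] for _ in range(N)]
for p in partitions:
    for x, y in p:
        probes[part(y)].append((y, x))

reciprocal = sum(1 for i in range(N) for q in probes[i] if q in partitions[i])
print(reciprocal)  # 4: a<->b and a<->c, one count per direction
```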
Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 Virtuoso instances. The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache.
Centralized: 35,543 msec, 728,634 sequential + random lookups per second
Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s 316% cpu ...
Symmetrical: 7706 msec, 3,360,740 sequential + random lookups per second
Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s 1137% cpu ...
The second line is the summary from the cluster status report for the duration of the query. The interesting numbers are the KB/s and the %CPU. The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%. The point to note is that the symmetrical approach takes 4x less real time with under half the data transfer rate. Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation.
These figures represent the best and worst cases of distributed JOIN
ing. If we have a straight sequence of JOIN
s, with single pattern optionals and existences and the order in which results are produced is not significant (i.e., there is aggregation, existence test, or ORDER BY
), the symmetrical pattern is applicable. On the other hand, if there are multiple triple pattern optionals, complex sub-queries, DISTINCT
s in the middle of the query, or results have to be produced in the order of an index, then the centralized approach must be used at least part of the time.
Also, if we must make transitive closures, which can be thought of as an extension of a DISTINCT
in a subquery, we must pass the data through a single point before moving the bindings to the next JOIN
in the sequence. This happens for example in resolving owl:sameAs
at run time. However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, DISTINCT
s, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens-of-thousands of variable bindings into a single message exchange. And if there are not that many intermediate results, then single query execution time is not a problem anyhow.
For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the foaf:knows
relations and then running the foaf:knows
relations through the hash. If the hash size is right, a hash lookup is somewhat better than an index lookup. The problem is that when the hash join is not the right solution, it is an expensive mistake: the best case is good; the worst case is very bad. But if there is no index then hash join is better than nothing. One problem of hash joins is that they make temporary data structures which, if large, will skew the working set. One must be quite sure of the cardinality before it is safe to try a hash join. So we do not do hash joins with RDF, but we do use them sometimes with relational data.
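For completeness, a minimal Python sketch of the hash-join variant on the reciprocal-knows query: fill a hash table from the knows-edges keyed by subject, then stream the same edges through it as the probe side.

```python
def reciprocal_hash_join(edges):
    by_subject = {}
    for s, o in edges:                       # build side: index edges by subject
        by_subject.setdefault(s, set()).add(o)
    # Probe side: (s knows o) matches when (o knows s) is in the hash table.
    return [(s, o) for s, o in edges if s in by_subject.get(o, set())]

rows = reciprocal_hash_join([("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c")])
print(len(rows))  # 4
```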
These same methods apply to relational data just as well. This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.
So we decided to do it ourselves.
The score is (updated with revised innodb_buffer_pool_size
setting, based on advice noted down below):
| n-clients | Virtuoso   | MySQL (increased buffer pool size) | MySQL (default buffer pool size) |
|-----------|------------|------------------------------------|----------------------------------|
| 1         | 41,161.33  | 27,023.11                          | 12,171.41                        |
| 4         | 127,918.30 | (pending)                          | 37,566.82                        |
| 8         | 218,162.29 | 105,524.23                         | 51,104.39                        |
| 16        | 214,763.58 | 98,852.42                          | 47,589.18                        |
The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here.
The benchmark is pure SQL, nothing to do with SPARQL or RDF.
The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux.
The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache. The test effectively runs from memory. We ran ANALYZE TABLE on each MySQL table but noticed that this had no effect. Virtuoso does its stats sampling on the go; possibly MySQL does too, since the explicit stats made no difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases; this was not apparent in the tests.
The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine --
MySQL ought to do better. We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also, we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this; the public is invited to tell us about it.
Update:
Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size
setting in the MySQL config. We did, and it produced some improvement. Indeed, this is more like it: we now see CPU utilization around 700% instead of the 300% of the previously published run, which had made that run suspect. Also, our experiments with TPC-D had led us to expect better. We ran these things a few times so as to have warm cache.
On the first run, we noticed that the InnoDB warm-up time was well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run the larger-scale system for half an hour first, then measure, and then measure again. If the second measurement is the same as the first, then it is good.
As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.
We got a number of questions about Virtuoso's inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.
Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.
For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.
We had a word with Ian Horrocks on the question. He noted that the community is often naive in tending to equate the description of semantics with the description of an algorithm. The data need not always be blown up.
The advantage of not always materializing is that the working set stays smaller and fits in memory longer. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes, or is retracted, or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement: it can lead to a lot of effects that are hard to retract. On the other hand, when running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if entailed subclasses and sub-properties are materialized rather than resolved at run time.
Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al., for their paper on using genetic algorithms to guess how variables in a SPARQL query ought to be instantiated. Prisoners of "conventional wisdom" that we are, we might never have thought of this.
It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge.
There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware.
It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading
The live demo is at http://b3s.openlinksw.com/. This site is under development and may not be on all the time. We are taking it in the direction of hosting the whole LOD cloud. This is an evolving operation where we will continue showcasing how one can ask increasingly interesting questions from a growing online database, in the spirit of the billion triples charter.
In the words of Jim Hendler, we were not selected for the finale because this would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this, since if the event becomes like the TPC benchmarks, it will limit entry to full-time database players. Anyway, we got a special mention in the intro of the challenge track.
The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the DARQ Matter of Federation post, we have a notion of how to do this production-strength with our cluster engine, now also over wide area networks. We shall see.
The entries from Deri and LARKC (MaRVIN, "Massive RDF Versatile Inference Network") were doing materialization of inference results in a cluster environment. The thing they were not doing was joining across partitions. Thus, the data was partitioned on whatever criterion and then the data in each partition was further refined according to rules known to all partitions. Deri did not address joining further.
"Nature shall be the guide of the alchemist," goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for information is expensive and subject to misunderstanding; asking another research group might never produce an answer.
Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat-latency point-to-point connectivity: some joins are fast; others are not even tried, for all we know.
A database running on a cluster is a sort of counter-example. A database with RDF workload will end up joining across partitions pretty much all of the time.
MaRVIN's approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things.
The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes.
Another practical consideration is the fact that RDF reasoners are not usually built for distributed memory multiprocessors. If the reasoner must be a plug-in component, then it cannot be expected to be written for grids.
We can think of a product safety use case: Find cosmetics that have ingredients that are considered toxic in the amounts they are present in each product. This can be done as a database query with some transitive operations, like running through a cosmetics taxonomy and a poisons database. If the business logic deciding whether the presence of an ingredient in the product is a health hazard is very complex, we can get a lot of joins.
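A sketch of such a query, with an entirely hypothetical vocabulary, might look as follows; the transitive walk through the product taxonomy and the more complex hazard rules would be layered on top of this core join:

```sparql
# Hypothetical vocabulary (ex:, tox:) for illustration only
PREFIX ex:  <http://example.com/cosmetics#>
PREFIX tox: <http://example.com/toxicology#>

SELECT ?product ?substance
WHERE {
  ?product    ex:hasIngredient  ?use .
  ?use        ex:substance      ?substance ;
              ex:amountMg       ?amount .
  ?substance  tox:toxicAboveMg  ?limit .
  FILTER (?amount > ?limit)   # the real business logic can be far richer
}
```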
The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then see if matches are made. The matching logic could be arbitrarily complex since it would run locally. Of course here, some domain knowledge is needed in order to set up the processing so that each product and poison carry all the associated information with them. Dancing with half a partner can bias one's perceptions: Again, it is like nature, sometimes not all cards are on the table.
It would seem that there is some setup involved before answering a question: Composition of partitions, frequency of result exchange, etc. How critical the domain knowledge implicit in the setup is for the quality of results is an interesting question.
The question is, at what point will a cluster using distributed database operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput. The increase is more or less linear, depending on whether all intermediate results pass via one coordinating node (worst case) or whether each node can decide which other node will do the next join step for each result (best case). For example, a DISTINCT operation requires that data pass through a single place, but JOINs and aggregation in general do not.
We will still publish numbers this November.
The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from SPARQL to SQL. There was a lot of agreement and the meeting even finished ahead of the allotted time.
There was discussion concerning using the Entity Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM's reference cannot depend on any registration. Also, even when there is a stable URI for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.
Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with "same as" assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.
We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as OWL ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has XPath in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the information schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.
Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of data, such a description of the exposed data is essential.
We agreed with Sören Auer that we could make Virtuoso's mapping language compatible with Triplify. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.
There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.
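To make the contrast concrete, a Triplify-style mapping is plain SQL in which the column aliases name the RDF properties; the table and column names below are hypothetical:

```sql
-- Triplify-style sketch: each row becomes a subject, and each aliased
-- column becomes a property of it (hypothetical schema)
SELECT id,
       name  AS 'foaf:name',
       email AS 'foaf:mbox'
  FROM users
```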
Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get-go.
I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible.
Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files.
Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that this will probably have, especially with a memory-mapped file, a better best case than a B-tree, but that it will also be less predictable under fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With disk-based memory-mapped linked-list storage, we would see some improvement there, while probably getting hit worse than now by fragmentation. Plus, compaction on the fly would not be nearly as easy, and surely far less local, if there were pointers between pages. So my intuition is that trees are a safer bet for varying workloads, while linked lists can be faster in a query-dominated in-memory situation.
Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.
Regardless of steady state, BSBM has two generally valid conclusions:
Mike Dean asked whether BSBM was a case of a setup to have triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged; take the Billion Triples Challenge data set for one.
Also, with BSBM, one should note that query optimization time plays a fairly large role, since most queries touch relatively little data. Also, even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL, since the difference there is compiling the query, especially since parameters are not used. So, Chris, since we both like to map, let's make a benchmark that shows mapping closer to native SQL.
When we run Virtuoso relational against Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column. Also, since the number of total joins is lower, the overhead coming from moving from one join to the next is likewise lower. This is just a matter of the count of executed instructions.
A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local: the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of the control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row. With RDF, this is not a given. Also, the column store's row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.
There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. For the ideal row store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables.
Concerning the future direction of the workshop, there were a few directions suggested. One of the more interesting ones was Mike Dean's suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.
I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times.
I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer).
Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.
This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store.
We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there.
The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.
There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).
Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering.
The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.
This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story; for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?
The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there, adapting it, and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?
We should answer in multiple parts.
For general collateral, like web sites and documentation:
The web site gives a confused product image. For the Virtuoso product, we should divide at the top into
For each point, one simple statement. We all know what the above things mean.
Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means that much more data or, in some cases, that many more requests per second. This too is clear.
As far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going up against Apache in a web-server-only situation. The fact that we have a web listener is important for some things, but our claim to fame does not rest on this.
Then for documentation and training materials: The documentation should be better. Specifically, it should have more of a how-to dimension, since nobody reads the whole thing anyhow. As for the online tutorials, the order of presentation should be different; they do not really reflect what is important at the present moment.
Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because they do not have enough references and do not explain what is obvious to us.
I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.
One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.
This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.
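As an illustration, a minimal virtuoso.ini can be only a few lines; the parameter names below are from the stock configuration, but the values are illustrative only, not a tuning recommendation:

```ini
; Minimal sketch of a virtuoso.ini -- values are illustrative
[Database]
DatabaseFile = virtuoso.db

[Parameters]
ServerPort      = 1111      ; SQL/ODBC listener
NumberOfBuffers = 1000000   ; database cache, in 8 KB pages
MaxDirtyBuffers = 750000
```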
Now for the difficult side:
Use SQL and SPARQL; use stored procedures whenever there is server-side business logic. For some time-critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: PHP, Java, or anything else. For web services, simple is best. Stick to basics. "The engineer is one who can invent a simple thing." Use SQL statements rather than the admin UI.
Know that you can start a server with no database file and get an initial database with nothing extra. The demo database, the way it is produced by the installers, is cluttered.
We should put this into a couple of use case oriented how-tos.
Also, we should create a network of "friendly local virtuoso geeks" for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.
Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.
If one really wants to do one's server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.
We are talking about such things with different parties at present.
"Webby means distributed, heterogeneous, open; not monolithic consolidation of everything."
We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.
Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.
Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.
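For the curious, the "one line" in question would be a GRDDL transformation link in the report template. The profile URI below is the one defined by the GRDDL specification, while the stylesheet URL is hypothetical:

```xml
<!-- Sketch: the profile URI is GRDDL's; the XSLT URL is hypothetical -->
<head profile="http://www.w3.org/2003/g/data-view">
  <link rel="transformation" href="http://example.org/xbrl2rdf.xsl" />
</head>
```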
The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.
Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies, as opposed to just semantic, is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems, even if they were smart. The expert systems of the 1980s are a case in point.
Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.
With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.
For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.
This makes some things possible that were hard thus far.
On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.
Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.
Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.
Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.
There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern.
The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources.
I'd say one will develop them much the same way as thus far. In PHP, for example. Whether one's query language is SPARQL or SQL does not make a large difference in how basic web UI is made.
A SPARQL end-point is no more an end-user item than a SQL command-line is.
A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.
The idea of generating UI from data, whether relational or data-web, is so seductive that generation after generation of developers falls for it. Even I, at OpenLink, after supposedly having been around the block a couple of times, made some experiments around the topic. What does make sense is putting a thin wrapper of HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties in the data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good.
The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? "Don't mode me in" was the slogan, as I recall.
Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.
For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.
For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have real edge must be more capable of ad hoc viewing and navigation than regular real-time or line of business user interfaces.
The OpenLink Data Explorer idea of a "data behind the web page" view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this.
When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. That acronym jungle is not good messaging. Just say linked, flexibly repurposable data, with rich vocabularies and structure -- just the right amount of structure for the application, less rigid and easier to change than relational.
Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure -- HTTP, URIs, and XML where desired.
It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired.
Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an e-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students, and anyway they do not hang around online browsing the course catalogue.
On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the time of last login would by default be several times less efficient than with an RDBMS.
If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently.
In the near term we will see the birth of an application paradigm for the data web. The data will be open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.
Sören Auer asked me to say a few things about relational to RDF mapping. I will cite some highlights from this, as they pertain to the general scene. There was an "open hacking" session Wednesday night featuring lightning talks. I will use some of these too as a starting point.
The SWEO (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became "Data Unleashed." Why not? Let's give this some context.
So, if we are holding a Semantic Web 101 session, where should we begin? I would hazard that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world.
Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with myopenlink.net and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our Billion Triples Challenge demo. We will make a webcast about this to make it all clearer.
Behold: The Data Web is about the world becoming a database; writing SPARQL queries or triples is incidental. You will write FOAF files by hand just as little as you now write SQL insert statements for filling in your account information on Myspace.
Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the RDBMS was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more ad hoc. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced.
Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity?
A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain.
However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic.
The most reliable of human drives is the drive to make oneself known. This drives everything, from any social scene to business communications to politics. Today, when you want to proclaim that you exist, you do so first on the Web. The Web did not become the prevalent medium because businesses loved it for its own sake; it became prevalent because they could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web.
Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data.
Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with linked open data. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question.
We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times.
But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global, information-driven economy?
The Data Web is about making this happen: first in published, generally available data; next with enterprises keeping their private data for their own use but still linking toward the outside. Even though private data stays private, you can still use standard terms and taxonomies, where they apply, when talking of proprietary information.
At the lightning talks in Vienna, one participant said, "Man's enemy is not the lion that eats men, it's his own brother. Semantic Web's enemy is the XML Web services stack that ate its lunch." There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze.
SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the ad hoc component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit.
"I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it's too complex, the developer who says it is not what he's used to, and so on?"
This is an education question. One of the findings of SWEO's enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF-based technologies offer to make these more open-standards based, and open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human resources assets are: experts can't be found even when they are in the next department, or reports and analyses get buried in wikis, spreadsheets, and emails.
Just as when SQL took off, we need vendors to do workshops on getting started with the technology. The affair in Vienna was a step in this direction. Another type of event, specifically focusing on vertical problems and their Data Web solutions, is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies, or one on making enterprise knowledge bases from HR, CRM, office automation, wikis, etc. The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.
Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post.
There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.
Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.
We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. Along the way, we also ran BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future, as it will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for real-time lookups or for ETL.
RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol.
In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM's primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web.
I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.
Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.
Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, an RDB and mapping is the way to go. With Virtuoso, this can perfectly well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples of it later.
The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also, SPARQL, once it has the necessary extensions for aggregation and subqueries, is nicer than SQL, especially when we have subclasses and subproperties, transitivity, and "same as" enabled. These things have some run-time cost, and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.
We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.
We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.
We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.
Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.
The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.
Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like "list the top 10 attributes with the most distinct values for all persons" cannot be done in SQL: SQL simply does not allow the columns to be variable.
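To see why the predicate itself must be a variable, here is a toy version of that query in Python over an in-memory list of triples. The sample triples are invented for illustration; this is a sketch of the idea, not of how Virtuoso evaluates it.

```python
from collections import defaultdict

# Toy triple store: (subject, predicate, object) tuples; sample data invented.
triples = [
    ("p1", "rdf:type", "foaf:Person"),
    ("p1", "foaf:nick", "alice"),
    ("p1", "foaf:interest", "jazz"),
    ("p2", "rdf:type", "foaf:Person"),
    ("p2", "foaf:nick", "bob"),
    ("p2", "foaf:interest", "ska"),
]

persons = {s for s, p, o in triples if p == "rdf:type" and o == "foaf:Person"}

# Collect the distinct values of every predicate used on a person.
values = defaultdict(set)
for s, p, o in triples:
    if s in persons and p != "rdf:type":
        values[p].add(o)

# Rank predicates by distinct-value count.  The predicate is a variable here,
# which is exactly what a fixed SQL column layout cannot express.
top = sorted(values.items(), key=lambda kv: -len(kv[1]))[:10]
print([(p, len(v)) for p, v in top])
```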
Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.
The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.
Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.
We use Virtuoso 6 Cluster Edition to demonstrate the following:
The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension.
The demo queries can also be run directly against the SPARQL end point.
The demo is being worked on at the time of submission and may be shown online by appointment.
Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008, the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and OpenCyc ontologies.
Also, examples involving owl:sameAs are being added, as are similarity metrics and search-hit scores.
The database consists of the billion triples data sets plus some additions such as UMBEL. Also, the Freebase extract is newer than the challenge original.
The triple count is 1115 million.
In the case of web harvested resources, the data is loaded in one graph per resource.
In the case of larger data sets like DBpedia or the US Census, all triples of a given provenance share a data-set-specific graph.
All string literals are additionally indexed in a full text index. No stop words are used.
Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices.
The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist, on one hand, of well-known SQL features like aggregation with grouping, and existence and value subqueries; and on the other, of RDF-specific features. The latter include run-time RDFS and OWL inferencing support, with backward-chaining subclasses and transitivity.
sparql select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o)) where { ?s ?p ?o . filter (bif:contains (?o, "'semantic web'")) } limit 10 ;
This looks up triples with semantic web in the object and makes a search hit summary of the literal, highlighting the search terms.
sparql select ?tp count(*) where { ?s ?p2 ?o2 . ?o2 a ?tp . ?s foaf:nick ?o . filter (bif:contains (?o, "plaid_skirt")) } group by ?tp order by desc 2 limit 40 ;
This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.
What are these things called?
sparql select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 rdfs:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, "plaid_skirt")) } group by ?lbl order by desc 2 ;
Many of these things do not have an rdfs:label. Let us use a more general concept of label, which groups dc:title, foaf:name, and other name-like properties together. The subproperties are resolved at run time; there is no materialization.
sparql define input:inference 'b3s' select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 b3s:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, "plaid_skirt")) } group by ?lbl order by desc 2 ;
We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing.
sparql select ?g count(*) where { graph ?g { ?s ?p ?o . filter (bif:contains (?o, "'terrorist bombing'")) } } group by ?g order by desc 2 ;
Now some Web 2.0 tagging of search results: the tag cloud of "computer".
sparql select ?lbl count (*) where { ?s ?p ?o . ?o bif:contains "computer" . ?s sioc:topic ?tg . optional { ?tg rdfs:label ?lbl } } group by ?lbl order by desc 2 limit 40 ;
This query will find the posters who talk the most about sex.
sparql select ?auth count (*) where { ?d dc:creator ?auth . ?d ?p ?o filter (bif:contains (?o, "sex")) } group by ?auth order by desc 2 ;
We look for people who are joined by having relatively uncommon interests but do not know each other.
sparql
select ?i ?cnt ?n1 ?n2 ?p1 ?p2
where
  {
    { select ?i count (*) as ?cnt where { ?p foaf:interest ?i } group by ?i }
    filter ( ?cnt > 1 && ?cnt < 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter (?p1 != ?p2
            && !bif:exists ((select (1) where { ?p1 foaf:knows ?p2 }))
            && !bif:exists ((select (1) where { ?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  }
order by ?cnt limit 50 ;
The query takes a fairly long time, mostly spent counting the interests over 25M interest triples. It then takes people that share an interest and checks that neither claims to know the other. It then sorts the results, rarest interest first. The query can be written more efficiently, but is here just to show that database-wide scans of the population are possible ad hoc.
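The logic of the query can also be sketched procedurally. A toy Python version over invented sample data -- not how Virtuoso executes it, just the shape of the computation:

```python
from collections import defaultdict
from itertools import combinations

# Invented sample data: person -> set of foaf:interest values,
# plus directed foaf:knows claims.
interest = {
    "n1": {"rare-hobby", "music"},
    "n2": {"rare-hobby"},
    "n3": {"music"},
}
knows = {("n1", "n3")}

# Count how many people share each interest.
count = defaultdict(int)
for person, ints in interest.items():
    for i in ints:
        count[i] += 1

# Pairs joined by an uncommon interest (shared by 2..9 people) where
# neither claims to know the other; sorted rarest interest first.
pairs = []
for i, c in count.items():
    if 1 < c < 10:
        people = [p for p, ints in interest.items() if i in ints]
        for a, b in combinations(people, 2):
            if (a, b) not in knows and (b, a) not in knows:
                pairs.append((c, i, a, b))
pairs.sort()
print(pairs)
```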
Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related-tags line at the bottom of a search result page, and showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF.
create table tag_count (
    tcn_tag iri_id_8,
    tcn_count int,
    primary key (tcn_tag));
alter index tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table tag_coincidence (
    tc_t1 iri_id_8,
    tc_t2 iri_id_8,
    tc_count int,
    tc_t1_count int,
    tc_t2_count int,
    primary key (tc_t1, tc_t2));
alter index tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));
create index tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
How many times is each topic mentioned?
insert into tag_count select * from (sparql define output:valmode "LONG" select ?t count (*) as ?cnt where { ?s sioc:topic ?t } group by ?t) xx option (quietcast);
Take all t1, t2 where t1 and t2 are tags of the same subject; store only the permutation where the internal id of t1 < that of t2.
insert into tag_coincidence (tc_t1, tc_t2, tc_count) select "t1", "t2", cnt from (select "t1", "t2", count (*) as cnt from (sparql define output:valmode "LONG" select ?t1 ?t2 where { ?s sioc:topic ?t1 . ?s sioc:topic ?t2 }) tags where "t1" < "t2" group by "t1", "t2") xx where isiri_id ("t1") and isiri_id ("t2") option (quietcast);
Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster.
update tag_coincidence set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1), tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing.
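The half-matrix idea itself is simple; a small Python sketch with invented sample tags shows the logic behind the tables above:

```python
from collections import Counter
from itertools import combinations

# Invented sample data: document -> set of sioc:topic tags.
doc_tags = {
    "d1": {"rdf", "sparql", "database"},
    "d2": {"rdf", "sparql"},
    "d3": {"rdf", "web"},
}

# Per-tag occurrence counts (the tag_count table).
tag_count = Counter(t for tags in doc_tags.values() for t in tags)

# Half-matrix: keep only the (t1, t2) permutation with t1 < t2,
# mirroring the tc_t1 < tc_t2 convention of the tag_coincidence table.
co = Counter()
for tags in doc_tags.values():
    for t1, t2 in combinations(sorted(tags), 2):
        co[(t1, t2)] += 1

def related(tag, top=5):
    """Technorati-style related-tags line for one tag."""
    hits = [(c, b if a == tag else a)
            for (a, b), c in co.items() if tag in (a, b)]
    return [t for c, t in sorted(hits, reverse=True)[:top]]

print(related("rdf"))
```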
To show the URIs of the tags:
select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count from tag_coincidence order by tc_count desc;
We look at what interests people have.
sparql select ?o ?cnt where { { select ?o count (*) as ?cnt where { ?s foaf:interest ?o } group by ?o } filter (?cnt > 100) } order by desc 2 limit 100 ;
Now the same for the Harry Potter fans.
sparql select ?i2 count (*) where { ?p foaf:interest <http://www.livejournal.com/interests.bml?int=harry+potter> . ?p foaf:interest ?i2 } group by ?i2 order by desc 2 limit 20 ;
We see whether foaf:knows relations are symmetrical. We return the top n people that others claim to know without being reciprocally known.
sparql select ?celeb, count (*) where { ?claimant foaf:knows ?celeb . filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant }))) } group by ?celeb order by desc 2 limit 10 ;
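The asymmetry check amounts to counting claims that lack a reciprocal edge. A toy Python version over invented foaf:knows claims:

```python
from collections import Counter

# Invented directed foaf:knows claims.
knows = {("fan1", "celeb"), ("fan2", "celeb"), ("celeb", "fan2")}

# Count, per claimed acquaintance, the claims that are not reciprocated.
unreciprocated = Counter(
    celeb for claimant, celeb in knows if (celeb, claimant) not in knows
)
print(unreciprocated.most_common(1))
```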
We look for a well connected person to start from.
sparql select ?p count (*) where { ?p foaf:knows ?k } group by ?p order by desc 2 limit 50 ;
We look for the most connected of the many online identities of Stefan Decker.
sparql select ?sd count (distinct ?xx) where { ?sd a foaf:Person . ?sd ?name ?ns . filter (bif:contains (?ns, "'Stefan Decker'")) . ?sd foaf:knows ?xx } group by ?sd order by desc 2 ;
We count the transitive closure of Stefan Decker's connections.
sparql select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = <mailto:stefan.decker@deri.org>) } ;
Now we do the same while following owl:sameAs links.
sparql define input:same-as "yes" select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = <mailto:stefan.decker@deri.org>) } ;
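The transitive option above amounts to a breadth-first closure over foaf:knows edges. Roughly, in Python, over an invented edge list:

```python
from collections import deque

# Invented foaf:knows edge list.
knows = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "a")]

def closure(start):
    """Everyone reachable from start via foaf:knows, excluding start itself."""
    out = {}
    for s, o in knows:
        out.setdefault(s, []).append(o)
    seen, queue = set(), deque([start])
    while queue:
        for nxt in out.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen

print(len(closure("a")))  # 3: b, c, and d
```

Following owl:sameAs on top of this simply means merging the out-edge lists of all identities of each node before the traversal.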
The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. The system demonstrated hosts these 12 servers on 2 machines, each with two Xeon 5345 processors, 16GB memory, and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search-engine or fast-analytics applications. Most of the demonstrated queries run in memory on second invocation; the timing difference between first and second run is easily an order of magnitude.
To this effect, I recently summarized the group discussions and some of our own experiences around the topic at <http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling>.
I will here discuss this less formally, and more in the light of our own experience. A working group goal statement must be neutral vis-à-vis the following points, even if any working group will unavoidably encounter these issues on the way. A blog post, on the other hand, can be more specific.
I gave a talk to the RDB2RDF XG this spring, with these slides.
The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided.
At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. In practice, there is almost no comparison between the complexity of doing non-trivial mappings on-the-fly versus mapping as ETL, and some of this complexity spills over into the requirements for a mapping language.
We expect to have a situation where one virtual triple can have many possible sources. The mapping is a union of mapped databases. Any integration scenario will have this feature. In such a situation, if we are JOINing using such triples, we end up with UNIONs of all databases that could produce the triples in question. This is generally not desired. Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario.
To make the point clearer, suppose a query like "list the organizations whose representatives have published about XX." Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliations to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the organizations that have published articles with tag XX. It is a matter of common sense in this scenario that a publication will have the author and the author's affiliation in the same database. However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another: a paper in database X will usually not have an author in database Y. The IDs in database Y, even if perchance equal to the IDs in X, do not mean the same thing, and there is no point joining across databases by them.
This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping. If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint.
This is critical. Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted SQL over the same data sources.
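The disjointness inference can be sketched as a check on the IRI-minting patterns themselves. A minimal illustration, under assumptions of my own: the `{key}` placeholder syntax and the helper names are hypothetical, not any actual mapping language.

```python
# Hypothetical sketch: decide whether two IRI-minting patterns can ever
# produce the same IRI. A pattern is a format string in which "{key}"
# stands for the primary key value. If neither pattern's literal prefix
# is a prefix of the other, the generated IRI sets are disjoint.

def literal_prefix(pattern: str) -> str:
    """Text of the pattern up to the first key placeholder."""
    i = pattern.find("{key}")
    return pattern if i < 0 else pattern[:i]

def disjoint(p1: str, p2: str) -> bool:
    """True if no key values can make the two patterns mint the same IRI."""
    a, b = literal_prefix(p1), literal_prefix(p2)
    # Overlap is only possible if one prefix extends the other.
    return not (a.startswith(b) or b.startswith(a))

# A paper IRI from database X vs. one from database Y: provably disjoint,
# so a cross-database join between them can be pruned at compile time.
print(disjoint("http://x.example/pub/{key}", "http://y.example/pub/{key}"))  # True
print(disjoint("http://x.example/pub/{key}", "http://x.example/pub/{key}"))  # False
```

The check is deliberately conservative: if one prefix extends the other, it reports possible overlap rather than proving disjointness.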
SPARQL queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are literals in the query.
Virtuoso has some SQL extensions for dealing with breaking a wide table into a row per column. This facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso's SQL virtualization/federation takes care of the matter. If a mapping system goes directly to third-party SQL, no such tricks can be used.
The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be literals. The alternative is prohibitive run-time cost and complexity.
But we must not throw the baby out with the bathwater. Aside from offering global identifiers, RDF's attractions include subclasses and sub-predicates. In relational terms, these translate to UNIONs and do involve some added cost. A mapping system just has to have means of dealing with this cost, and of recognizing cases where it is prohibitive. Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings.
Whether to warehouse or not? If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Vipul Kashyap gave a position paper at last year's RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets. The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples.
Our take is that if something is a large or very large relational store, then map; else, ETL. With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations.
There will also be UNIONs when integrating sources that talk of similar things. This was a quick summary, by no means comprehensive, of what an eventual RDB2RDF working group would come across. This is a sort of addendum to the requirements I outlined on the ESW wiki.
I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language.
The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead.
It is now time to apply this principle to graph traversal.
The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.
The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step.
This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc.
Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be included in or excluded from the analyzed set at different times. This is why with Virtuoso we have tended to favor inference on demand ("backward chaining") and mapping of relational data into RDF without copying.
The SQL world has taken steps toward dealing with recursion with the WITH ... UNION construct, which allows definition of recursive views. The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node's immediate children.
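For reference, this style of recursion looks as follows in standard SQL. A sketch run on SQLite from Python (not Virtuoso syntax; the part_of table and data are invented for illustration), walking a small part-of tree from its root:

```python
import sqlite3

# Standard-SQL recursive view in the WITH ... UNION style: the subtree of
# node 1 is node 1 itself, plus, recursively, the children of anything
# already in the subtree.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE part_of (child INT, parent INT)")
con.executemany("INSERT INTO part_of VALUES (?, ?)",
                [(2, 1), (3, 1), (4, 2)])  # 1 is the root

rows = con.execute("""
    WITH RECURSIVE subtree(node) AS (
        SELECT 1                           -- the starting node
        UNION
        SELECT child FROM part_of, subtree
         WHERE parent = node               -- plus children, recursively
    )
    SELECT node FROM subtree ORDER BY node
""").fetchall()
print(rows)  # [(1,), (2,), (3,), (4,)]
```

Note how the recursion direction (root to leaf) is baked into the view text, which is exactly the problem discussed next.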
The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL's non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.
Take a question like "list the parts of products of category C which have materials that are classified as toxic." Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where "toxic" has a multilevel substructure.
Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going through the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth: regular cost-based optimization.
Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.
In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. Thus, if we address run-time recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.
We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., a SELECT in another SELECT's FROM clause, with a TRANSITIVE clause.
Consider the data:
CREATE TABLE "knows" ("p1" INT, "p2" INT, PRIMARY KEY ("p1", "p2"));
ALTER INDEX "knows" ON "knows" PARTITION ("p1" INT);
CREATE INDEX "knows2" ON "knows" ("p2", "p1") PARTITION ("p2" INT);
We represent a social network with the many-to-many relation "knows". The persons are identified by integers.
INSERT INTO "knows" VALUES (1, 2);
INSERT INTO "knows" VALUES (1, 3);
INSERT INTO "knows" VALUES (2, 4);
SELECT *
  FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT
               "p1", "p2"
          FROM "knows"
       ) "k"
 WHERE "k"."p1" = 1;
We obtain the result:
p1  p2
 1   3
 1   2
 1   4
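The TRANSITIVE derived table is a Virtuoso extension. For comparison, the same closure from subject 1 can be written with a standard recursive CTE; a sketch on SQLite from Python, producing the same rows as the result above (though in a different order):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE knows (p1 INT, p2 INT, PRIMARY KEY (p1, p2))")
con.executemany("INSERT INTO knows VALUES (?, ?)", [(1, 2), (1, 3), (2, 4)])

# Transitive closure of "knows" starting from p1 = 1: the directly known
# persons, plus whoever they transitively know.
rows = con.execute("""
    WITH RECURSIVE reach(p1, p2) AS (
        SELECT p1, p2 FROM knows WHERE p1 = 1
        UNION
        SELECT r.p1, k.p2 FROM reach r JOIN knows k ON k.p1 = r.p2
    )
    SELECT p1, p2 FROM reach ORDER BY p2
""").fetchall()
print(rows)  # [(1, 2), (1, 3), (1, 4)]
```

Unlike the TRANSITIVE clause, the CTE fixes the traversal direction in the query text; it cannot be run "backwards" from p2, which is the point of the next example.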
The operation is reversible:
SELECT *
  FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT
               "p1", "p2"
          FROM "knows"
       ) "k"
 WHERE "k"."p2" = 4;
p1  p2
 2   4
 1   4
Since now we give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 and 2 is known by 1.
To see what would happen if x knowing y also meant y knowing x, one could write:
SELECT *
  FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT
               "p1", "p2"
          FROM (SELECT "p1", "p2" FROM "knows"
                UNION ALL
                SELECT "p2", "p1" FROM "knows"
               ) "k2"
       ) "k"
 WHERE "k"."p2" = 4;
p1  p2
 2   4
 1   4
 3   4
Now, since we know that 1 and 4 are related, we can ask how they are related.
SELECT *
  FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT
               "p1", "p2",
               T_STEP (1) AS "via",
               T_STEP ('step_no') AS "step",
               T_STEP ('path_id') AS "path"
          FROM "knows"
       ) "k"
 WHERE "p1" = 1 AND "p2" = 4;
p1  p2  via  step  path
 1   4    1     0     0
 1   4    2     1     0
 1   4    4     2     0
The first two columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., p1, has number 0. Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.
For LinkedIn users, the friends-ordered-by-distance-and-descending-friend-count query, which is at the basis of most LinkedIn search result views, can be written as:
SELECT p2, dist,
       (SELECT COUNT (*) FROM "knows" "c" WHERE "c"."p1" = "k"."p2")
  FROM (SELECT TRANSITIVE t_in (1) t_out (2) t_distinct
               "p1", "p2",
               t_step ('step_no') AS "dist"
          FROM "knows"
       ) "k"
 WHERE "p1" = 1
 ORDER BY "dist", 3 DESC;
p2  dist  aggregate
 2     1          1
 3     1          0
 4     2          0
The queries shown above work on Virtuoso v6. When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.
Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.
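The meet-in-the-middle idea can be sketched outside the database as well. A toy illustration in Python (the function name and the symmetric treatment of "knows" are my assumptions, not Virtuoso internals): expand a frontier from each end and stop at the first intersection.

```python
from collections import defaultdict

# Bidirectional breadth-first search: with branching factor b and path
# length d, the two frontiers touch on the order of 2 * b**(d//2) nodes
# instead of b**d (2 * 30**3 = 54,000 vs 30**6 = 729,000,000).

def bidirectional_distance(pairs, start, goal, max_depth=6):
    """Length of the shortest start-goal path, or None if over max_depth."""
    adj = defaultdict(set)
    for a, b in pairs:                    # treat "knows" as symmetric here
        adj[a].add(b)
        adj[b].add(a)
    fwd, bwd, dist = {start}, {goal}, 0
    while dist < max_depth:
        if fwd & bwd:                     # frontiers met: path found
            return dist
        if len(fwd) <= len(bwd):          # grow the smaller side one level
            fwd |= {y for x in fwd for y in adj[x]}
        else:
            bwd |= {y for x in bwd for y in adj[x]}
        dist += 1
    return dist if fwd & bwd else None

# On the sample "knows" data: 1 reaches 4 via 2, a path of length 2.
print(bidirectional_distance([(1, 2), (1, 3), (2, 4)], 1, 4))  # 2
```

Growing the smaller frontier at each step is what keeps the node count near 2 * b^(d/2) even when the two ends fan out unevenly.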
Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS.
In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.
We could call this query-driven crawling. The idea is intuitive: what one looks for determines what one finds.
This does however raise certain questions pertaining to the nature and ultimate possibility of knowledge, i.e., epistemology.
The process of querying could be said to go from the few to the many, just like the process of harvesting data from the web, the way any search engine does. One follows links or makes joins and thereby increases one's reach.
The difference is that a query has no a priori direction. If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all. Closed world, as it is said. Never mind that the friends would have had a "see also" link to a retrievable document that did have a phone number.
The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution. What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way. Where query and crawl appeared to have a similarity, in fact they have two opposite goals.
The user generally has no idea of the execution plan. In the general case, the user cannot have an idea of this plan. There are valid, over 40 year old reasons for leaving the query planning to the database. In exceptional situations the user can read or direct these, but this is really quite tedious and requires understanding that is basically never present.
So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything? This is certainly a desirable goal, and all in the open world, distributed spirit of the web.
Let us limit ourselves to queries that have some literals in the object or subject positions. A SPARQL query is basically a graph. Its vertices are variables and literals, and its edges are triple patterns. An edge is labeled by a predicate. For now, we will consider the predicate to always be a literal. From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal. Each tree is not always a spanning tree of the graph, but all the trees collectively span the graph.
Consider the query
{ <john> knows ?x . <mary> knows ?x . ?x label ?l }.
The starting points are the literals john and mary. The john tree has one child, ?x, which has the children mary and ?l. One could notate it as
{ <john> knows ?x . {{ <mary> knows ?x} UNION {?x label ?l}}}
That is, the head first, and if it has more than one child, a union listing them, recursively.
If one composed such queries for each literal in the original pattern and evaluated each as a breadth first walk of the tree, no query optimization tricks, and for each binding of each variable, recorded whether there was something to dereference, one would in a finite time have reached all the directly reachable data. Then one could evaluate the original query, using whatever plan was preferred.
The check for dereferenceable data applied to each IRI-valued binding formed in the above evaluation, would consist of looking for "see also", "same as", and other such properties of the IRI. It could also consult text based search engines. Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources. We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough.
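The per-literal walk can be sketched as follows; a toy illustration over the john/mary query above, covering only the pattern-graph traversal (no actual dereferencing, and the function name is mine):

```python
from collections import deque

# From each literal in the query graph, visit terms breadth first so that
# every variable is reached from some literal; each IRI bound along the
# way could then be checked for dereferenceable data ("see also",
# "same as", and so on).

patterns = [("john", "knows", "?x"),
            ("mary", "knows", "?x"),
            ("?x", "label", "?l")]

def tree_from(literal):
    """Breadth-first visitation order of query terms, rooted at a literal."""
    order, seen, queue = [], {literal}, deque([literal])
    while queue:
        node = queue.popleft()
        order.append(node)
        for s, _, o in patterns:          # follow every edge touching node
            if s == node:
                nxt = o
            elif o == node:
                nxt = s
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

# john -> ?x -> {mary, ?l}: the tree described in the text.
print(tree_from("john"))  # ['john', '?x', 'mary', '?l']
```

Running tree_from for each literal yields the set of trees that collectively span the query graph, which is exactly the property the rewrite relies on.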
We have here shown a way of transforming SPARQL queries in such a way as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans.
The present Sponger does not work exactly in this manner but it will be developed in this direction. Fortunately, the algorithms outlined above are nothing complicated.
I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.
I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.
I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.
The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit FROM clause added; the client was the command line Interactive SQL (iSQL).
If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the SPARQL protocol is not practical.
I will say something more about SP2B when I get to have a closer look.
For 8G RAM, in the [Parameters] stanza of virtuoso.ini, set:

[Parameters]
...
NumberOfBuffers = 550000
For 16G RAM, double this:

[Parameters]
...
NumberOfBuffers = 1100000
For most cases, certainly all RDF cases, Read Committed should be the default transaction isolation. In the [Parameters] stanza of virtuoso.ini, set:

[Parameters]
...
DefaultIsolation = 2
If ODBC, JDBC, or similarly connected client applications are used, there must be more ServerThreads available than there will be client connections. In the [Parameters] stanza of virtuoso.ini, set:

[Parameters]
...
ServerThreads = 100
With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer ServerThreads than there are concurrent clients. MaxKeepAlives should be the maximum number of expected web clients; this can be more than the ServerThreads count. In the [HTTPServer] stanza of virtuoso.ini, set:

[HTTPServer]
...
ServerThreads = 100
MaxKeepAlives = 1000
KeepAliveTimeout = 10
Note: The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, should not exceed the licensed thread count.
The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed).
For the above described example, in the [Database] stanza of virtuoso.ini, set:

[Database]
...
Striping = 1
MaxCheckpointRemap = 2000000
... and in the [Striping] stanza, with one SegmentName per line, set:

[Striping]
...
Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6
As can be seen here, each file gets a background IO thread (the = qxxx clause). It should be noted that all files on the same physical device should have the same qxxx value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.
If queries have lots of joins but access little data, as with the Berlin SPARQL Benchmark, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. Thus, in the [Parameters] stanza of virtuoso.ini, set:

[Parameters]
...
StopCompilerWhenXOverRunTime = 1
Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts.
The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008.
We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case.
Warehousing all the world's publishable data as RDF is not our first preference, nor would it be the publisher's. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we'll do here.
First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit.
But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. And when there is a clear mapping, there is no reason the SPARQL could not be directly translated.
If Chris Bizer et al. launched the mapping ship, we will be the ones to pilot it to harbor!
We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in "query mixes per hour". (An update or follow-on to this post will provide elapsed times for each test run.)
With the unmodified benchmark we got:
Physical Triples:  1297 qmph
Mapped Triples:    3144 qmph
In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)
The following were measured on the second run of a 100 query mix series, single test driver, warm cache.
Physical Triples:  5746 qmph
Mapped Triples:    7525 qmph
We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time.
Physical Triples:  19459 qmph
Mapped Triples:    24531 qmph
The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization.
The query compilation represents about 1/3 of total server-side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough.
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.
Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC.
Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.
Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let's use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like "customers who bought this also bought xxx."
For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC.
Add transactions. Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry, and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server.
The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.
We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.
So I spent a couple of days on the SPARQL/SQL compiler, to the effect of making it do a better guess of initial execution plan and streamlining some operations. In fact, many of the queries in BSBM are not particularly sensitive to execution plan, as they access a very small portion of the database. So to close the matter, I put in a flag that makes the SQL compiler give up on devising new plans if the time of the best plan so far is less than the time spent compiling so far.
With these changes, available now as a diff on top of 5.0.7, we run quite well, several times better than initially. With the compiler time cut-off in place (ini parameter StopCompilerWhenXOverRunTime = 1), we get the following times, output from the BSBM test driver:
Starting test...
0: 1031.22 ms, total: 1151 ms
1: 982.89 ms, total: 1040 ms
2: 923.27 ms, total: 968 ms
3: 898.37 ms, total: 932 ms
4: 855.70 ms, total: 865 ms

Scale factor:              10000
Number of query mix runs:  5 times
min/max Query mix runtime: 0.8557 s / 1.0312 s
Total runtime:             4.691 seconds
QMpH:                      3836.77 query mixes per hour
CQET:                      0.93829 seconds average runtime of query mix
CQET (geom.):              0.93625 seconds geometric mean runtime of query mix

Metrics for Query 1:
  Count:          5 times executed in whole run
  AQET:           0.012212 seconds (arithmetic mean)
  AQET (geom.):   0.009934 seconds (geometric mean)
  QPS:            81.89 Queries per second
  minQET/maxQET:  0.00684000s / 0.03115700s
  Average result count: 7.0
  min/max result count: 3 / 10

Metrics for Query 2:
  Count:          35 times executed in whole run
  AQET:           0.030490 seconds (arithmetic mean)
  AQET (geom.):   0.029776 seconds (geometric mean)
  QPS:            32.80 Queries per second
  minQET/maxQET:  0.02467300s / 0.06753000s
  Average result count: 22.5
  min/max result count: 15 / 30

Metrics for Query 3:
  Count:          5 times executed in whole run
  AQET:           0.006947 seconds (arithmetic mean)
  AQET (geom.):   0.006905 seconds (geometric mean)
  QPS:            143.95 Queries per second
  minQET/maxQET:  0.00580000s / 0.00795100s
  Average result count: 4.0
  min/max result count: 0 / 10

Metrics for Query 4:
  Count:          5 times executed in whole run
  AQET:           0.008858 seconds (arithmetic mean)
  AQET (geom.):   0.008829 seconds (geometric mean)
  QPS:            112.89 Queries per second
  minQET/maxQET:  0.00804400s / 0.01019500s
  Average result count: 3.4
  min/max result count: 0 / 10

Metrics for Query 5:
  Count:          5 times executed in whole run
  AQET:           0.087542 seconds (arithmetic mean)
  AQET (geom.):   0.087327 seconds (geometric mean)
  QPS:            11.42 Queries per second
  minQET/maxQET:  0.08165600s / 0.09889200s
  Average result count: 5.0
  min/max result count: 5 / 5

Metrics for Query 6:
  Count:          5 times executed in whole run
  AQET:           0.131222 seconds (arithmetic mean)
  AQET (geom.):   0.131216 seconds (geometric mean)
  QPS:            7.62 Queries per second
  minQET/maxQET:  0.12924200s / 0.13298200s
  Average result count: 3.6
  min/max result count: 3 / 5

Metrics for Query 7:
  Count:          20 times executed in whole run
  AQET:           0.043601 seconds (arithmetic mean)
  AQET (geom.):   0.040890 seconds (geometric mean)
  QPS:            22.94 Queries per second
  minQET/maxQET:  0.01984400s / 0.06012600s
  Average result count: 26.4
  min/max result count: 5 / 96

Metrics for Query 8:
  Count:          10 times executed in whole run
  AQET:           0.018168 seconds (arithmetic mean)
  AQET (geom.):   0.016205 seconds (geometric mean)
  QPS:            55.04 Queries per second
  minQET/maxQET:  0.01097600s / 0.05066900s
  Average result count: 12.8
  min/max result count: 6 / 20

Metrics for Query 9:
  Count:          20 times executed in whole run
  AQET:           0.043813 seconds (arithmetic mean)
  AQET (geom.):   0.043807 seconds (geometric mean)
  QPS:            22.82 Queries per second
  minQET/maxQET:  0.04274900s / 0.04504100s
  Average result count: 0.0
  min/max result count: 0 / 0

Metrics for Query 10:
  Count:          15 times executed in whole run
  AQET:           0.030697 seconds (arithmetic mean)
  AQET (geom.):   0.029651 seconds (geometric mean)
  QPS:            32.58 Queries per second
  minQET/maxQET:  0.02072000s / 0.03975700s
  Average result count: 1.1
  min/max result count: 0 / 4

real  0m5.485s
user  0m2.233s
sys   0m0.170s
Of the approximately 5.5 seconds of running five query mixes, the test driver spends 2.2 s. The server side processing time is 3.1 s, of which SQL compilation is 1.35 s. The rest is miscellaneous system time. The measurement is on 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.
We note that this type of workload would be done with stored procedures or prepared, parameterized queries in the SQL world.
There will be some further tuning still but this addresses the bulk of the matter. There will be a separate message about the patch containing these improvements.
The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case.
Any diagram containing the boxes of the layer cake (i.e., URI, XML, SPARQL, OWL, RIF, Crypto, etc., etc.) in whatever order or arrangement can at best be a sort of overview of how these standards reference each other.
Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixes with the tension and compression resistance properties of metals and composites for producing motion, and then links to Newton's laws of motion and to aerodynamics.
Not false. But it does not say that a car is good for economical commute or showing off at the strip or any number of niches that a mature industry has grown to serve.
Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface.
Such a thing cannot possibly be inferred from the standards' inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low data volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts.
We hear it over and over. Scale and speed are critical. Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure.
The even more foundational issue is the promotion of the general concept of a Web of Data.
The core idea that the Web would be a query-able collection of data with meaningful reference between data of different provenance cannot be inferred from the picture, even though this should be its primary message. Or it is better to say that the first picture shown should stress this idea and then one could leave the layer cake, in whatever version, for explaining the standards' order of evolution or inter-reference.
So, the value proposition:
Why? Explosion of data volume, increased need of keeping up-to-date, increasing opportunity cost of not keeping in real time.
What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale.
How? URI everything and everything is cool; or, give things global names. Use RDF. Reuse names or ontologies where one can. (An ontology is a set of classes and property names, plus some more.) Map relational data on the fly or store as RDF, whichever works. Query with SPARQL, easier than SQL.
So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.
voiD, Vocabulary of Interlinked Datasets, is an idea whose time has clearly come. By the end of the conference, many speakers had already adopted the meme.
The point is to describe what is inside the data sets. People may know this from having worked with the sets or from putting them together but to an outsider this is not evident.
The Semantic Sitemap says where there are files or end points for access. But it does not say what is inside these. Also for federation, it is important to be able to determine whether it makes sense to send a particular query to a particular end point.
If we play this right, this is what voiD will provide. I have to think of Dan Simmons' flamboyant Hyperion sci-fi series where the "void which binds" was a sort of hyperspace containing the thoughts of entities, past and present and even provided teleportation.
So what does the voiD hold, aside from infinite potentialities?
The obvious part is DC-like provenance, version, authorship, license, and other data-set-wide information. Also, the subject matter could be classified by reference to UMBEL or the Yago classification of DBpedia.
More is needed, though. The simple part is listing the ontologies, if any. Listing the namespaces used would also be an idea, but that set could be very large.
So let us look at what we'd like to be able to answer with the voiD set.
The below could be a sample of voiD questions:
What subjects are in the LOD cloud?
Given this URI, what set in the LOD cloud can tell me more? This is divided into asking a text index like Sindice for the location, getting the namespace or data set and then querying voiD.
What need I federate/load in order to combine all that is reachable from a given vocabulary? There could be for example a graph showing the data sets and edges between them, edges being qualified by a set of same as assertions, itself a voiD described set, if translations were needed.
What sets are from the same or equally trusted publisher as this one?
These things are roughly divided into description of the set and then some details on how it is stored on a given end point.
Given this set, in which other sets will I find use of the same URIs? For example, if I have language version x, I wish to know that language version y will have the same URIs insofar as the things meant are the same.
Given this set, which sets of same as assertions will I have for mapping to which other sets? For example, if I have Geonames, I wish to know that set x will map at least some of the URIs in Geonames to DBpedia URIs.
Let me further point out that it is increasingly clear to the community that universal sameAs is dubious, hence sameAs assertions ought to be kept separate and included or excluded depending on the usage context.
Given this set, what are the interesting queries I can do? This is a sort of advertisement for human consumption. This is not a list of queries for crashing the end point. Denial of service can be done in SPARQL without knowing the end point content anyhow, so this is not an added risk exposure.
Vocabularies used. This is a reference to the OWL or RDFS resources giving the applicable ontologies, if present. Also, a complete list of classes whose direct instances actually occur in the set is useful.
Ballpark cardinality. Something like a DARQ optimization profile would be a good idea. I would say that there should be a possibility of just including a DARQ description file as is. This is a sort of baseline, and since it already exists, we are spared the committee trouble of figuring out what it ought to contain and what not. If we start defining this from scratch, it will take a long time. Further, let this be optional. Quite independently of this, query processors may make optimization-related queries to remote end points insofar as the specific end point supports these. This will come in time. For now, just the basics.
Along with this, LOD SPARQL end points could adopt a couple of basic conventions. The simplest would be to agree that each would host a graph with a given URI that would contain the voiD descriptions of the data sets contained, along with the graph URI used for each set, if different from the publisher's URI for the graph. There is a point to this since an end point may load multiple data sets into one graph.
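As a sketch of what using such a convention could look like, the query below asks the agreed-upon graph for the sets an end point hosts. The graph URI, the void: namespace, and the property names are hypothetical placeholders, since the vocabulary is still being defined:

```sparql
PREFIX void: <http://rdfs.org/ns/void#>    # hypothetical namespace URI
PREFIX dc:   <http://purl.org/dc/elements/1.1/>

SELECT ?dataset ?title ?endpoint
FROM <http://example.org/void>             # hypothetical well-known graph URI
WHERE {
  ?dataset a void:Dataset ;
           dc:title ?title ;
           void:sparqlEndpoint ?endpoint . # hypothetical property
}
```

A federation engine could run this once per end point and cache the answers when deciding where to send query fragments.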
We hope to have a good idea of the matter in a couple of weeks, certainly a general statement of direction to be published at Linked Data Planet in a couple of weeks.
For the data web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in cyberspace.
To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.
DARQ is Bastian Quilitz's federated extension of the Jena ARQ SPARQL processor. It has existed for a while and was also presented at ESWC2008. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.
Bastian had split DBpedia among five Virtuoso servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.
Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.
Since we are convinced of the cause, let's talk about the means.
For DARQ as it now stands, there's probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one HTTP/1.1 message. So, if the query is "get me my friends living in cities of over a million people," there will be the fragment "get city where x lives" and later "ask if population of x greater than 1000000". If I have 100 friends, I send the 100 requests in a batch to each eligible server.
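As a sketch of the batching idea: instead of one round trip per binding, the outer loop's bindings can be folded into a single query. The resource and property names below are made up for illustration:

```sparql
PREFIX ex: <http://example.org/ns#>

# One request carrying all candidate cities at once,
# replacing a separate ASK per city:
SELECT ?city WHERE {
  ?city ex:population ?pop .
  FILTER (?pop > 1000000 &&
          (?city = <http://example.org/city/A> ||
           ?city = <http://example.org/city/B> ||
           ?city = <http://example.org/city/C>))
}
```

The remote end point then answers with the cities that qualify, in one message instead of a hundred.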
Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over JDBC so Bastian can try this if interested.
These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.
When federating SQL, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:
If a foaf:Person is found on a given server, this does not mean that the Person's geek code or email hash will be on the same server. Thus {?p name "Johnny" . ?p geekCode ?g . ?p emailHash ?h }
does not necessarily denote a colocated join if many servers serve items of the vocabulary.
However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically complete, recall. For search-style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately, for the unlikely case that there were a server that gave geek codes but not names.
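When the colocation assumption is dropped, the pattern decomposes into per-server fragments with bindings shipped between them; a sketch, with made-up names and a made-up example binding:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/ns#>

# Fragment 1, sent to any server advertising foaf:name:
SELECT ?p WHERE { ?p foaf:name "Johnny" }

# Fragment 2, sent to each server advertising ex:geekCode,
# once per ?p binding obtained from fragment 1:
SELECT ?g WHERE { <http://example.org/person/johnny> ex:geekCode ?g }
```

The cost of this decomposition, multiplied over every join step, is exactly the message passing the colocation assumption avoids.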
For Virtuoso, we find that a federated query's asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.
For description, we would take DARQ's description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.
We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We'll see.
Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso Sponger. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.
The emergence of voiD (Vocabulary of Interlinked Data) is a step in the direction of making federation a reality. There is a separate post about this.
From participating in the group's first few sessions, I get the following impressions.
There is a segment of users, for example from the biomedical community, who do heavy duty data integration and look to RDF for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest.
There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as Triplify and semantic web search such as Sindice. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all.
(Some cleanup is taking place with Yago and Zitgist, but this is a matter for a different post.)
Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web.
As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.
Where do we stand?
We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source RDBMS(s) on demand. Projects in this direction are ongoing.
We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.
These things are somewhat different.
At present, both the research warehousers and the web scalers predominantly go for ETL.
This is fine by us as we definitely are in the large RDF store race.
Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in SPARQL, once the obviously necessary extensions of sub-query, expressions and aggregation are in place.
To cite an example, the Ordnance Survey of the UK has a GIS system running on Oracle with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. "Such a big server farm that we'd have to put it on our map," as Jenny Harding put it at ESWC2008. I'd add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets.
So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query.
You can download the slide set I used for the Virtuoso presentation for the RDB to RDF mapping incubator group (PPT; other formats coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the information architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble and non-trivial question.
We will show the basics of this with the TPC-H mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings from keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (PDF; other formats coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing SQL that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an IN sub-query instead of the equivalent existence test.
Follow this blog for more on the topic; published papers are always a limited view on the matter.
We were invited at the last minute to give a Linked Open Data talk at Paolo Bouquet's Identity and Reference workshop. We also had a demo of SPARQL BI (PPT; other formats coming soon), our business intelligence extensions to SPARQL, as well as joining between relational data mapped to RDF and native RDF data. I was also speaking at the social networks panel chaired by Harry Halpin.
I have gathered a few impressions that I will share in the next few posts (1 - RDF Mapping, 2 - DARQ, 3 - voiD, 4 - Paradigmata). Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.
Linked Open Data was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for LOD this time around is voiD, the Vocabulary of Interlinked Datasets, a means to describe what in fact is inside the LOD cloud, how to join it with what and so forth. Big time important if there is to be a web of federatable data sources, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of Linked Data Planet in a couple of weeks.
Bastian Quilitz gave a talk about his DARQ, a federated version of Jena's ARQ.
Something like DARQ's optimization statistics should make their way into the SPARQL protocol as well as the voiD data set description.
We really need federation but more on this in a separate post.
Axel Polleres et al. had a paper about XSPARQL, a merge of XQuery and SPARQL. While visiting DERI a couple of weeks back, and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the SPARQL protocol for this to be practical. We could do this. We'll have to see when.
Politically, using XQuery to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.
The social web panel presented the question of whether the sem web was ready for prime time with data portability.
The main thrust was expressed in Harry Halpin's rousing closing words: "Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our semantic web, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now."
So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as OAuth and OpenID. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the "outdated pace" of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything.
I said that triples and vocabularies existed, that OpenLink had ODS (OpenLink Data Spaces, Community LinkedData) for managing one's data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today's prices, without replication. Count 16G RAM and a few cores per Gtriple so that one is not waiting for disk all the time.
The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.
So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!
As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, information overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.
For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by nominibus suis, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.
For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.
Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an open world. For an app-specific warehouse, sameAs can be resolved at load time.
There was naturally some apparent tension between the Occam camp of entity name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.
The semantic search workshop was interesting, especially CYC's presentation. CYC is, as it were, the grand old man of knowledge representation. Over the long term, I would like to see support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge knowledge base for helping in search-type queries. If it is for transactions or financial reporting, then queries will be SQL and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We'll see.
I will also have a closer look at natural language one of these days, quite inevitably, since Zitgist (for example) is into entity disambiguation.
Garlik gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.
Garlik probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and Sindice/Zitgist as triples.
There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.
For the Virtuoso roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At Linked Data Planet, we will make more announcements.
Right now we are working on hosting the billion triples challenge data set on Amazon EC2 using Virtuoso Cluster. This will be the first publicly available instance of Virtuoso Cluster and all interested may then instantiate their own copy on the EC2 infrastructure.
Look for a separate announcement in the near future.
Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box.
We ran the queries in two variants:

1. specifying the graph with a FROM clause, and
2. without a FROM clause, using the default indices.

The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times.
First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient.
So we added the index:
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
|      | Virtuoso v5 with gspo, ogps, pogs | Virtuoso Cluster v6 with gspo, ogps | Virtuoso Cluster v6 with gspo, ogps, pogs |
| cold | 210 s                             | 136 s                               | 33.4 s                                    |
| warm | 0.600 s                           | 4.01 s                              | 0.628 s                                   |
OK, so now let us do it without a graph being specified. For all platforms, we drop any existing indices, and --
create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g));
alter index R2 on R2 partition (s int (0hexffff00));
log_enable (2);
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;
drop table rdf_quad;
alter table r2 rename RDF_QUAD;
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient.
With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia.
If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.
|      | Virtuoso v5 with gspo, ogps, pogs | Virtuoso Cluster v6 with spog, pogs, opgs, gpos |
| warm | 0.595 s                           | 0.617 s                                         |
The cold times were about the same as above, so not reproduced.
It is in the SPARQL spirit to specify a graph, and for pretty much any application there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.
On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant different setups of the database, simple as that.
The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should do it maybe with some configuring, as shown above.
Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some "super-graph" object, a graph of all graphs. I don't think this is directly supported and besides most apps would not need it.
Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given.
The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat.
Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content.
This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long.
The moral of the story is to have a lot of RAM and a space-efficient data representation.
The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic.
Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.
For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM <http://dbpedia.org> from each query. The returned row counts are indicated below each query's text.
sparql SELECT ?p ?o FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };
-- 1337 rows
sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM <http://dbpedia.org> WHERE {
?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
?film1 p:starring ?actor1 .
?film2 p:starring ?actor1 .
?film2 p:starring ?actor2 . };
-- 23910 rows
sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director FROM <http://dbpedia.org>
WHERE {
?artwork p:artist ?artist .
?artwork p:museum ?museum .
?museum p:director ?director };
-- 303 rows
sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
<http://dbpedia.org/resource/Berlin> geo:long ?berlinLong .
?s geo:lat ?lat .
?s geo:long ?long .
?s foaf:homepage ?homepage .
FILTER (
?lat <= ?berlinLat + 0.03190235436 &&
?long >= ?berlinLong - 0.08679199218 &&
?lat >= ?berlinLat - 0.03190235436 &&
?long <= ?berlinLong + 0.08679199218) };
-- 56 rows
sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
<http://dbpedia.org/resource/New_York_City> geo:long ?nyLong .
?s geo:lat ?lat .
?s geo:long ?long .
?s p:architect ?a .
?a foaf:homepage ?homepage .
FILTER (
?lat <= ?nyLat + 0.3190235436 &&
?long >= ?nyLong - 0.8679199218 &&
?lat >= ?nyLat - 0.3190235436 &&
?long <= ?nyLong + 0.8679199218) };
-- 13 rows
The session drew a good crowd, including many users and developers. The wishes were largely as expected, with a few new ones added. Many of the wishes already had diverse implementations, however most often without interop. I will below give some comments on the main issues discussed.
SPARQL Update - This is likely the most universally agreed upon extension. Implementations exist, largely along the lines of Andy Seaborne's SPARUL spec, which is also likely material for a W3C member submission. The issue is without much controversy; transactions fall outside the scope, which is reasonable enough. With triple stores, we can define things as combinations of inserts and deletes, and isolation we just leave aside. If anything, operating on a transactional platform such as Virtuoso, one wishes to disable transactions for any operations such as bulk loads and long-running inserts and deletes. Transactionality has pretty much no overhead for a few hundred rows, but for a few hundred million rows the cost of locking and rollback is prohibitive. With Virtuoso, we have a row auto-commit mode which we recommend for use with RDF: It commits by itself now and then, optionally keeping a roll forward log, and is transactional enough not to leave half triples around, i.e., inserted in one index but not another.
As far as we are concerned, updating physical triples along the SPARUL lines is pretty much a done deal.
The matter of updating relational data mapped to RDF is a whole other kettle of fish. On this, I should say that RDF has no special virtues for expressing transactions but rather has a special genius for integration. Updating is best left to web service interfaces that use SQL on the inside. Anyway, updating union views, which most mappings will be, is complicated. Besides, for transactions, one usually knows exactly what one wishes to update.
Full Text - Many people expressed a desire for full text access. Here we run into a deplorable confusion with regexps. The closest SPARQL has to full text in its native form is regexps, but these are not really mappable to full text except in rare special cases and I would despair of explaining to an end user what exactly these cases are. So, in principle, some regexps are equivalent to full text but in practice I find it much preferable to keep these entirely separate.
It was noted that what the users want is a text box for search words. This is a front end to the CONTAINS predicate of most SQL implementations. Ours is MS SQL Server compatible and has a SPARQL version called bif:contains. One must still declare which triples one wants indexed for full text, though. This admin overhead seems inevitable, as text indexing is a large overhead and not needed by all applications.
Also, text hits are not boolean; usually they come with a hit score. Thus, a SPARQL extension for this could look like
select * where { ?thing has_description ?d . ?d ftcontains "gizmo" ftand "widget" score ?score . }
This would return all the subjects, descriptions, and scores, from subjects with a has_description property containing widget and gizmo. Extending the basic pattern is better than having the match in a filter, since the match binds a variable.
The XQuery/XPath groups have recently come up with a full-text spec, so I used their style of syntax above. We already have a full-text extension, as do some others, but for standardization it is probably most appropriate to take the XQuery work as a basis. The XQuery full-text spec is quite complex, but I would expect most uses to get by with a small subset, and its structure seems, at first glance, better thought out than the more ad-hoc implementations in diverse SQLs.
Again, declaring any text index to support the search, as well as its timeliness or transactionality, are best left to implementations.
Federation - This is a tricky matter. ARQ has a SPARQL extension for sending a nested set of triple patterns to a specific end-point. The DARQ project has something more, including a selectivity model for SPARQL.
With federated SQL, life is simpler since after the views are expanded, we have a query where each table is at a known server and has more or less known statistics. Generally, execution plans where as much work as possible is pushed to the remote servers are preferred, and modeling the latencies is not overly hard. With SPARQL, each triple pattern could in principle come from any of the federated servers. Associating a specific end-point to a fragment of the query just passes the problem to the user. It is my guess that this is the best we can do without getting very elaborate, and possibly buggy, end-point content descriptions for routing federated queries.
Having said this, there remains the problem of join order. I suggested that we enhance the protocol by allowing asking an end-point for the query cost for a given SPARQL query. Since they all must have a cost model for optimization, this should not be an impossible request. A time cost and estimated cardinality would be enough. Making statistics available à la DARQ was also discussed. Being able to declare cardinalities expected of a remote end-point is probably necessary anyway, since not all will implement the cost model interface. For standardization, agreeing on what is a proper description of content and cardinality and how fine grained this must be will be so difficult that I would not wait for it. A cost model interface would nicely hide this within the end-point itself.
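To make the cost-model idea concrete, here is a minimal Python sketch of how a federating optimizer might use per-fragment estimates reported by end-points to pick a join order. Everything here — the fragment strings, the numbers, and the greedy strategy — is invented for illustration; a real optimizer would also weigh bound variables and the time-cost component.

```python
def greedy_join_order(fragments, estimate):
    """Order fragments for execution using end-point estimates.

    fragments: list of opaque query fragments
    estimate:  fragment -> (time_cost, estimated_cardinality)
    """
    remaining = list(fragments)
    order = []
    while remaining:
        # Greedy: always take the fragment with the smallest estimated
        # cardinality next; fewer intermediate bindings means less work
        # shipped between sites.
        best = min(remaining, key=lambda f: estimate(f)[1])
        order.append(best)
        remaining.remove(best)
    return order

# Toy (time, cardinality) estimates, as if reported by three end-points.
costs = {
    "?s geo:lat ?lat":     (0.2, 40),
    "?s p:architect ?a":   (0.4, 120),
    "?a foaf:homepage ?h": (0.1, 5000),
}
order = greedy_join_order(list(costs), lambda f: costs[f])
# order starts with the most selective fragment
```

The point of the sketch is only that a (time, cardinality) pair per fragment is already enough raw material for join ordering; the end-point's own cost model hides the content-description problem.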
With Virtuoso, we do not have a federated SPARQL scheme but we could have the ARQ-like service construct. We'd use our own cost model with explicit declarations of cardinalities of the remote data for guessing a join order. Still, this is a bit of work. We'll see.
For practicality, the service construct coupled with join order hints is the best short term bet. Making this pretty enough for standardization is not self-evident, as it requires end-point description and/or cost model hooks for things to stay declarative.
End-point description - This question has been around for a while; I have blogged about it earlier, but we are not really at a point where there would be even rough consensus about an end-point ontology. We should probably do something on our own to demonstrate some application of this, as we host lots of linked open data sets.
SQL equivalence - There were many requests for aggregation, some for subqueries and nesting, expressions in select, negation, existence and so on. I would call these all SQL equivalence. One use case was taking all the teams in the database and for all with over 5 members, add the big_team class and a property for member count.
With Virtuoso, we could write this as --
construct { ?team a big_team . ?team member_count ?ct } from ... where { ?team a team . { select ?team2 count(*) as ?ct where { ?m member_of ?team2 } . filter (?team = ?team2 and ?ct > 5) } }
We have pretty much all the SQL equivalence features, as we have been working for some time at translating the TPC-H workload into SPARQL.
The usefulness of these things is uncontested but standardization could be hard as there are subtle questions about variable scope and the like.
Inference - The SPARQL spec does not deal with transitivity or such matters because it is assumed that these are handled by an underlying inference layer. This is however most often not so. There was interest in more fine grained control of inference, for example declaring that just one property in a query would be transitive or that subclasses should be taken into account in only one triple pattern. As far as I am concerned, this is very reasonable, and we even offer extensions for this sort of thing in Virtuoso's SPARQL. This however only makes sense if the inference is done at query time and pattern by pattern. For instance, if forward chaining is used, this no longer makes sense. Specifying that some forward chaining ought to be done at query time is impractical, as the operation can be very large and time consuming and it is the DBA's task to determine what should be stored and for how long, how changes should be propagated, and so on. All these are application dependent and standardizing will be difficult.
Support for RDF features like lists and bags would all fall into the functions an underlying inference layer should perform. These things are of special interest when querying OWL models, for example.
Path expressions - Path expressions were requested by a few people. We have implemented some, as in
?product+>has_supplier+>s_name = "Gizmos, Inc.".
This means that one supplier of the product has the name "Gizmos, Inc.". This is a nice shorthand, but we run into problems if we start supporting repetitive steps, optional steps, and the like.

In conclusion, update, full text, and basic counting and grouping would seem straightforward at this point. Nesting queries, value subqueries, views, and the like should not be too hard if an agreement is reached on scope rules. Inference and federation will probably need more experimentation, but a lot can be had already with very simple fine-grained control of backward chaining, if such applies, or with explicit end-point references and explicit join order. These are practical but not pretty enough for committee consensus, would be my guess. Anyway, it will be a few months before anything formal will happen.
]]>Tim Berners-Lee showed a read-write version of Tabulator. This raises the question of updating on the Data Web. The consensus was that one could assert what one wanted in one's own space but that others' spaces would be read-only. What spaces one considered relevant would be the user's or developer's business, as in the document web.
It seems to me that a significant use case of LOD is an open-web situation where the user picks a broad read-only "data wallpaper" or backdrop of assertions, and then uses this combined with a much smaller, local, writable data set. This is certainly the case when editing data for publishing, as in Tim's demo. This will also be the case when developing mesh-ups combining multiple distinct data sets bound together by sets of SameAs assertions, for example. Questions like, "What is the minimum subset of n data sets needed for deriving the result?" will be common. This will also be the case in applications using proprietary data combined with open data.
This means that databases will have to deal with queries that specify large lists of included graphs, all graphs in the store or all graphs with an exclusion list. All this is quite possible but again should be considered when architecting systems for an open linked data web.
"There is data but what can we really do with it? How far can we trust it, and what can we confidently decide based on it?"
As an answer to this question, Zitgist has compiled the UMBEL taxonomy using SKOS. This draws on Wikipedia, Open CYC, Wordnet, and YAGO, hence the acronym WOWY. UMBEL is both a taxonomy and a set of instance data, containing a large set of named entities, including persons, organizations, geopolitical entities, and so forth. By extracting references to this set of named entities from documents and correlating this to the taxonomy, one gets a good idea of what a document (or part thereof) is about.
Kingsley presented this in the Zitgist demo. This is our answer to the criticism about DBpedia having errors in classification. DBpedia, as a bootstrap stage, is about giving names to all things. Subsequent efforts like UMBEL are about refining the relationships.
"Should there be a global URI dictionary?"
There was a talk by Paolo Bouquet about an Entity Name System, a sort of data DNS, with the purpose of associating some description and rough classification to URIs. This would allow discovering URIs for reuse. I'd say that this is good if it can cut down on the SameAs proliferation and if it can be widely distributed and replicated for resilience, à la DNS. On the other hand, it was pointed out that this was not quite in the LOD spirit, where parties would mint their own dereferenceable URIs, in their own domains. We'll see.
"What to do when identity expires?"
Giovanni of Sindice said that a document should be removed from search if it was no longer available. Kingsley pointed out that resilience of reference requires some way to recover data. The data web cannot be less resilient than the document web, and there is a point to having access to history. He recommended hooking up with the Internet Archive, since they make long-term persistence their business. In this way, if an application depends on data, and the URIs on which it depends are no longer dereferenceable or provide content from a new owner of the domain, those who need the old version can still get it and host it themselves.
It is increasingly clear that OWL SameAs is both the blessing and bane of linked data. We can easily have tens of URIs for the same thing, especially with people. Still, these should be considered the same.
Returning every synonym in a query answer hardly makes sense but accepting them as input seems almost necessary. This is what we do with Virtuoso's SameAs support. Even so, this can easily double query times even when there are no synonyms.
Be that as it may, SameAs is here to stay; just consider the mapping of DBpedia to Geonames, for example.
Also, making aberrant SameAs statements can completely poison a data set and lead to absurd query results. Hence choosing which SameAs assertions from which source will be considered seems necessary. In an open web scenario, this leads inevitably to multi-graph queries that can be complex to write with regular SPARQL. By extension, it seems that a good query would also include the graphs actually used for deriving each result row. This is of course possible but has some implications on how databases should be organized.
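The synonym handling discussed above can be sketched with a union-find over SameAs assertions, so that any synonym IRI normalizes to one canonical representative before query evaluation. This is an illustrative Python sketch only — not Virtuoso's actual implementation — and the IRIs are made up.

```python
# Sketch: collapse owl:sameAs synonym sets with a union-find, so that any
# synonym IRI can be normalized to one canonical representative.

class SameAs:
    def __init__(self):
        self.parent = {}

    def find(self, iri):
        """Return the canonical representative for iri."""
        self.parent.setdefault(iri, iri)
        while self.parent[iri] != iri:
            self.parent[iri] = self.parent[self.parent[iri]]  # path halving
            iri = self.parent[iri]
        return iri

    def same_as(self, a, b):
        """Record an owl:sameAs assertion between a and b."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Arbitrary but deterministic: the lexically smaller IRI wins.
            self.parent[max(ra, rb)] = min(ra, rb)

sa = SameAs()
sa.same_as("dbpedia:Berlin", "geonames:2950159")
sa.same_as("geonames:2950159", "freebase:berlin")
canonical = sa.find("freebase:berlin")  # all three now normalize to one IRI
```

Accepting any synonym as input then reduces to one `find` call per input IRI; the cost of also expanding synonyms inside join variables is what can double query times as noted above.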
Yves Raymond gave a talk about deriving identity between Musicbrainz and Jamendo. I see the issue as a core question of linked data in general. The algorithm Yves presented started with attribute value similarities and then followed related entities. Artists would be the same if they had similar names and similar names of albums with similar song titles, for example. We can find the same basic question in any analysis, for example, looking at how news reporting differs between media, supposing there is adequate entity extraction.
There is basic graph diffing in RDFSync, for example. But here we are expanding the context significantly. We will traverse references to some depth, allow similarity matches, SameAs, and so forth. Having presumed identity of two URIs, we can then look at the difference in their environment to produce a human readable summary. This could then be evaluated for purposes of analysis or of combining content.
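The matching idea — attribute similarity first, then agreement of the related entities — can be sketched roughly as follows. The thresholds, the similarity measure, and the sample records are all invented for illustration; the algorithm Yves presented is more elaborate.

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    # Crude string similarity stand-in for a real name-matching measure.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same(a, b, name_threshold=0.85, work_overlap=0.5):
    """Presume identity when names are similar AND related works overlap."""
    if name_sim(a["name"], b["name"]) < name_threshold:
        return False
    works_a = {w.lower() for w in a["works"]}
    works_b = {w.lower() for w in b["works"]}
    shared = len(works_a & works_b) / max(1, min(len(works_a), len(works_b)))
    return shared >= work_overlap

# Hypothetical records as they might appear in two catalogs.
mb    = {"name": "Silence", "works": ["L'autre endroit", "Encore une fois", "Vow"]}
jam   = {"name": "SILENCE", "works": ["l'autre endroit", "encore une fois"]}
other = {"name": "Silent Hill", "works": ["Theme"]}
```

Here `likely_same(mb, jam)` holds while `likely_same(mb, other)` does not; the interesting engineering is in choosing the similarity measures and in following the graph beyond one hop.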
At first sight, these algorithms seem well parallelizable, as long as all threads have access to all data. For scaling, this means a probably message-bound distributed algorithm. This is something to look into for the next stage of linked data.
Some inference is needed, but if everybody has their own choice of data sets to query, then everybody would also have their own entailed triples. This will make for an explosion of entailed graphs if forward chaining is used. Forward chaining is very nice because it keeps queries simple and easy to optimize. With Virtuoso, we still favor backward chaining since we expect a great diversity of graph combinations and near infinite volume in the open web scenario. With private repositories of slowly changing data put together for a special application, the situation is different.
In conclusion, we have a real LOD movement with actual momentum and a good idea of what to do next. The next step is promoting this to the broader community, starting with Linked Data Planet in New York in June.
]]>Thus did one SPARQL user describe the task of a semantic/data web search engine.
In a previous post, I suggested that if the data web were the size of the document web, we'd be looking at two orders of magnitude more search complexity. It just might be so.
In the conversation, I pointed out that a search engine might have a copy of everything and even a capability to do SPARQL and full text on it all, yet still the users would be better off doing the queries against the SPARQL end-points of the data publishers. It is a bit like the fact that not all web browsing runs off Google's cache. With the data web, the point is even more pronounced, as serving a hit from Google's cache is a small operation but a complex query might be a very large one.
Yet, the data web is about ad-hoc joining between data sets of different origins. Thus a search engine of the data web ought to be capable of joining also, even if large queries ought to be run against individual publishers' end-points or the user's own data warehouse.
For ranking, the general consensus was that no single hit-ranking would be good for the data web. Thus word frequency-based hit-scores are OK for text hits but more is not obvious. I would think that some link analysis could apply but this will take some more experimentation.
For search summaries, if we have splitting of data sets into small fragments à la Sindice, search summaries are pretty much the same as with just text search. If we store triples, then we can give text style summaries of text hits in literals and Fresnel lens views of the structured data around the literal. For showing a page of hits, the lenses must abbreviate heavily but this is still feasible. The engine would know about the most common ontologies and summarize instance data accordingly.
Chris Bizer pointed out that trust and provenance are critical, especially if an answer is arrived at by joining multiple data sets. The trust of the conclusion is no greater than that of the weakest participating document. Different users will have different trusted sources.
A mature data web search engine would combine a provenance/trust specification, a search condition consisting of SPARQL or full text or both, and a specification for hit rank. Again, most searches would use defaults, but these three components should in principle be orthogonally specifiable.
Many places may host the same data set either for download or SPARQL access. The URI of the data set is not its URL. Different places may further host multiple data sets on one end-point. Thus the search engine ought to return all end-points where the set is to be found. The end-points themselves ought to be able to say what data sets they contain, under what graph IRIs. Since there is no consensus about end-point self description, this too would be left to the search engine. In practice, this could be accomplished by extending Sindice's semantic site map specification. A possible query would be to find an end-point containing a set of named data sets. If none were found, the search engine itself could run a query joining all the sets since it at least would hold them all.
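A toy sketch of such a registry: it maps end-point URLs to the data-set URIs they claim to host, with a lookup for end-points holding all of a given list of sets. The URIs and URLs are of course hypothetical.

```python
# Hypothetical registry separating a data set's URI (its identity) from the
# URLs of the end-points hosting a copy, as discussed above.

registry = {}  # endpoint URL -> set of data-set URIs it claims to host

def register(endpoint, dataset_uris):
    registry.setdefault(endpoint, set()).update(dataset_uris)

def endpoints_hosting(dataset_uris):
    """End-points that hold every one of the requested data sets."""
    wanted = set(dataset_uris)
    return sorted(ep for ep, held in registry.items() if wanted <= held)

register("http://a.example/sparql", {"urn:uniprot", "urn:wordnet"})
register("http://b.example/sparql", {"urn:wordnet"})

both = endpoints_hosting({"urn:uniprot", "urn:wordnet"})  # only a.example
```

If `endpoints_hosting` comes back empty, the search engine itself can run the join, since it at least holds all the sets.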
Since many places will host sets like Wordnet or Uniprot, indexing these once for each copy hardly makes sense. Thus a site should identify its data by the data set's URI and not the copy's URL.
It came up in the discussion that search engines should share a ping format so that a single message format would be enough to notify any engine about data being updated. This is already partly the case with Sindice and PTSW (PingTheSemanticWeb) sharing a ping format.
Further, since it is no trouble to publish a copy of the 45G Uniprot file but a fair amount of work to index it, search engines should be smart about processing requests to index things, since these can amount to a denial of service attack.
Probably very large data sets should be indexed only in the form supplied by their publisher, and others hosting copies would just state that they hold a copy. If the claim to the copy proved false, users could complain and the search engine administrator would remove the listing. It seems that some manual curating cannot be avoided here.
It seems there can be an overlap between the data web search and the data web hosting businesses. For example, Talis rents space for hosting RDF data with SPARQL access. A search engine should offer basic indexing of everything for free, but could charge either data publishers or end users for running SPARQL queries across data sets. These do not have the nicely anticipatable and fairly uniform resource consumption of text lookups. In this manner, a search provider could cost-justify the capacity for allowing arbitrary queries.
The value of the data web consists of unexpected joining. Such joining takes place most efficiently if the sources are at least in some proximity, for example in the same data center. Thus the search provider could monetize functioning as the database provider for mesh-ups. In the document web, publishing pages is very simple and there is no great benefit from co-locating search and pages, rather the opposite. For the data web, the hosting with SPARQL and all is more complex and resembles providing search. Thus providing search can combine with providing SPARQL hosting, once we accept in principle that search should have arbitrary inter-document joining, even if it is at an extra premium.
The present search business model is advertising. If the data web is to be accessed by automated agents such as mesh-up code, display of ads is not self-evident. This is quite separate from the fact that semantics can lead to better ad targeting.
One model would be to do text lookups for free from a regular web page but show ads, à la Google search ads. Using the service via web services for text or SPARQL would have a cost paid by the searching or publishing party and would not be financed by advertising.
In the case of data used in value-add data products (mesh-ups) that have financial value to their users, the original publisher of the data could even be paid for keeping the data up-to-date. This would hold for any time-sensitive feeds like news or financial feeds. Thus the hosting/search provider would be a broker of data-use fees and the data producer would be in the position of an AdSense inventory owner, i.e., a web site which shows AdSense ads. Organizing this under a hub providing back-office functions similar to an ad network could make sense even if the actual processing were divided among many sites.
Kingsley has repeatedly formulated the core value proposition of the semantic web in terms of dealing with information overload: There is the real-time enterprise and the real-time individual and both are beasts of perception. Their image is won and lost in the Internet online conversation space. We know that allegations, even if later proven false, will stick if left unchallenged. The function of semantics on the web is to allow one to track and manage where one stands. In fact, Garlik has made a business of just this, but now from a privacy and security angle. The Garlik DataPatrol harvests data from diverse sources and allows assessing vulnerability to identity theft, for example.
If one is in the business of collating all the structured data in the world, as a data web search engine is, then providing custom alerts for both security and public-image management is quite natural. This can be a very valuable service if it works well.
At OpenLink, we will now experiment with the Sindice/Zitgist/PingTheSemanticWeb content. This is a regular part of the productization of Virtuoso's cluster edition. We expect to release some results in the next 4 weeks.
]]>For my part, I will draw some architecture conclusions from the different talks and extrapolate about the requirements on database platforms for linked data.
Chris Bizer predicted that 2008 would be the year of data web search, if 2007 was the year of SPARQL. This may be the case, as linked data is now pretty much a reality and the questions of discovery become prevalent. There was a birds-of-a-feather session on this and I will make some comments on what we intend to explore in bridging between the text index based semantic web search engines and SPARQL.
Andy Seaborne convened a birds-of-a-feather session on the future of SPARQL. Many of the already anticipated and implemented requirements were confirmed and a few were introduced. A separate blog post will discuss these further.
From the various discussions held throughout the conference, we conclude that plug-and-play operation with the major semantic web frameworks of Jena, Sesame, and Redland is our major immediate-term deliverable. Our efforts in this direction thus far are insufficient, and we will next have these done with the right supervision and proper interop testing. The issues are fortunately simple, but doing things totally right requires some small server-side support and some JDBC/ODBC tweaks, so we advise the interested to wait for an update to be published on this blog.
I further had a conversation with Andy Seaborne about using Jena reasoning capabilities with Virtuoso and generally the issues of "impedance mismatch" between reasoning and typical database workloads. More on this later.
]]>On the RDF side, the bitmap intersection join has been improved quite a bit so that it is now almost always more than 2x more efficient than the equivalent nested loop join.
XML trees in the object position in RDF quads were in some cases incorrectly indexed, leading to failure to retrieve quads. This is fixed and should problems occur in existing databases, they can be corrected by simply dropping and re-creating an index.
Also the cost model has been further tuned. We have run the TPC-H queries with larger databases and have profiled it extensively. There are improvements to locking, especially for concurrency of transactions with large shared lock sets, as is the case in the TPC-H queries. The rules stipulate that these have to be run with repeatable read. There are also optimizations for decimal floating point.
A sampling of TPC-H queries translated into SPARQL comes with the new demo database. These show a live sample of the TPC-H schema translated into linked data, complete with SPARQL translations of the original queries. Some work is still ongoing there but the relational to RDF mapping is mature enough for real business intelligence applications now.
On the closed source side, we have some adjustments to the virtual database. When using Virtuoso as a front end to Oracle, using the TPC-H queries as a metric, the virtual database overhead is minimal. Previously, we had some overhead because some queries were rewritten in a way that Oracle would not optimize as well as the original TPC-H text. Specifically, turning an IN sub-query predicate into an equivalent EXISTS did not sit well with Oracle.
]]>This is an example of using Virtuoso's relational-to-RDF mapping for publishing business data, for browsing using the linked data principles and opening it to analytics queries in SPARQL.
As noted before, we have extended SPARQL with aggregation and nested queries, thus making it a viable SQL substitute for decision support queries.
The article at http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData gives details and the source code for the implementation.
We are still working on some aspects of the more complex TPC-H queries, thus the demo is not complete with all the 22 queries. This is however enough to see a representative sample of how analytics queries work with SPARQL and Virtuoso's SQL-to-RDF mapping. The demo will be part of the next Virtuoso Open Source download, probably out next week.
]]>A blow-by-blow analysis of the performance of a complex system such as a DBMS is more than fits within the scope of human attention at one go. This is why this all must be abbreviated into a single metric. Only when thus abbreviated, can this information be used in context. The metric's practical value is relative to how well it predicts the performance of the system in some real task. This means a task not likely to be addressed by an alternative technology, unless the challenger clearly beats the incumbent.
A benchmark is promotional material, both for the system being benchmarked and for the technology as a whole. This is why the benchmark, whatever it does, should do something that the technology does well, surely better than any alternative technology. A case in point is that one ought not to take a pure relational workload and RDF-ize it, for then the relational variant is likely to come out on top.
In this regard LUBM is not so bad because its reliance on class and property hierarchies and the occasional transitivity or inference rule makes the workload typically RDF, a little ways apart from a purely relational implementation of the task.
RDF's claim to fame is linked data. This means giving things globally unique names and thereby making anything joinable with anything else, insofar as there is agreement on the names. RDF is a key to a new class of problems, call it web-scale database. Web scale here refers first to the heterogeneity and multiplicity of independent sources and secondly to the volume of data.
Now there are plenty of relational applications with very large volumes of data. On the non-relational side, there are even larger applications, such as web search engines. All these have a set schema and a specific workload they are meant to address. RDF versions of such are conceivable but hold no intrinsic advantage if considered in the specific niche alone.
The claim to fame of RDF is not to outperform these on their home turf but to open another turf altogether, allowing agile joining and composing of all these resources.
This is why a benchmark, i.e., an advertisement for the RDF value proposition, should not just take a relational workload and RDF-ize it. The benchmark should carry some of the web in it.
If we just intend to measure how well an RDF store joins triples to other triples, LUBM is almost good enough. If it defined a query mix with different frequencies for short and long queries and a concurrent query metric, it would be pretty much there. Our adaptation of it is adequate for counting joins per second. But joins per second is not a value proposition.
So we have two questions:
If we just take the RDF model and SPARQL, how do we make a benchmark that fills in what LUBM does not cover?
And, beyond that, how do we make a benchmark that shows RDF at its best?
The answers to the first are not very complex:
Add some optionals. Have different frequencies of occurrence for some properties.
Add different graphs. Make queries joining between graphs and drawing on different graphs. Querying against all graphs of the store is not a part of the language; still, this would be useful, but we leave it out for now.
Add some filters and arithmetic. Not much can be done there, though, because expressions cannot be returned and there is no aggregation or grouping.
Split the workload into short and long queries. The short should be typical for online use and the long ones for analysis. Different execution frequencies for different queries are a must. Analysis is limited by the lack of grouping, expressions, or aggregation. Still, something can be contrived by looking for a pattern that does not exist or occurs extremely rarely. Producing result sets of millions of rows is not realistic.
Many of the LUBM queries return thousands of rows, even when scoped to a single university. This is not very realistic; no user interface displays that sort of quantity. Of course, the intermediate results can be as large as you please, but the output must be somehow ranked. SPARQL has order by and limit, so these will have to be used. TPC-H, for example, almost always has a group by/order by combination and sometimes a result-row limit.
The degree of inference in LUBM is about right, mostly sub-classes and sub-properties, nothing complex. We certainly regard this as a database benchmark more than a knowledge representation or rule system one.
LUBM does an OK job of defining a scale factor. I think that a concurrent query metric can just be so many queries per unit of time at a given scale. The number of clients, I would say, can be decided by the test sponsor, taking whatever works best. A load balancer or web server can always be tuned to enforce some limit on concurrency. I don't think that a scale rule like TPC-C's, where only so many transactions per minute are allowed per warehouse, is needed here. The effect of such a rule is that reporting a higher throughput automatically requires a bigger database.
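As a back-of-envelope illustration, the metric could be as simple as total completed queries from a weighted mix divided by wall-clock time at a given scale. The mix names and counts below are invented.

```python
def mix_throughput(executions, wall_clock_seconds):
    """Queries-per-second over a mixed workload.

    executions: (query_name, completed_count) pairs for one timed interval.
    """
    total = sum(count for _, count in executions)
    return total / wall_clock_seconds

# Hypothetical mix: mostly short lookups, a few long analytical queries.
runs = [("short_lookup", 9000), ("mid_join", 900), ("long_analytic", 100)]
qps = mix_throughput(runs, 600.0)  # 10,000 queries in 10 minutes
```

The scale factor would simply be reported alongside the number, with the client count left to the test sponsor as suggested above.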
There is nothing to prevent these improvements from being put into a subsequent version of LUBM.
Building something that shows RDF at its best is a slightly different proposition. For this, we cannot be limited to the SPARQL recommendation and must allow custom application code and language extensions. Examples would be scripting similar to SQL stored procedures and extensions such as we have made for sub-queries and aggregation, explained a couple of posts back.
Maybe the Billion Triples challenge produces some material that we can use for this. We need to go for spaces that are not easily reached with SQL, have distributed computing, federation, discovery, demand driven import of data and such like.
I'll write more about ways of making RDF shine in some future post.
There are two kinds of workloads: online and offline. Online is what must be performed in an interactive situation, without significant human perceptible delay, i.e. within 500 ms. Anything else is offline.
Because this is how any online system is designed, this should be reflected in the benchmark. Ideally we would make two benchmarks.
Below is the Virtuoso SPARQL version of Q2.
sparql
define sql:signal-void-variables 1
prefix tpcd: <http://www.openlinksw.com/schemas/tpcd#>
prefix oplsioc: <http://www.openlinksw.com/schemas/oplsioc#>
prefix sioc: <http://rdfs.org/sioc/ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
select
  ?supp+>tpcd:acctbal, ?supp+>tpcd:name,
  ?supp+>tpcd:has_nation+>tpcd:name as ?nation_name,
  ?part+>tpcd:partkey, ?part+>tpcd:mfgr,
  ?supp+>tpcd:address, ?supp+>tpcd:phone, ?supp+>tpcd:comment
from <http://example.com/tpcd>
where
  {
    ?ps a tpcd:partsupp ;
        tpcd:has_supplier ?supp ;
        tpcd:has_part ?part .
    ?supp+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' .
    ?part tpcd:size 15 .
    ?ps tpcd:supplycost ?minsc .
    { select ?p min(?ps+>tpcd:supplycost) as ?minsc
      where
        {
          ?ps a tpcd:partsupp ;
              tpcd:has_part ?p ;
              tpcd:has_supplier ?ms .
          ?ms+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' .
        }
    }
    filter (?part+>tpcd:type like '%BRASS')
  }
order by
  desc (?supp+>tpcd:acctbal)
  ?supp+>tpcd:has_nation+>tpcd:name
  ?supp+>tpcd:name
  ?part+>tpcd:partkey ;
Note the pattern
{ ?ms+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' }
which is a shorthand for
{ ?ms tpcd:has_nation ?t1 . ?t1 tpcd:has_region ?t2 . ?t2 tpcd:name "EUROPE" }
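The expansion is mechanical: each +> step introduces a fresh intermediate variable. A toy expander in Python (illustrative only, not Virtuoso's actual implementation) might look like:

```python
def expand_path(subject, path_preds, pred, obj):
    """Expand a s+>p1+>p2 shorthand into plain triple patterns.
    Each +> step introduces a fresh intermediate variable."""
    triples = []
    current = subject
    for i, p in enumerate(path_preds, start=1):
        var = "?t%d" % i
        triples.append((current, p, var))
        current = var
    triples.append((current, pred, obj))
    return triples

pats = expand_path("?ms", ["tpcd:has_nation", "tpcd:has_region"],
                   "tpcd:name", "'EUROPE'")
for s, p, o in pats:
    print(s, p, o, ".")
```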
Also note a sub-query is used for determining the lowest supply cost for a part.
The SQL text of the query can be found in the TPC H benchmark specification, reproduced below:
select
  s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from
  part, supplier, partsupp, nation, region
where
  p_partkey = ps_partkey
  and s_suppkey = ps_suppkey
  and p_size = 15
  and p_type like '%BRASS'
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'EUROPE'
  and ps_supplycost =
    ( select
        min(ps_supplycost)
      from
        partsupp, supplier, nation, region
      where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE' )
order by
  s_acctbal desc, n_name, s_name, p_partkey;
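The heart of Q2, in both renditions, is the correlated subquery that keeps only the suppliers whose supply cost equals the per-part minimum. A minimal sketch of that shape against a toy in-memory SQLite table (invented data, nowhere near TPC-H scale):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE partsupp (ps_partkey INT, ps_suppkey INT, ps_supplycost REAL);
INSERT INTO partsupp VALUES
  (1, 10, 5.0), (1, 11, 3.0),
  (2, 10, 7.0), (2, 12, 7.0);
""")

# Keep, per part, only the suppliers offering the minimum supply cost;
# ties survive, just as they do in Q2.
rows = con.execute("""
    SELECT ps_partkey, ps_suppkey
    FROM partsupp ps
    WHERE ps_supplycost = (SELECT MIN(ps_supplycost)
                           FROM partsupp ps2
                           WHERE ps2.ps_partkey = ps.ps_partkey)
    ORDER BY ps_partkey, ps_suppkey""").fetchall()
print(rows)  # [(1, 11), (2, 10), (2, 12)]
```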
For brevity we have omitted the declarations for mapping the TPC H schema to its RDF equivalent. The mapping is straightforward, with each column mapping to a predicate and each table to a class.
This is now part of the next Virtuoso Open Source cut, due around next week.
As of this writing we are going through the TPC H query by query and testing with mapping going to Virtuoso and Oracle databases.
Also we have been busy measuring Virtuoso 6. Even after switching from 32-bit to 64-bit IDs for IRIs and objects, the new databases are about half the size of the same Virtuoso 5.0.2 databases. This does not include any stream compression like gzip for disk pages. The load and query speeds are higher because of the better working set; for all-in-memory workloads, they are about even with 5.0.2. So now on an 8G box, we load 1067 million LUBM triples at 39.7 Kt/s instead of 29 Kt/s with 5.0.2. Right now we are experimenting with clusters at Amazon EC2. We'll write about that in a bit.
]]>A year ago, I was in the middle of redoing the Virtuoso database engine for better SMP performance. We redid the way traversal of index structures and cache buffers was serialized for SMP, and generally compared Virtuoso and Oracle engines function by function. We had just returned from the ISWC 2006 in Athens, Georgia, and the Virtuoso database was becoming a usable triple store.
Soon thereafter, we confirmed that all this worked when we put out the first cut of Dbpedia with Chris Bizer, et al, and were working with Alan Ruttenberg on what would become the Banff health care and life sciences demo.
The WWW 2007 conference in Banff, Canada, was a sort of kick-off for the Linking Open Data movement, which started as a community project under SWEO, the W3C interest group for Semantic Web Education and Outreach, and has gained a life of its own since.
Right after WWW 2007, the Virtuoso development effort split onto two tracks: one for enhancing the then new 5.0 release; and one for building a new generation of Virtuoso, notably featuring clustering and double storage density for RDF.
The first track produced constant improvements to the relational-to-RDF mapping functionality, SPARQL enhancements, and Redland-, Jena- and Sesame-compatible client libraries with Virtuoso as a triple store. These things have been out with testers for a while and are all generally available as of this writing.
The second track started with adding key compression to the storage engine, specifically with regard to RDF, even though there are some gains in relational applications as well. With RDF, the space consumption drops to about half, all without recourse to any compression, such as gzip, that is incompatible with random access. Since the start of August we have turned to clustering and are now code complete, with pretty much all the tricks one would expect: full-function SQL, co-located joins, and aggregation and generally all possible processing done where the data is. I have covered details of this along the way in previous posts. The key point is that the thing is now written and works with test cases.
In late October, we were at the W3C workshop for mapping relational data to RDF. For us, this confirmed the importance of mapping and scalability in general. Ivan Herman proposed forming a W3C incubator group on benchmarking. Also a W3C incubator group of relational to RDF mapping is being formed.
Now, scalability has two sides. One is dealing with volume, and the other is dealing with complexity. Volume alone will not help if interesting queries cannot be formulated. Hence, we recently extended SPARQL with sub-queries so that we can now express at least any SQL workload, which was previously not the case. It is sort of a contradiction in terms to say that SPARQL is the universal language for information integration while not being able to express, for example, the TPC-H queries. Well, we fixed this. A separate post will highlight how. The W3C process will eventually follow, as the necessity of these things is undeniable, on the unimpeachable authority of the whole SQL world. Anyway, for now, SPARQL as it is ought to become a recommendation, and extensions can be addressed later.
For now, the only RDF benchmark that seems to be out there is the loading part of the LUBM. We did a couple of enhancements of our own for that just recently, but much bigger things are on the way. Also, the billion triples challenge is an interesting initiative in the area. We all recognize that loading any number of triples is a finite problem with known solutions. The challenge is running interesting queries on large volumes.
Our present emphasis is demonstrating both RDF data warehousing and RDF mapping with complex queries and large data. We start with the TPC-H benchmark, running the queries both through mapping to SQL against any RDBMS (Oracle, DB2, Virtuoso, or other) and by querying the physical RDF rendition of the data in Virtuoso. From there, we move to querying a collection of RDBMSs hosting similar data.
Doing this with performance at the level of direct SQL in the case of mapping and not very much slower with physical triples is an important milestone on the way to a real world enterprise data web. Real life has harder and more unexpected issues than a benchmark, but at any rate doing the benchmark without breaking a sweat is a step on the way. We sent a paper to ESWC 2008 about that but it was rather incomplete. By the time of the VLDB submissions deadline in March we'll have more meat.
Another tack soon to start is a re-architecting of Zitgist around clustered Virtuoso. Aside from matters of scale, we will make a number of qualitatively new things possible. Again, more will be released in the first quarter of 2008.
Beyond these short and mid-term goals we have the introduction of entirely dynamic and demand-driven partitioning, a la Google Bigtable or Amazon Dynamo. Regular partitioning will do for a while yet, but this is the future when we move to the vision of linked data everywhere.
In conclusion, this year we have built the basis and the next year is about deployment. The bulk of really new development is behind us and now we start applying. Also, the community will find adoption easier due to our recent support of the common RDF APIs.
The scenario is meant to capture the case of an existing database infrastructure being translated into RDF for ad hoc query and integration. This may serve to publish some of the data, such as product catalog information and order status, as an outside-accessible SPARQL end point. The RDF modeling of the data may also facilitate in-house analytics. The tentative workload is intended to represent both aspects.
The below is very tentative, all imaginable caveats apply.
To avoid reinventing the wheel, we start with the TPC C and H databases. The C database is an online order processing system and the H database is a data warehouse of orders and shipments. Both databases have orders, products and customers but are clearly not meant to be joined together. The keys for the basic entities consist of different numbers of parts, for example.
Basing this benchmark on the TPC work saves the design of the schemas and use cases.
Given this, what could we measure? From the RDF angle, we are not concerned with update transactions.
The RDF mapping benchmark must expose any information that could be had from the relational sources by themselves plus demonstrate a layer of unification that facilitates querying without significant performance overhead.
First, let us see about preserving existing functionality:
The TPC-H queries cannot be represented with SPARQL since they all involve GROUP BY. They can however be made into parameterized views of some sort, and these views could be mapped into RDF. For example, Q1 is a report on order lines grouped by shipped and returned status over a period of time. The period of time does not figure in the result set. One cannot parameterize the view by just specifying a range for the time, as in
SELECT * FROM q1_view WHERE dt BETWEEN ? AND ?
There simply is no time in the output. If the time were added as a min and max of times, then one could have a condition on this and the query processor could infer that only order lines from between these times would be considered. In practice, we expect a parametric view of some vendor dependent type to be used.
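One way to picture such a parametric view: a wrapper that injects the date range as a predicate on the base table, since the range cannot appear in the grouped output. A hedged sketch with SQLite and invented data (the real mechanism would be a vendor-dependent parametric view, as said):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lineitem (l_shipdate TEXT, l_returnflag TEXT, l_extendedprice REAL);
INSERT INTO lineitem VALUES
  ('1998-01-05', 'A', 100.0),
  ('1998-02-10', 'A', 50.0),
  ('1998-06-01', 'R', 80.0);
""")

def q1_report(con, dt1, dt2):
    """Parametric Q1-style report: the date range is pushed into the WHERE
    clause of the underlying query, since it cannot appear in the grouped
    output columns."""
    return con.execute("""
        SELECT l_returnflag, SUM(l_extendedprice)
        FROM lineitem
        WHERE l_shipdate BETWEEN ? AND ?
        GROUP BY l_returnflag
        ORDER BY l_returnflag""", (dt1, dt2)).fetchall()

report = q1_report(con, '1998-01-01', '1998-03-01')
print(report)  # [('A', 150.0)]
```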
How would one invoke a parameterized view from SPARQL? Like
{ ?r1 a q1_report . ?r1 start_time ?dt1 . ?r1 end_time ?dt2 . ?r1 returnflag ?rf . ?r1 extended_price ?price . }
So ?r1 would be instantiated to each row of the report view, and the start and end times would be considered specially, requiring ?dt1 and ?dt2 to be bound.
In this way RDF could be used for accessing the TPC-H queries and for specifying values for the substitution parameters. How each implementation achieves this is left to their discretion.
Now, let us see what more is enabled by RDF.
As said, the C and H schemas were designed separately and are not intended for joining. A mapping between C and H is conceptually meaningful between customer, geographic area and product. Orders and order lines exist in both. For purposes of our benchmark, we consider the C database to be the recent history and the H database the longer term history, with possible overlap between the two.
To make things more interesting, we could have multiple separately populated C and H databases, for example split according to product type or geographical area. We may consider this later; for now let us just consider one C database and one H database. We will assume that these are hosted by different DBMS instances and that the DBMS managing one cannot join with the other within a single SQL query. Thus, distributed joining is to be accomplished by the RDF layer.
The queries to be answered are among the following:
Browsing - Any primary key is to be an IRI. Any foreign key relation is to be navigable through a predicate representing this. Take
?customer has_order ?order
This is to be mapped to both customer to order relations, from the C as well as the H databases. Thus the query
?c has_name "ACME" . ?c has_order ?o . ?o has_line ?l . ?l has_product ?p . ?p has_name ?n .
should get all names of products ever ordered by ACME in either database.
The has_order and other predicates will have sub-predicates specifically referring to the customer-to-order relationship in one of the databases.
Periodically any of the TPC-H queries. This is evaluated by the RDBMS and the RDF layer is not expected to affect the performance. This is included simply in order to demonstrate that this can be done.
Advertising - Given the customer's purchase history, the query selects advertising to show when the user connects to the E-commerce web site running off the C database.
Product family - The products in both databases are classified by a SKOS vocabulary. Both databases can list products based on queries using the SKOS terminology. The SKOS taxonomy is mapped into the different fields of the source product tables, to be specified.
Product recommendations - Find customers with a similar purchasing history and recommend products, excluding ones the user has already purchased.
Ad-hoc describe of product or customer, fetching information from both databases. For example, get the shipping delay of orders of products of category c now and this time last year.
Faceted browsing of products and recent orders. Start with a category and drill down into individual orders and customers. This would represent a customer browsing the catalog and personal order information through a web user interface.
Extraction - Given a time interval, retrieve all orders and order lines placed within this interval, as well as the customers, products, and suppliers reachable from these orders, and export them as RDF text. The rationale is to load the data into an RDF warehouse, presumably for off-line joining with other data sets.
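The sub-predicate arrangement described under Browsing can be sketched as a simple subsumption check: a query on the generic has_order predicate also matches triples asserted with a per-database sub-predicate. All names below are invented for illustration:

```python
# Hypothetical sub-predicate declarations mapping each database's
# customer-to-order relation under the generic has_order.
SUBPROPERTY = {
    "c:customer_order": "has_order",  # C (online order processing) database
    "h:customer_order": "has_order",  # H (warehouse) database
}

triples = [
    ("cust:ACME", "c:customer_order", "c:order_1"),
    ("cust:ACME", "h:customer_order", "h:order_77"),
]

def match(triples, pred):
    """Triples asserted with pred or with any declared sub-predicate of pred."""
    return [t for t in triples
            if t[1] == pred or SUBPROPERTY.get(t[1]) == pred]

print(match(triples, "has_order"))  # both orders, from either database
```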
We cannot go into the details of the workload here; ad hoc experimentation will be needed to see what interesting queries can be made against the data and what other data structures are needed to support them, for example a cube of pre-computed totals for faceted drill-down.
The databases should be scaled so that the H database has 1.5x as many products, as there may be discontinued products or versions that still figure in the history but are no longer stocked.
The H database should have the same number of customers as the C database.
The H database should have 18x the number of orders. If the C database covers the last two months, roughly the longest delay between order and shipping, the H database has 3 years of history.
Since the primary keys in the databases are different, we use the customer names and phone numbers to establish identity.
Products are joined based on product name.
The data must be tweaked so that these will match for a sufficient number of rows.
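Since neither DBMS can join with the other, the distributed identity join on (name, phone) falls to the mediating layer. A toy sketch with two separate in-memory databases (schemas and values invented):

```python
import sqlite3

# Two separate databases; the DBMS managing one cannot join with the other.
c_db = sqlite3.connect(":memory:")
c_db.execute("CREATE TABLE customer (c_name TEXT, c_phone TEXT, c_id INT)")
c_db.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [("ACME", "555-0100", 1), ("Widget Co", "555-0200", 2)])

h_db = sqlite3.connect(":memory:")
h_db.execute("CREATE TABLE customer (c_name TEXT, c_phone TEXT, c_custkey INT)")
h_db.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [("ACME", "555-0100", 901), ("Other", "555-0300", 902)])

# The mediating (RDF) layer performs the distributed join on (name, phone).
c_rows = {(n, p): i for n, p, i in c_db.execute("SELECT * FROM customer")}
matches = [(c_rows[(n, p)], key)
           for n, p, key in h_db.execute("SELECT * FROM customer")
           if (n, p) in c_rows]
print(matches)  # [(1, 901)]
```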
We expect custom code to be required for most of the above tasks. All such code should be disclosed.
The metric should not recapture the TPC-H metric. Hence the H queries per se are not measured, they are rather something that must be possible through the RDF interface.
It is reasonable to publish the run time of each of the 22 queries when submitted as SQL and when submitted through the RDF view interface. We do not expect great divergence.
Two actual metrics can be thought of. One for the extraction and the other for short ad hoc lookups.
We expect that persistent structures not present in the C and H databases will be needed for supporting the queries. Implementing these should not be predicated on having write access to the relational databases past the stage of setting up the benchmark data sets.
The costing should not include the C and H databases and supporting systems. These will already be paid for in the usage scenario. Any extra hardware and software should be included. The major part of the cost is expected to be in terms of custom coding and configuration. These should be reported as lines of code, specifying lines per each language used. Monetary cost of these will depend on the proficiency of the parties with the technologies and cannot be accurately measured.
The problem with these is that it is not really possible to ask interesting questions over a large database without involving some sort of counting and grouping. I feel that we simply cannot make a representative benchmark without these, quite regardless of the fact that SPARQL in its present form does not have these features. Hence I have simply stated the questions and left any implementation open. If this seems like an interesting direction, the nascent W3C benchmarking XG (experimental group) can refine the business questions, relative query frequencies, exact data set composition, etc.
by Orri Erling
This benchmark models the use of RDF for representing and analyzing the use of social software by user communities. The benchmark consists of a scalable synthetic data set, a feed of updates to the data set, and a query mix. The data set reflects the common characteristics of the social web, with realistic distributions of connections, user-contributed content, commenting, tagging, and other social web activities. The data set is expressed in the FOAF and SIOC vocabularies. The query mix is divided between relatively short, dashboard or search engine style lookups, and longer-running analytics queries.
The system being modeled is an aggregator of social web content; we could liken it to an RDF-based Technorati with some extra features.
Users can publish their favorite queries or mesh-ups as logical views served by the system. In this manner, queries come to depend on other queries, somewhat like SQL VIEWs can reference each other.
There is a small qualification data set that can be tested against the queries to validate that the system under test (SUT) produces the correct results.
The benchmark is scaled by number of users. To facilitate comparison, some predefined scales are offered, i.e., 100K, 300K, 1M, 3M, 10M users. Each simulated user both produces and consumes content. The level of activity is unevenly distributed across users.
There are two work mixes: the browsing mix, which consists of a mix of lookups and contributing content, and the analytics mix, which consists of long-running queries for tracking the state of the network. For each 100 browsing mixes, one analytics mix is performed.
A benchmark run is at least 1h real-time in duration. The metric is calculated by the number of browsing mixes completed during the test window. This simulates 10% of the users being online at any one time, thus for a scale of 1M users, 100K browsing mixes will be simultaneously proceeding.
The test driver submits the work via HTTP. What load balancing or degree of parallel serving of the requests is used is left up to the SUT.
The metric is expressed as queries per second, taking the total number of queries executed by completed browsing mixes and dividing this by the real time of the measurement window. The metric is called qpsSW, for queries per second, socialweb. The cost metric is $/qpsSW, calculated by the costing rules of the TPC. If compute-on-demand infrastructure is used, the costing will be $/qpsSW/day.
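The arithmetic of the metric is simple; the sketch below uses hypothetical numbers (not from any actual run):

```python
def qps_sw(completed_mixes, queries_per_mix, window_seconds):
    """qpsSW: queries executed by completed browsing mixes, per second
    of the measurement window."""
    return completed_mixes * queries_per_mix / window_seconds

# Hypothetical run, numbers invented: 360,000 completed browsing mixes of
# 12 queries each during a one-hour measurement window.
metric = qps_sw(360_000, 12, 3600)
system_cost = 600_000  # assumed total priced system cost in dollars
print(metric, system_cost / metric)  # 1200.0 qpsSW at $500.0/qpsSW
```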
The test sponsor is the party contributing the result. The contribution consists of the metric and of a full disclosure report (FDR), written following a template given in the benchmark specification. The disclosure requirements follow the TPC practices, including publishing any configuration scripts, data definition language statements, timing for warm-up and test window, times for individual queries etc. All details of the hardware and software are disclosed.
The software consists of the data generator and of a test driver. The test driver calls functions supplied by the test sponsor for performing the diverse operations in the test. Source code for any modifications of the test driver is to be published as part of the FDR.
Any hardware/software combination â including single machines, clusters, clusters rented from computer providers like Amazon EC2 â is eligible.
The SUT must produce correct answers for the validation queries against the validation data set.
The implementation of the queries is not restricted. These can be any SPARQL or other queries, application server based logic, stored procedures or other, in any language, provided full source code is provided in the FDR.
The data set is provided as serialized RDF. The means of storage are left up to the SUT. The basic intention is to use a triple store of some form, but the specific indexing, use of property tables, materialized views, and so forth, is left up to the test sponsor. All tuning and configuration is to be published in the FDR.
For each operation of each mix, the specification shall present:
The logical intent of the operation, the business question, e.g., What is the hot topic among my friends?
The question or update expressed in terms of the data in the data set.
Sample text of a query answering the question or pseudo-code for deriving the answer.
Result set layout, if applicable.
The relative frequencies of the queries are given in the query mix summary.
The browsing mix consists of the following operations:
Make a blog post.
Make a blog comment.
Make a new social contact.
For one new social contact, there are 10 posts and 20 comments.
What are the 10 most recent posts by somebody in my friends or their friends? This would be a typical dashboard item.
What are the authoritative bloggers on topic x? This is a moderately complex ad-hoc query. Take posts tagged with the topic, count links to them, take the blogs containing them, show the 10 most cited blogs with the most recent posts with the tag. This would be typical of a stored query, like a parameterizable report.
How do I contact person x? Calculate the chain of common acquaintances best for reaching person x. For practicality, we do not do a full walk of anything but just take the distinct persons in 2 steps of the user and in 2 steps of x and see the intersection.
Who are the people like me? Find the top 10 people ranked by count of tags in common in the person's tag cloud. The tag cloud is the set of interests and the set of tags in blog posts of the person.
Who reacts to or talks about me? Count of replies to material by the user, grouped by the commenting user and the site of the comment; top 20, sorted by count descending.
Who are my fans that I do not know? Same as above, excluding people within 2 steps.
Who are my competitors? Most prolific posters on topics of my interest that do not cite me.
Where is the action? On forums where I participate, what are the top 5 threads, as measured by posts in the last day. Show count of posts in the last day and the day before that.
How do I get there? Who are the people active around both topic x and y? This is defined by a person having participated during the last year in forums of x as well as of y. Forums are tagged by topics. The most active users are first. The ranking is proportional to the sum of the number of posts in x and y.
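The "How do I contact person x?" item above deliberately avoids a full graph walk: it intersects the 2-step neighborhoods of the two people. A sketch on a toy acquaintance graph (invented data):

```python
from itertools import chain

def within_two_steps(adj, person):
    """People reachable from person in at most two 'knows' steps."""
    one = set(adj.get(person, ()))
    two = set(chain.from_iterable(adj.get(p, ()) for p in one))
    return (one | two) - {person}

# Toy acquaintance graph, invented for illustration.
adj = {
    "me": ["a", "b"],
    "a":  ["c"],
    "b":  ["d"],
    "x":  ["e"],
    "e":  ["c", "d"],
}

# Candidate intermediaries for reaching x: intersect the two 2-step hulls.
middlemen = within_two_steps(adj, "me") & within_two_steps(adj, "x")
print(sorted(middlemen))  # ['c', 'd']
```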
These queries are typical questions about the state of the conversation space as a whole and can for example be published as a weekly summary page.
The fastest propagating idea - What is the topic with the most users who have joined in the last day? A user is considered to have joined if the user was not discussing this in the past 10 days.
Prime movers - Which users start conversations? A conversation is the set of material in reply to or citing a post. The reply distance can be arbitrarily long; the citing distance is a direct link to the original post or a reply thereto. The number and extent of conversations contribute towards the score.
Geography - Over the last 10 days, for each geographic area, show the top 50 tags. The location is the location of the poster.
Social hubs - For each community, get the top 5 people who are central to it in terms of number of links to other members of the same community and in terms of being linked from posts. A community is the set of forums that have a specific topic.
The common feeling was that mapping everything to RDF and querying it in terms of a generic domain ontology, mapped on demand into whatever line of business systems, would be very good if it only could be done. However, since this is not so easily done, the next best is to extract the data and then warehouse it as RDF.
The obstacles perceived were of the following types:
Lack of quality in the data. The different line of business systems do not in and of themselves hold enough semantics. If the meaning of data columns in relational tables were really known and explicit, these could be meaningfully used for joining across systems. But this is more complex than just mapping the metal lead to the chemical symbol Pb and back.
Lack of performance in RDF storage. Data sets even in the tens-of-millions of triples do not run very well in some stores. Well, we had the Banff life sciences demo with 450M triples in a small server box running Virtuoso, so this is not universal, plus of course we are coming up with a whole different order of magnitude, as often discussed on this blog.
Lack of functionality in mapping and possibly lack of pushing through enough of the query processing to the underlying data stores.
Personally, I am quite aware of what to do with regard to performance of mapping and storage, and see these as eminently solvable issues. After all, we have a great investment of talent in databases in general and it can be well deployed towards RDF, as we have been doing these past couple of years. So we talk about the promise of a 360-degree view of information, with RDF being the top layer. Everybody agrees that this is a nice concept. But it is a nice concept especially when it can do the things that are the most common baseline expectation of any regular DBMS, i.e., aggregation, grouping, sub-queries, VIEWs. Now, I would not go sell a DBMS that has no COUNT operator to a data warehousing shop.
The fact that OpenLink and Oracle allow RDF inside SQL, and OpenLink even adds native aggregates and grouping to SPARQL, fixes the problem with regard to specific products, but leaves the standardization issue open. Of course, any vendor will solve these questions one way or another because a database with no aggregation is a non-starter.
I talked to Lee Feigenbaum, chair of the W3C DAWG, about the question of aggregates and general BI capabilities in SPARQL. He told me that, prior to his time with the DAWG, these were left out because they conflicted with the open-world assumption around RDF: You cannot count a set because by definition you do not know that you have all the members, the world being open and all that.
Say what? Talk about the road to hell being paved with good intentions. Now, this is in no way Lee's or the present day DAWG's fault; as a member myself, I can attest to the good work and would under no circumstances wish any delays or revisions to SPARQL at this point. I am just pointing out a matter that all implementations should address, as a sort of precondition of entry into the real world IS space. If this can be done interoperably, so much the better.
Now, out of the deliberations at the Boston workshop arose at least two ideas for follow-up activity.
The first was an incubator group for RDF store and mapping benchmarking. This is very appropriate in order to dispel the bad name RDF storage and querying performance has been saddled with. As a first step in this direction, I will outline a social web oriented benchmark on this blog.
The second activity was an incubator group for preparing standardization of mapping methodologies from relational schemas to RDF. We will be active on this as well.
The two offshoots appear logically separate but are not necessarily so in practice. A benchmark is after all something that is supposed to promote a technology to a user base. The user base seems to wish to put all online systems and data warehouses under a common top level RDF model and then query away, introducing no further replication of data or performance cost or ETL latencies.
Updating would also be nice but even query only would be very good. Personally, I'd say the RDF strength is all on the query side. Transactions are taken care of well enough by what there already is, RDF stands out in integration and the ad-hoc and discovery side of the matter. Given this, we expect the value to be consumed in a heterogeneous, multi-database, federated environment. Thus a benchmark should measure this aspect of the use-case. With the right mapping and queries, we could probably demonstrate the added cost of RDF to be very low, as long as we could push all queries that can be answered by a single source to the responsible DBMS. For distributed joins, we are back at the question of optimizing distributed queries but this is a familiar one and RDF is not the principal cost factor.
The subject does become quite complex at this point. We would have to take supposedly representative synthetic OLTP and BI data sets (like the ones in TPC-D, TPC-E, and TPC-H), and invent queries across them that would both make sense and be implementable in SPARQL extended with aggregates and sub-queries. Reliance on SPARQL extensions is simply unavoidable. Setting up the test systems would be non-trivial, even though there is a lot of industry experience in these matters on the database side.
So, while this is probably the benchmark most relevant to the target audience, we may have to start with a simpler one. I will next outline something to the effect.
But first, some support technology. We have an entirely new disk allocation and IO system. It is basically operational but needs some further tuning. It offers much better locality and much better sequential access speeds.
Specifically for dealing with large RDF databases, we will introduce data compression. We have over the years looked at different key compression possibilities but have never been very excited by them, since they complicate random access to index pages, make for longer execution paths, require scraping data for one logical thing from many places, and so on. Anyway, now we will compress pages before writing them to disk, so the cache is in machine byte order and alignment while the disk is compressed. Since multiple processors are commonplace on servers, they can well be used for compression, that being such a nicely local operation, all in cache and requiring no serialization with other things.
Of course, what was fixed length now becomes variable length, but if the compression ratio is fairly constant, we reserve space for the expected compressed size, and deal with the rare overflows separately. So no complicated shifting data around when something grows.
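The reserved-slot idea can be sketched as follows; the 2:1 ratio, page size, and use of zlib are assumptions for illustration, not Virtuoso's actual parameters:

```python
import zlib

PAGE_SIZE = 8192
RESERVED = 4096  # space reserved per page, assuming a roughly 2:1 ratio

def compress_page(page):
    """Compress a page before writing. Pages that fit go into their fixed
    reserved slot; rare overflows are handled out of line, so nothing has
    to be shifted around when a page grows."""
    c = zlib.compress(page)
    if len(c) <= RESERVED:
        return ("slot", c.ljust(RESERVED, b"\0"))
    return ("overflow", c)

tag, data = compress_page(b"spo" * (PAGE_SIZE // 3))
print(tag, len(data))
```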
Once we are done with this, this could well be a separate intermediate release.
Now about clusters. We have for a long time had various plans for clusters but have not seen the immediate need for execution. With the rapid growth in the Linking Open Data movement and questions on web scale knowledge systems, it is time to get going.
How will it work? Virtuoso remains a generic DBMS, thus the clustering support is an across the board feature, not something for RDF only. So we can join Oracle, IBM DB2, and others at the multi-terabyte TPC races.
We introduce hash partitioning at the index level and allow for redundancy, where multiple nodes can serve the same partition, allowing for load balancing read and replacement of failing nodes and growth of cluster without interruption of service.
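Hash partitioning with replica groups might be sketched like this (node names, group layout, and the CRC hash are invented for illustration):

```python
import zlib

def replica_group(key, node_groups):
    """Hash-partition a key; each partition is served by several nodes so
    reads can be balanced and a failed node can be replaced."""
    h = zlib.crc32(repr(key).encode())
    return node_groups[h % len(node_groups)]

# Hypothetical 4-partition cluster with 2 replicas per partition.
groups = [["node1", "node2"], ["node3", "node4"],
          ["node5", "node6"], ["node7", "node8"]]

replicas = replica_group(("S", 42), groups)  # same key, same group, always
print(replicas)
```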
The SQL compiler, SPARQL, and database engine all stay the same. There is a little change in the SQL run time, not so different from what we do with remote databases at present in the context of our virtual database federation. There is a little extra complexity for distributed deadlock detection and sometimes multiple threads per transaction. We remember that one RPC round trip is worth 3-4 index lookups, so we pipeline things so as to move requests in batches, a few dozen at a time.
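The batching can be sketched as follows. This is only illustrative; the real pipelining happens inside the SQL run time, but the economics are the same: one round trip amortized over a few dozen lookups instead of one round trip per lookup.

```python
def batched_lookup(keys, remote_lookup, batch_size=32):
    """Ship index lookups to a remote partition in batches so one RPC
    round trip (worth several index lookups of latency) is paid once per
    batch_size keys rather than once per key.

    remote_lookup(batch) is assumed to return a dict of key -> result."""
    results = {}
    batch = []
    for k in keys:
        batch.append(k)
        if len(batch) == batch_size:
            results.update(remote_lookup(batch))
            batch = []
    if batch:                      # flush the final partial batch
        results.update(remote_lookup(batch))
    return results
```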
The cluster support will be in the same executable and will be enabled by configuration file settings. Administration is limited to one node, but Web and SQL clients can connect to any node and see the same data. There is no balancing between storage and control nodes because clients can simply be allocated round robin for statistically even usage. In relational applications, as exemplified by TPC-C, if one partitions by fields with an application meaning (such as warehouse ID), and if clients have an affinity to a particular chunk of data, they will of course preferentially connect to nodes hosting this data. With RDF, such affinity is unlikely, so nodes are basically interchangeable.
In practice, we develop in June and July. Then we can rent a supercomputer maybe from Amazon EC2 and experiment away.
We should just come up with a name for this. Maybe something astronomical, like star cluster. Big, bright, but in this case not far away.
The IN predicate with a list of values will now use an index if available. This is useful for SPARQL queries with multiple FROM graphs, for example. We now compile SELECT * FROM graph WHERE {?s ?p ?o} into a UNION of SELECT *'s from multiple tables of different width. Each term of the UNION will simply produce multiple 3-column result rows for each actual row while not having to run through the tables multiple times. Together with this, we have also fixed a number of things with the relational-to-RDF mapping. We have been testing this extensively with the MusicBrainz mapping by Fred Giasson. These changes are small and to be released shortly.
There are also some larger things in the works, to be released during this summer; the next post gives an overview of these.
But these sides do not meet very well, because SPARQL does not support aggregation and grouping, which are the very basis of faceted browsing.
So we have looked at possible solutions. The first and most obvious is to add SQL style aggregation to SPARQL. We do this already since one can embed SPARQL into SQL but a SPARQL-only syntax for this would be nice, so as to avoid the need for a SQL client login and so as to use the SPARQL protocol.
The first part of the solution is to allow aggregates in a SPARQL select. The aggregates are directly inherited from SQL, meaning count, min, max, sum, avg, with an optional distinct modifier. All terms of a select that are not aggregates are considered grouping columns, so one gets an implicit group by when combining variables and aggregates in a single select.
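The implicit grouping rule can be illustrated with a toy in Python terms. Rows stand for SPARQL variable bindings, and the non-aggregate select terms become the grouping columns, here for a count() aggregate (names are illustrative, not any actual API):

```python
from collections import defaultdict

def implicit_group_by(rows, group_vars):
    """rows: dicts of SPARQL variable bindings.

    All select terms that are not aggregates act as grouping columns, so
    mixing variables and a count() in one select implies GROUP BY over
    those variables. Returns a dict of group key tuple -> count."""
    counts = defaultdict(int)
    for row in rows:
        counts[tuple(row[v] for v in group_vars)] += 1
    return dict(counts)
```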
This is straightforward and is already a big improvement. But for very large result sets in a browsing situation, we can do something better.
Let's think about DBpedia as an example. We have data about people: things like names, birth/death dates, birthplaces, text descriptions. Some drill-down could be supported with a traditional OLAP cube. But in the semantic web situation, the number of dimensions can be high and the dimensions will usually be sparse. Besides, the number of dimensions is liable to change. Precomputed indices of everything are not the best choice here even though for well-defined analytics they are fine.
For an overview of the data, we can precompute some things. For example, the set of distinct graphs and their respective triple counts, also for each graph/predicate combination. These do not have to be absolutely up-to-date, and provide a quick first level of directory. Also, the count of instances of all classes is a candidate for precomputing. Same with count of triples grouped by class of S, graph.
Once we start talking about specific Ss or non-class Os, it should be either possible to count the triples or count them up to a maximum. A count aggregate function that stops the query when the count reaches a certain maxc might be useful. This would give small counts with precision but for larger cardinalities it would simply say that it is more than a given limit. In some cases this can be extrapolated to an actual count with fair precision but not always.
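A ceiling-limited count as described might look like the sketch below (names assumed). It returns the count plus a flag saying whether the ceiling cut the scan short:

```python
def count_up_to(matches, maxc):
    """Count matches but stop scanning once maxc is reached.

    Returns (count, exact): exact is True when the scan finished, False
    when we hit the ceiling and can only say "more than maxc"."""
    n = 0
    for _ in matches:
        n += 1
        if n >= maxc:
            return n, False
    return n, True
```

Small groups get precise counts; large ones cost at most maxc scan steps.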
The more factors are given, the faster it is to count. For example, if we have a query that looks for an S with P1 = O1 and P2 = O2, this is a merge intersection which is very quick, especially if the intersection is a small fraction of the number of rows.
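The merge intersection over two sorted posting lists of subjects (one for P1 = O1, one for P2 = O2) is a single forward pass:

```python
def merge_intersect(a, b):
    """Intersect two sorted lists of subject ids in one pass.

    Each step advances the list with the smaller current value, so the
    cost is linear in the shorter inputs and very fast when the
    intersection is a small fraction of the rows."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```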
We can also look at the use of full text conditions with browsing. This makes precomputed counts pretty much useless once there is a text condition in the mix since a text condition is not something that can be mapped on a dimension of a cube. In this situation, it would seem appropriate to count to a certain maximum and then stop. Finding the first few matches with an AND of text and equalities on RDF graph objects or subjects is always quick. Depending on the case, one may even scale the count upwards to an estimate by looking at how far one got on the outermost loop of the joins before reaching the count ceiling. There is a little more to this since all joins are not nested loops but this is the general idea.
Relational DBMS often use random sampling of tables for optimization statistics. With RDF, this does not seem to work so well. Or rather, the usual types of stats are quite useless if all triples are in the same table. This is why we take actual samples at optimize time whenever leading key parts have literal values. Further, we take a small random sample of all triples and remember the distinct P and G values, as these are of low cardinality with highly uneven distribution. This gives us a fairly reliable idea of the relative frequencies of the more common Gs and Ps. This is good for query optimization but not necessarily for browsing. Good enough for sorting by frequency but not quite good enough for showing counts.
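A toy version of the sampling step, with quads as (S, P, O, G) tuples of numbers as elsewhere in this discussion (function and parameter names are assumed):

```python
import random
from collections import Counter

def sample_frequencies(quads, sample_size, seed=0):
    """Take a small random sample of quads and tally the distinct P and G
    values seen. Because P and G are low-cardinality with a very uneven
    distribution, a small sample gives a fair picture of the relative
    frequencies of the common ones: good enough for join ordering and for
    sorting facets by frequency, not for showing exact counts."""
    rnd = random.Random(seed)
    sample = rnd.sample(quads, min(sample_size, len(quads)))
    p_freq = Counter(p for _, p, _, _ in sample)
    g_freq = Counter(g for _, _, _, g in sample)
    return p_freq, g_freq
```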
Anyway, we will start by introducing the basic SQL aggregates and then add a thing for limiting how much gets counted. This will be precise for small counts and give an order of magnitude for larger counts.
The updates involve:
Soon to follow are:
Existing databases will be automatically upgraded when started with the new Virtuoso 5.0 server. Note that after upgrade, the RDF data is not backward compatible.
We will be rolling out more Virtuoso hosted semantic web content in the Linking Open Data project, part of our participation in the Semantic Web Education and Outreach activity at W3C.
I was at the ISWC 2006 conference a week back. One of the items discussed there, at least informally, was the topic of SPARQL end point discovery. I have put together below a summary of the points that were discussed and of my own views on their possible resolution.
This is intended as a start for conversation and as a summary of ideas.
Self-description of end points may serve at least the following purposes:
We will look at each one in turn.
The end point should give a ballpark cardinality for the following combinations of G, S, P, O.
Based on our experience, these are the most interesting questions, but for completeness, the entry point might offer an API allowing one to specify a constant or a wildcard for each of the four parts of a quad. If the information is not readily available, "unknown" could be returned, together with the count of triples in the whole end point or the graph, if the graph is specified. Even if the end point does not support real-time sampling of data for cardinality estimates, it would at least have an idea of the count of triples per graph, which is still far better than nothing.
Given the full SPARQL request, the end point could return the following data, without executing the query itself.
All these elements would be optional.
This somewhat overlaps with the optimization questions but it may still be the case that it is more efficient to support a special interface for the optimization related questions.
An RDF benchmark suite should meet the following criteria:
The query load should illustrate the following types of operations:
If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to TPC C. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.
The report contains basic TPC-like items such as:
These can go into a summary spreadsheet that is just like the TPC ones.
Additionally, the full report should include:
OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads. For example, 10000 users with 100 threads, each user with its own state, so that they carry out their respective usage patterns independently, getting served as soon as the server is available, still having no more than m requests going at any time. The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart etc. This can be modified to browse a social network database and produce the desired query mix. This generates HTTP requests, hence would work against a SPARQL end point or any set of dynamic web pages.
The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.
This can be packaged as a separate open source download once the test spec is agreed upon.
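The users-over-threads multiplexing of such a test driver can be sketched in Python (the real driver is a multithreaded C program; names here are illustrative). Each user identity runs its own session exactly once, while no more than m worker threads are active at any time:

```python
import queue
import threading

def run_users(n_users, m_threads, user_session):
    """Simulate n independent users multiplexed over m threads.

    Each user id is queued once; m workers pull ids and run the per-user
    usage pattern (check mail, browse, add to cart, ...), so at most m
    requests are in flight while all n users eventually get served."""
    pending = queue.Queue()
    for uid in range(n_users):
        pending.put(uid)

    def worker():
        while True:
            try:
                uid = pending.get_nowait()
            except queue.Empty:
                return
            user_session(uid)  # the user's own usage pattern

    workers = [threading.Thread(target=worker) for _ in range(m_threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```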
For generating test data, a modification of the LUBM generator is probably the most convenient choice.
This area is somewhat more complex than triple storage.
At least the following factors enter into the evaluation:
The rationale for mapping relational data to RDF is often data integration. Even in simple cases like the OpenLink Data Spaces applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.
A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.
A real world case is OpenLink's ongoing work for mapping WordPress, Mediawiki, phpBB, Drupal, and possibly other popular web applications into SIOC.
Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.
Another "enterprise style" scenario might be to take the TPC C and TPC D databases (after all, both have products, customers, and orders) and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.
Considering the times and the audience, the WordPress/Mediawiki scenario might be culturally more interesting and more fun to demo.
The test has two aspects: throughput and coverage. I think these should be measured separately.
The throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."
Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known only at run time, or if the graph is not given, we get a union of everything joined with another union of everything; many of the joins between the terms of the different unions are identically empty, but the software may not know this.
In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".
It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.
The resulting database is 75G, or 74 bytes per triple, which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly-filled pages. See Advances in Virtuoso RDF Triple Storage for an in-depth treatment of the subject.
The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range of the whole key. But before we go to so much trouble, we'll look at dropping a couple of critical sections from index random access. Also, some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still, we should not waste too much time on platform specifics. We'll see.
We just updated the Virtuoso Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.
We are also now applying the relational to RDF mapping discussed in Declarative SQL Schema to RDF Ontology Mapping to the ODS applications.
There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.
We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the cost model that there is now. A more systematic review of the cost model for SPARQL will come when we get to the queries.
We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the SQL engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.
On the applications side, we have been keeping up with the SIOC RDF ontology developments. All ODS applications now make their data available as SIOC graphs for download and SPARQL query access.
What is most exciting however is our advance in mapping relational data into RDF. We now have a mapping language that makes arbitrary legacy data in Virtuoso or elsewhere in the relational world RDF query-able. We will put out a white paper on this in a few days.
Also, we have some innovations in mind for optimizing the physical storage of RDF triples. We keep experimenting, now with our sights set on the high end of triple storage, towards billion-triple data sets. We are experimenting with a new, more space-efficient index structure for better working set behavior. Next week will yield the first results.
This marks the coming of age of our RDF and SPARQL efforts. We have the new SQL cost model with SPARQL awareness, and we have applications which present much of their data as SIOC, FOAF, ATOM OWL, and other formats.
We continue refining these technologies. Our next roadmap item is mapping relational data into RDF and offering SPARQL access to relational data without data duplication. Expect a white paper about this soon.
This paper summarizes how we have extended Virtuoso's SQL and database engine to better accommodate storing RDF triples and optimizing queries of RDF data. This is the first of a series. The next will concern mapping relational databases onto RDF ontologies for SPARQL access.
This paper concerns the next Virtuoso Open Source release, to be available for download a few days from this posting.
Microsoft quite explicitly states that their thrust is to decouple the client side representation of data as .NET objects from the relational schema on the database. This is a worthy goal.
But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: towards object oriented databases (OODBMS) and towards making applications for the semantic web.
In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going, as it were, in the opposite direction from Microsoft's intended decoupling. For example, we could do typical OODBMS tricks such as pre-fetch of objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together; if all objects are modeled as subclasses (sub-tables) of a common superclass, then, regardless of instance type, what is created together (has consecutive IDs) will indeed tend to cluster on the same page. These tricks can deliver good results in very navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback.
But what is more interesting and more topical in the present times is making clients for the RDF world. There, the OWL ontology could be used to make the .NET classes, and the DBMS could, when returning URIs serving as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as "proxies" of these RDF objects. Of course, only predicates for which the client has a representation are relevant, thus some client-server handshake is needed at the start. The data that could be pre-fetched is the intersection of a concise bounded description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type is not known or inferable could be left out or represented as a special class with name-value pairs for its attributes; same with blank nodes.
In this way, .NET's considerable UI capabilities could directly be exploited for visualizing RDF data, only given that the data complies reasonably well with a known ontology.
If a SPARQL query returned a result-set, IRI type columns would be returned as .NET instances and the server would pre-fetch enough data for filling them in. For a CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an Entity SQL string, these could possibly be specialized to allow for a SPARQL string instead. LINQ might have to be extended to allow for SPARQL type queries, though.
Many of these questions will be better answerable as we get more details on Microsoft's forthcoming ADO .NET release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.
The new SQL compiler takes samples of index population on demand, thus it always works with up-to-date statistics. Further, when there are constant leading key parts, it can get an estimate of the selectivity of the constant criteria with a single lookup.
This is especially important for processing RDF. Since all triples go to one table unless otherwise declared, normal SQL statistics are not very useful for determining the join order for a SPARQL query. However, nearly always, SPARQL queries have a constant graph, constant predicate, sometimes constant subjects and objects. For example, using the index P, G, O, S, the compiler can know how many triples will have a given predicate within a given graph. This is done with a single lookup, without needing to count the actual triples, which would defeat the purpose. Also there is no need to do periodic statistics collection runs or to maintain counts of distinct combinations for multiple key parts. This makes for virtual certainty of getting reasonable join orders even for recently inserted or fast changing data sets.
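The single-lookup estimate can be illustrated with a sorted in-memory index standing in for the P, G, O, S key (names assumed; quads as tuples of numeric ids, as in the rest of this discussion). A real B-tree would estimate the row count from the positions of the two bounding leaf pages rather than doing exact bisection, but the principle is the same:

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

def estimate_matching(index, prefix):
    """index: sorted list of quads, key order (P, G, O, S).

    Two positioned lookups bound the range of rows whose leading key
    parts equal prefix, e.g. a constant predicate and graph; the span
    between them is the selectivity estimate, obtained without counting
    the actual rows."""
    pad = 4 - len(prefix)
    lo = bisect_left(index, prefix + (-INF,) * pad)
    hi = bisect_right(index, prefix + (INF,) * pad)
    return hi - lo
```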
This will be part of the next Virtuoso Open Source release, probably in the next couple of weeks. There will also be a technical article with examples of how the dynamic statistics feature helps with RDF queries.
This release also enhances the SPARQL support with better inlining of SPARQL in SQL and other features and fixes.
The launch of Virtuoso Open Source Edition (VOS) marks a new period in our participation in the database world. We will henceforth be much more active, publish much more material, have a faster release cycle, and actively reach out to the various areas of the open source community.
We have years worth of demos, white papers, articles, a suite of Virtuoso based applications, and much more that we will be unveiling over the following months.
We will track different aspects of Virtuoso work on this and related blogs. In the middle term, we will talk about the following:
There is a whole suite of next generation file server features to be unveiled. These include items such as automatic metadata extraction and logical views on content based on its metadata, permissions etc.
In the immediate future, we will:
The VOS development CVS will be updated at high frequency, in some areas even weekly. Stable snapshots will be made available 3 or 4 times a year.
We will have a very exciting spring, with radically more participation in the database and open source worlds than ever. Look for frequent updates on this blog.