Provenance and Reification in Virtuoso

These days, data provenance is a big topic across the board, ranging from the linked data web, to RDF in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front.

RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.

For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level.

If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph.
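
For illustration, the annotations graph might hold triples like the following Turtle sketch, where the graph IRIs and the xx: annotation properties are made up for the example:

@prefix xx:   <http://example.com/annotations#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Everything loaded from one source went into one graph; that graph's IRI is now
# the subject of the annotations, which live in a separate annotations graph.
<http://example.com/graphs/library-catalogue>
    xx:loaded_from    <http://library.example.org/export.rdf>  ;
    xx:has_publisher  <http://library.example.org/#org>        ;
    xx:load_time      "2009-08-20T10:00:00"^^xsd:dateTime      .

The query below joins in exactly this way, from a graph variable to its xx:has_publisher annotation.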

On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.

One could write queries like —

SELECT  ?pub 
  WHERE 
    { 
      GRAPH  ?g 
        { 
          ?person  foaf:knows  ?contact 
        } 
      ?contact  foaf:name         "Alice"  . 
      ?g        xx:has_publisher  ?pub 
    }

This would return the publishers of graphs that assert that somebody knows Alice.

Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.

If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.

Virtuoso's RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G,S,P,O.

Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.
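
A sketch of what this could look like in Virtuoso SQL follows; the column names are hypothetical, and the exact ALTER TABLE form may differ between versions:

-- Add dependent columns for the inserting user and a timestamp.
ALTER TABLE DB.DBA.RDF_QUAD ADD Q_USER IRI_ID_8;
ALTER TABLE DB.DBA.RDF_QUAD ADD Q_TIME DATETIME;

-- Annotate one specific quad, identified by its G, S, P, O.
UPDATE DB.DBA.RDF_QUAD
   SET Q_USER = IRI_TO_ID ('http://example.com/users/alice'),
       Q_TIME = now ()
 WHERE G = IRI_TO_ID ('http://example.com/graphs/g1')
   AND S = IRI_TO_ID ('http://example.com/person/1')
   AND P = IRI_TO_ID ('http://xmlns.com/foaf/0.1/knows')
   AND O = IRI_TO_ID ('http://example.com/person/2');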

In SPARQL, one could use an extension syntax like —

SELECT  * 
  WHERE 
    { ?person      foaf:knows  ?connection 
                   OPTION ( time  ?ts )     . 
      ?connection  foaf:name   "Alice"      . 
      FILTER ( ?ts > "2009-08-08"^^xsd:dateTime ) 
    }

This would return everybody recorded as knowing Alice later than 2009-08-08. It presupposes that the quad table has been extended with a datetime column.

The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause.

If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like —

SELECT * 
  WHERE 
    { ?person   foaf:knows         ?contact 
                OPTION ( reify  ?r )          . 
      ?r        xx:assertion_time  ?ts       . 
      ?contact  foaf:name          "Alice"   . 
      FILTER ( ?ts > "2008-08-08"^^xsd:dateTime ) 
    }
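
The side table itself might be declared roughly as follows; this is a sketch with made-up names, leaving out the partitioning declarations:

-- One row per reified quad; R is the reification IRI that annotation
-- triples then use as their subject.
CREATE TABLE reif_quad ( G  IRI_ID_8,
                         S  IRI_ID_8,
                         P  IRI_ID_8,
                         O  ANY,
                         R  IRI_ID_8,
                         PRIMARY KEY ( G, S, P, O ) );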

We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don't think there would be huge demand. Who knows? You tell us.

09/01/2009 10:44 GMT-0500 Modified: 09/01/2009 11:20 GMT-0500
Web Scale and Fault Tolerance

One concern about Virtuoso Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale Linked Data Web. We will see how large scale, continuous operation, and redundancy are related.

It has been said many times — when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers.

Approaches to Fault Tolerance

Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of write speed.

With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.

There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.

In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.

This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application's requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.

Basics of Partition Fail-Over

For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.

The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.

Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.

When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice go via two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled — but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.

For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts, but the first kind can wait. Since the system must essentially run itself, it will wait for some time for the failed server to restart. During this window, all reads of the failed partition go to the surviving copy and writes give an error. If the failed server does not come back up in time, the system automatically re-enables writes on the surviving copy, but then the failed server can no longer rejoin the cluster without a complex sync cycle. This can all happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.

If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.

As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.

Levels of Tolerance and Performance

The only way to make failures totally invisible is to have everything in duplicate, provisioned so that the system never runs at more than half its total capacity. This is often neither economical nor necessary, which is why we aim to do better, using the spare capacity for more than standby.

Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.

The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but, when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads of the first half of each range (IDs 64K - 96K) to the first copy and reads of the second half (96K - 128K) to the second copy. In this way, the copies keep different working sets, and the RAM is used to full advantage.

Of course, if there is a failure, then the working set will degrade, but if this does not happen often or for long, it can be quite tolerable. The alternative expense is buying twice as much RAM, which likely means twice as many servers. This workload is memory intensive, so servers should have as much memory as they can take without resorting to parts so expensive that doubling the memory costs as much as a new server.

Background Bulk Processing

When loading data, the system is online in principle, but query response can be quite bad. A large RDF load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server's partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference.

But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving online activity also do some updates, these updates will be committed on both copies, but the bulk load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up-to-date state, including changes that came in during the bulk load. The online activity can now be redirected to the second copies, and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.

Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.

This technique applies to all data intensive background tasks — calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.

Configurations of Redundancy

Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others; the members of the group are interchangeable. Thus, if there is four-servers-worth of data, there will be a minimum of eight servers, each running one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, i.e., the chance that it falls on the server whose pair is already down. If groups consist of three servers, for a total of 12, the first two failures are guaranteed not to interrupt operations; for the third, there is at most a 1/10 chance that it will.

We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers' RAM when in normal operation.

There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon's Dynamo does something similar at the database level. The analogies are not exact, though.

If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates of a slice never sit on the same physical server, the first failure will not break operations but the second probably will. Without extra logic, the surviving copies of the slices formerly hosted by the failed server are spread randomly over the remaining servers, so a second failure will most likely take out the only remaining copy of some slice. This scheme equalizes load better but is less resilient.

Maintenance and Continuity

Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.

Present Status

The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.

04/01/2009 10:18 GMT-0500 Modified: 04/01/2009 11:18 GMT-0500
Beyond Applications - Introducing the Planetary Datasphere (Part 2)

We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.

Application Hosting

For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.

Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.

What of Proprietary Data and its Security?

Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one's own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes.

What of Individual Privacy on the Open Web?

The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user's department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.

I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer's capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.

Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.

It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner's social network. Defining such policies is typically a task for the application developer.
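
As a purely illustrative sketch, not a description of a shipping feature, such a graph-level policy can be thought of as an extra graph pattern that the query processor wraps around user queries; the pol: vocabulary, the policy graph, and the OpenID below are invented for the example:

PREFIX pol: <http://example.com/policy#>

SELECT ?s ?p ?o
WHERE
  {
    # The policy graph states which data graphs a given identity may read.
    GRAPH <http://example.com/policy>
      { ?g  pol:readableBy  <http://openid.example.org/alice> }
    # Only those graphs are visited for the actual data pattern.
    GRAPH ?g
      { ?s  ?p  ?o }
  }

The user never writes this restriction; the policy layer adds it behind the scenes, so even intermediate results that are never shown to the user stay within the granted graphs.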

The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.

The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one's disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.

Federation vs. Centralization

On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this.

A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.

The cost for hosting a trillion triples will vary radically in function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies in function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data.

Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world's spectrum of query needs.

What is the Cost of Schema-Last?

The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.

We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.

While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.

Historically, our industry has been driven by two phenomena:

  1. New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.
  2. Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.

To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.

Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It's in the air.

DataSphere Precursors

Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.

We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.

We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.

Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.

Databases and Servers

Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere.

This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards.

The workload is all about large joins across partitions. The queries parallelize well, so using the largest and most expensive machines as building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSDs get a bit cheaper, one could have SSD for the database and disk for backup.

With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.

For operating such systems, scaling-on-demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these matters are complex and easily intractable without an in-depth understanding of the field.

The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.

Conclusions and Next Steps

The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.

Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.

The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are, however, not at all what is discussed at Semantic Web conferences. They have to do with the complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess, but not impossible. The details that remain to be worked out are nothing semantic; they are hardcore database matters, concerning automatic provisioning and the like.

It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is — go for it. Stay tuned for more announcements, as we near production with our next generation of software.

03/25/2009 10:50 GMT-0500 Modified: 03/25/2009 12:31 GMT-0500
"E Pluribus Unum", or "Inversely Functional Identity", or "Smooshing Without the Stickiness" (re-updated)

What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable.

This is complex, so I will begin with the point; the interested may read on for the details and implications. Starting with the soon-to-be-released version 6, Virtuoso allows you to say that two things are the same if they share a uniquely identifying property. Examples of uniquely identifying properties would be a book's ISBN, or a person's social security number plus full name. In relational language this is a unique key; in RDF parlance, an inverse functional property.

In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut.

In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set.

On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems.

To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as "same enough," but a query about criminal prosecution would obviously need to be more precise about sameness.

Our ontology is defined like this:

-- Populate a named graph with the triples you want to use in query time inferencing
ttlp ( '
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    foaf:mbox_sha1sum  a  owl:InverseFunctionalProperty .
    foaf:name          a  owl:InverseFunctionalProperty .
  ', 'xx', 'b3sifp' );
-- Declare that the graph contains an ontology for use in query time inferencing
rdfs_rule_set ( 'http://example.com/rules/b3sifp#', 'b3sifp' );

Then use it:

sparql 
   DEFINE input:inference "http://example.com/rules/b3sifp#" 
   SELECT DISTINCT ?k ?f1 ?f2 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  "'Kjetil Kjernsmo'"  . 
           ?k   foaf:knows    ?f1                  . 
           ?f1  foaf:knows    ?f2 
         };
VARCHAR                                  VARCHAR                                           VARCHAR
________________________________________ _________________________________________________ ______________________________
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....

Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person's FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same.

Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both.

If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT.
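
A small made-up example may help. Suppose two FOAF files are loaded into separate graphs:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Graph <http://example.com/foaf/a>
_:a  foaf:name   "Alice Example" ;
     foaf:knows  _:b .

# Graph <http://example.com/foaf/b>
_:x  foaf:name      "Alice Example" ;
     foaf:interest  <http://www.w3.org/RDF/> .

With the b3sifp context in force, SELECT DISTINCT ?s WHERE { ?s foaf:name "Alice Example" } returns a single row, and that one subject is seen with both the foaf:knows and the foaf:interest statements. A naive smoosh would copy the properties across but would still return two rows for the DISTINCT.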

If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable — nothing like the blow-up of duplicating properties across instances.
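
A sketch of such a materialization in Virtuoso SPARUL, with an invented target graph and a deliberately simple matching rule (same foaf:name), might look like this:

sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
INSERT INTO GRAPH <http://example.com/sameas>
  { ?a  owl:sameAs  ?b }
WHERE
  { ?a  a          foaf:Person  ;
        foaf:name  ?n           .
    ?b  a          foaf:Person  ;
        foaf:name  ?n           .
    FILTER ( ?a != ?b )          # a real rule would be stricter than a bare name match
  };

Queries that want this notion of identity then include the sameas graph; queries that do not simply leave it out, so the original data stays uncontaminated.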

The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others.

The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at.

The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty.

Enough talk. Time for an experiment. We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:

  1. Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person.
  2. Give every duplicate URI of a person all the properties of all the duplicates. If the duplicates mostly assert the same values, the data should not get much bigger, or so we think.

For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant.

The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse.
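
For instance, a quick round trip through these two functions (not part of the script proper) looks like this:

SELECT ID_TO_IRI (IRI_TO_ID ('http://xmlns.com/foaf/0.1/name'));
-- returns the IRI string back: http://xmlns.com/foaf/0.1/name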

We count the triples that enter into the smoosh:

-- the name is an existence because else we'd get several times more due to 
-- the names occurring in many graphs 
sparql 
  SELECT COUNT(*) 
   WHERE { { SELECT DISTINCT ?person WHERE { ?person a foaf:Person } } . 
           FILTER ( bif:exists ( SELECT (1) WHERE { ?person foaf:name ?nn } ) ) . 
           ?person ?p ?o 
         };
-- We get 3284674

We make a few tables for intermediate results.

-- For each distinct name, gather the properties and objects from 
-- all subjects with this name 
CREATE TABLE name_prop ( np_name  ANY, 
                         np_p     IRI_ID_8, 
                         np_o     ANY, 
                         PRIMARY KEY ( np_name, np_p, np_o ) );
ALTER INDEX name_prop ON name_prop PARTITION ( np_name VARCHAR (-1, 0hexffff) );

-- Map from name to canonical IRI used for the name
CREATE TABLE name_iri ( ni_name  ANY PRIMARY KEY, 
                        ni_s     IRI_ID_8 );
ALTER INDEX name_iri ON name_iri PARTITION ( ni_name VARCHAR (-1, 0hexffff) );

-- Map from person IRI to canonical person IRI
CREATE TABLE pref_iri ( i     IRI_ID_8, 
                        pref  IRI_ID_8, 
                        PRIMARY KEY ( i ) );
ALTER INDEX pref_iri ON pref_iri PARTITION ( i INT (0hexffff00) );

-- A table for the materialization where all aliases get all properties of every other
CREATE TABLE smoosh_ct ( s  IRI_ID_8, 
                         p  IRI_ID_8, 
                         o  ANY, 
                         PRIMARY KEY ( s, p, o ) );
ALTER INDEX smoosh_ct ON smoosh_ct PARTITION ( s INT (0hexffff00) );

-- Disable transaction log and enable row auto-commit. This is necessary; otherwise
-- bulk operations are done transactionally and they will run out of rollback space.
LOG_ENABLE (2);

-- Gather all the properties of all persons with a name under that name.
-- INSERT SOFT means that duplicates are ignored
INSERT SOFT name_prop 
  SELECT "n", "p", "o" 
    FROM ( sparql DEFINE output:valmode "LONG" 
           SELECT ?n ?p ?o WHERE { ?x a foaf:Person . ?x foaf:name ?n . ?x ?p ?o } ) xx ;

-- Now choose for each name the canonical IRI
INSERT INTO name_iri 
  SELECT np_name, 
         ( SELECT MIN (s) FROM rdf_quad 
            WHERE o = np_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ) AS mini 
    FROM name_prop 
   WHERE np_p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;

-- For each person IRI, map to the canonical IRI of that person
INSERT SOFT pref_iri (i, pref) 
  SELECT s, ni_s 
    FROM name_iri, rdf_quad 
   WHERE o = ni_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;

-- Make a graph where all persons have one IRI with all the properties of all aliases
-- and where person-to-person refs are canonicalized
INSERT SOFT rdf_quad (g, s, p, o) 
  SELECT IRI_TO_ID ('psmoosh'), ni_s, np_p, 
         COALESCE ( ( SELECT pref FROM pref_iri WHERE i = np_o ), np_o ) 
    FROM name_prop, name_iri 
   WHERE ni_name = np_name 
  OPTION ( loop, quietcast ) ;

-- A little explanation: The properties of names are copied into rdf_quad with the name
-- replaced with its canonical IRI. If the object has a canonical IRI, this is used as
-- the object, else the object is unmodified. This is the COALESCE with the sub-query.

-- This takes a little time. To check on the progress, take another connection to the
-- server and do
STATUS ('cluster');

-- It will return something like
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i
-- buffers 549481 253929 d 8 w 0 pfs

-- Now finalize the state; this makes it permanent. Else the work will be lost on server
-- failure, since there was no transaction log
CL_EXEC ('checkpoint');
-- See what we got
sparql SELECT COUNT (*) FROM <psmoosh> WHERE {?s ?p ?o};
-- This is 2253102
-- Now make the copy where all have the properties of all synonyms. This takes so much
-- space we do not insert it as RDF quads, but make a special table for it so that we can
-- run some statistics. This saves time.
INSERT SOFT smoosh_ct (s, p, o) 
  SELECT s, np_p, np_o 
    FROM name_prop, rdf_quad 
   WHERE o = np_name AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- as above, INSERT SOFT so as to ignore duplicates
SELECT COUNT (*) FROM smoosh_ct;
-- This is 167360324
-- Find out where the bloat comes from
SELECT TOP 20 COUNT (*), ID_TO_IRI (p) FROM smoosh_ct GROUP BY p ORDER BY 1 DESC;

The results are:

54728777          http://www.w3.org/2002/07/owl#sameAs
48543153          http://xmlns.com/foaf/0.1/knows
13930234          http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512          http://xmlns.com/foaf/0.1/interest
11415867          http://xmlns.com/foaf/0.1/nick
6683963           http://xmlns.com/foaf/0.1/weblog
6650093           http://xmlns.com/foaf/0.1/depiction
4231946           http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629           http://xmlns.com/foaf/0.1/homepage
1776555           http://xmlns.com/foaf/0.1/holdsAccount
1219525           http://xmlns.com/foaf/0.1/based_near
305522            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965            http://xmlns.com/foaf/0.1/name
155131            http://xmlns.com/foaf/0.1/dateOfBirth
153001            http://xmlns.com/foaf/0.1/img
111130            http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930             http://xmlns.com/foaf/0.1/gender
48517             http://www.w3.org/2004/02/skos/core#subject
45697             http://www.w3.org/2000/01/rdf-schema#label
44860             http://purl.org/vocab/bio/0.1/olb

Now compare this with the predicate distribution of the smoosh with identities canonicalized:

sparql 
     SELECT COUNT (*) ?p 
       FROM <psmoosh> 
      WHERE { ?s ?p ?o } 
   GROUP BY ?p 
   ORDER BY 1 DESC 
      LIMIT 20;

Results are:

748311            http://xmlns.com/foaf/0.1/knows
548391            http://xmlns.com/foaf/0.1/interest
140531            http://www.w3.org/2000/01/rdf-schema#seeAlso
105273            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497             http://xmlns.com/foaf/0.1/name
48099             http://www.w3.org/2004/02/skos/core#subject
45179             http://xmlns.com/foaf/0.1/depiction
40229             http://www.w3.org/2000/01/rdf-schema#comment
38272             http://www.w3.org/2000/01/rdf-schema#label
37378             http://xmlns.com/foaf/0.1/nick
37186             http://dbpedia.org/property/abstract
34003             http://xmlns.com/foaf/0.1/img
26182             http://xmlns.com/foaf/0.1/homepage
23795             http://www.w3.org/2002/07/owl#sameAs
17651             http://xmlns.com/foaf/0.1/mbox_sha1sum
17430             http://xmlns.com/foaf/0.1/dateOfBirth
15586             http://xmlns.com/foaf/0.1/page
12869             http://dbpedia.org/property/reference
12497             http://xmlns.com/foaf/0.1/weblog
12329             http://blogs.yandex.ru/schema/foaf/school

We can drop the owl:sameAs triples from the count, which makes the bloat somewhat smaller, but the result is still tens of times larger than either the canonicalized copy or the initial state.

Now, when we try using the psmoosh graph, we still get different results from those obtained with the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. They exist:

sparql 
SELECT COUNT (*) 
   WHERE { ?s foaf:knows ?thing . 
           FILTER ( !bif:exists ( SELECT (1) 
                                   WHERE { ?thing foaf:name ?nn }
                                )
                  ) 
         };
-- 1393940

So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here.

Finally, we calculate the smoosh blow-up factors. We do not include owl:sameAs triples in the counts: the full smoosh has 167360324 - 54728777 = 112631547 triples, and the canonicalized graph has 2253102 - 23795 = 2229307.

select (167360324 - 54728777) / 3284674.0;
-- 34.290022997716059
select 2229307 / 3284674.0;
-- 0.678699621332284

So, to get a smoosh that is still not really equivalent to the original, multiply the original triple count by about 34 if every alias keeps its own IRI and gets every property, or by about 0.68 if synonyms are collapsed into a canonical IRI.

Making the smooshes does not take very long: a few minutes for the small one. Filling the smoosh_ct table took 33 minutes; inserting the big smoosh as actual RDF quads would take longer, perhaps a couple of hours. These runs were not optimally tuned, so the numbers serve only to show that smooshing takes time, probably more than is allowable in an interactive situation no matter how the process is optimized.

12/16/2008 14:14 GMT-0500 Modified: 12/16/2008 15:01 GMT-0500
Virtuoso Anytime: No Query Is Too Complex (updated)

A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL end points, should the linked data web gain serious mass and traffic.

As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we'll do if, for example, somebody decides to count all the triples in the set.

How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?

Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.

Here we are looking for the top 10 people whom people claim to know without being known in return, like this:

SQL> sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;
celeb                                    callret-1
VARCHAR                                  VARCHAR
________________________________________ _________
http://twitter.com/BarackObama           252
http://twitter.com/brianshaler           183
http://twitter.com/newmediajim           101
http://twitter.com/HenryRollins          95
http://twitter.com/wilw                  81
http://twitter.com/stevegarfield         78
http://twitter.com/cote                  66
mailto:adam.westerski@deri.org           66
mailto:michal.zaremba@deri.org           66
http://twitter.com/dsifry                65
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages
SQL> sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10;
celeb                                    callret-1
VARCHAR                                  VARCHAR
________________________________________ _________
http://twitter.com/JasonCalacanis        496
http://twitter.com/Twitterrific          466
http://twitter.com/ev                    442
http://twitter.com/BarackObama           356
http://twitter.com/laughingsquid         317
http://twitter.com/gruber                294
http://twitter.com/chrispirillo          259
http://twitter.com/ambermacarthur        224
http://twitter.com/t                     219
http://twitter.com/johnedwards           188
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages

The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better. But the response time was the same.

If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation.

One can also use this as a basis for paid services. The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.

This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.

12/11/2008 16:13 GMT-0500 Modified: 12/12/2008 10:29 GMT-0500
An Example of RDF Scalability

We hear it to exhaustion: where is RDF scalability? We have been suggesting for a while that this is a solved question. Here I will give some concrete numbers to back this up.

The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task.

Below we show how we do 3.3 million random triple lookups per second on two 8 core commodity servers producing complete results, joining across partitions. On a single 4 core server, the figure is about 1 million lookups per second. With a single thread, it is about 250K lookups per second. This is the good case. But even our worse case is quite decent.

We took a simple SPARQL query, counting how many people say they reciprocally know each other. In the Billion Triples Challenge data set, there are 25M foaf:knows quads of which 92K are reciprocal. Reciprocal here means that when x knows y in some graph, y knows x in the same or any other graph.

SELECT COUNT (*) 
WHERE { 
         ?p1  foaf:knows  ?p2  . 
         ?p2  foaf:knows  ?p1 
      }

There is no guarantee that the triple of x knows y is in the same partition as the triple y knows x. Thus the join is randomly distributed, n partitions to n partitions.

We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking. Since then, we have corrected this.

If run on a single thread, this query would be a loop over all the quads with a predicate of foaf:knows, and an inner loop looking for a quad with 3 of 4 fields given (SPO). If we have a partitioned situation, we have a loop over all the foaf:knows quads in each partition, and an inner lookup looking for the reciprocal foaf:knows quad in whatever partition it may be found.
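
For illustration, the same join written directly against the quad table would be the following sketch; it assumes the objects of foaf:knows are IRIs, stored as IRI IDs in the O column:

-- Outer scan over all foaf:knows quads, inner lookup with S, P, O given.
SELECT COUNT (*)
  FROM DB.DBA.RDF_QUAD a,
       DB.DBA.RDF_QUAD b
 WHERE a.P = IRI_TO_ID ('http://xmlns.com/foaf/0.1/knows')
   AND b.P = IRI_TO_ID ('http://xmlns.com/foaf/0.1/knows')
   AND b.S = a.O
   AND b.O = a.S;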

We have implemented this with two different message patterns:

  1. Centralized: One process reads all the foaf:knows quads from all processes. Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad. Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end.

  2. Symmetrical: Each process reads the foaf:knows quads in its partition, and sends a batch of checks to each process that could have the reciprocal foaf:knows quad every 50K quads. At the end, the counts are gathered from all partitions. There is some additional control traffic but we do not go into its details here.

Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 Virtuoso instances. The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache.

Centralized: 35,543 msec, 728,634 sequential + random lookups per second
Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s 316% cpu ...

Symmetrical: 7706 msec, 3,360,740 sequential + random lookups per second
Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s 1137% cpu ...

The second line is the summary from the cluster status report for the duration of the query. The interesting numbers are the KB/s and the %CPU. The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%. The point to note is that the symmetrical approach takes 4x less real time with under half the data transfer rate. Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation.

These figures represent the best and worst cases of distributed JOINing. If we have a straight sequence of JOINs with single-pattern OPTIONALs and existence tests, and the order in which intermediate results are produced is not significant (e.g., because the query ends in an aggregation, an existence test, or an ORDER BY), the symmetrical pattern is applicable. On the other hand, if there are multi-pattern OPTIONALs, complex sub-queries, DISTINCTs in the middle of the query, or results that have to be produced in the order of an index, then the centralized approach must be used at least part of the time.

Also, if we must compute transitive closures, which can be thought of as an extension of a DISTINCT in a subquery, we must pass the data through a single point before moving the bindings to the next JOIN in the sequence. This happens, for example, when resolving owl:sameAs at run time. However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, DISTINCTs, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens of thousands of variable bindings into a single message exchange. And if there are not that many intermediate results, then single-query execution time is not a problem anyhow.

For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the foaf:knows relations and then running the foaf:knows relations through the hash. If the hash size is right, a hash lookup is somewhat better than an index lookup. The problem is that when the hash join is not the right solution, it is an expensive mistake: the best case is good; the worst case is very bad. But if there is no index then hash join is better than nothing. One problem of hash joins is that they make temporary data structures which, if large, will skew the working set. One must be quite sure of the cardinality before it is safe to try a hash join. So we do not do hash joins with RDF, but we do use them sometimes with relational data.

These same methods apply to relational data just as well. This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.

# PermaLink Comments [0]
11/27/2008 11:23 GMT-0500 Modified: 12/01/2008 12:09 GMT-0500
State of the Semantic Web, Part 1 - Sociology, Business, and Messaging (update 2)

I was in Vienna for the Linked Data Practitioners gathering this week. Danny Ayers asked me if I would blog about the State of the Semantic Web or write the This Week's Semantic Web column. I don't have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to Virtuoso will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft.

Sören Auer asked me to say a few things about relational to RDF mapping. I will cite some highlights from this, as they pertain to the general scene. There was an "open hacking" session Wednesday night featuring lightning talks. I will use some of these too as a starting point.

The messaging?

The SWEO (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became "Data Unleashed." Why not? Let's give this some context.

So, if we are holding a Semantic Web 101 session, where should we begin? I would hazard that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world.

Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to make an account with myopenlink.net and see that, after entering the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like that, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our Billion Triples Challenge demo, and will make a webcast to make it all clearer.
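
As a sketch of the sort of lookup meant here (the name is a placeholder, and which predicates a profile ends up carrying depends on the data space), assuming the Virtuoso bif:contains free-text extension used elsewhere in this space:

SELECT DISTINCT ?person ?property ?value 
WHERE 
  { 
    ?person  a  foaf:Person . 
    ?person  foaf:name  ?name . 
    FILTER ( bif:contains (?name, "'Jane Doe'") ) . 
    ?person  ?property  ?value 
  } 
LIMIT 20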

Behold: The Data Web is about the world becoming a database; writing SPARQL queries or triples is incidental. You will write FOAF files by hand just as little as you now write SQL insert statements for filling in your account information on Myspace.

Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the RDBMS was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more ad hoc. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced.

Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity?

A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain.

However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic.

What is the business value of this?

The most reliable of human drives is the drive to make oneself known. This drives all, from any social scene to business communications to politics. Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent medium because business loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players; this was not welcome, but it still had to be dealt with by embracing the Web.

Today, in a world driven by data, the Data Web will be catalyzed by similar factors: if your data is not there, you will not figure in query results. Search engines will play some role there, but many social applications will also have reports driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between them, and what is now in silo data centers will be increasingly available on an ad hoc basis as Open Data.

Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with linked open data. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question.

We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say without being alarmist, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, governments add controls and regulations, and so forth. This means ad hoc data integration, as control without data is just pretense. If times are lean, there is also little readiness to do wholesale replacement of systems, which would take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically, mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times.

But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queriable information space, when you have a global, information-driven economy?

The Data Web is about making this happen. First by doing this with published, generally available data; next by enterprises keeping their private data for their own use but still linking toward the outside, even though private data stays private: you can still use standard terms and taxonomies, where they apply, when talking of proprietary information.

But let's get back to more specific issues.

At the lightning talks in Vienna, one participant said, "Man's enemy is not the lion that eats men, it's his own brother. Semantic Web's enemy is the XML Web services stack that ate its lunch." There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze.

SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the ad hoc component of the real-time enterprise, if you will. It is not a competitor to a transaction-oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche; doing transactions in RDF is this sort of over-stretching, without real benefit.

"I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it's too complex, the developer who says it is not what he's used to, and so on?"

This is an education question. One of the findings of SWEO's enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF-based technologies offer to make these more open-standards-based, and open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human-resources assets are: experts can't be found even when they are in the next department, and reports and analysis get buried in wikis, spreadsheets, and emails.

Just as when SQL took off, we need vendors to do workshops on getting started with the technology. The affair in Vienna was a step in this direction. Another type of event, especially focusing on vertical problems and their Data Web solutions, is a next step. For example, one could do a workshop on integrating supply-chain information with Data Web technologies. Or one on building enterprise knowledge bases from HR, CRM, office automation, wikis, etc. The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.

# PermaLink Comments [0]
10/24/2008 10:19 GMT-0500 Modified: 10/27/2008 11:28 GMT-0500
OpenLink Software's Virtuoso Submission to the Billion Triples Challenge

Introduction

We use Virtuoso 6 Cluster Edition to demonstrate the following:

  • Text and structured information based lookups
  • Analytics queries
  • Analysis of co-occurrence of features like interests and tags
  • Dealing with identity of multiple IRIs (owl:sameAs)

The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension.

The demo queries can also be run directly against the SPARQL end point.

The demo is being worked on at the time of submission and may be shown online by appointment.

Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008 the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and Open CYC ontologies.

Examples involving owl:sameAs are also being added, as are examples using similarity metrics and search-hit scores.

The Data

The database consists of the Billion Triples Challenge data sets and some additions like UMBEL. Also, the Freebase extract is newer than the challenge original.

The triple count is 1115 million.

In the case of web harvested resources, the data is loaded in one graph per resource.

In the case of larger data sets like DBpedia or the US Census, all triples from the same provenance share a data-set-specific graph.

All string literals are additionally indexed in a full text index. No stop words are used.
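
A minimal sketch of how such indexing is typically switched on in Virtuoso, assuming the stock free-text rule mechanism rather than whatever exact setup the demo database used:

-- index every string object literal in every graph (a sketch; per-graph rules are also possible)
DB.DBA.RDF_OBJ_FT_RULE_ADD (null, null, 'All');

-- bring the text index up to date after bulk loading
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();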

Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices.

The Queries

The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist, on one hand, of well-known SQL features like aggregation with grouping and existence and value subqueries, and on the other, of RDF-specific features. The latter include run-time RDFS and OWL inferencing with backward-chaining support for subclasses, subproperties, and transitivity.

Simple Lookups

sparql 
select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o)) 
where 
  {
    ?s ?p ?o . 
    filter (bif:contains (?o, "'semantic web'")) 
  } 
limit 10
;

This looks up triples with "semantic web" in the object and makes a search-hit summary of the literal, highlighting the search terms.

sparql 
select ?tp count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 a ?tp . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, "plaid_skirt")) 
  } 
group by ?tp
order by desc 2
limit 40
;

This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.

What are these things called?

sparql 
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 rdfs:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, "plaid_skirt")) 
  } 
group by ?lbl
order by desc 2
;

Many of these things do not have an rdfs:label. Let us use a more general concept of label which groups dc:title, foaf:name, and other name-like properties together. The subproperties are resolved at run time; there is no materialization.

sparql 
define input:inference 'b3s'
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 b3s:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, "plaid_skirt")) 
  } 
group by ?lbl
order by desc 2
;
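
The 'b3s' inference context behind this query is driven by subproperty declarations; a hedged sketch of how such a rule set could be put together (the schema graph IRI and the expansion of the b3s: prefix are assumptions, not the demo's actual definitions):

-- load the (assumed) name-like subproperty declarations into a schema graph ...
ttlp ('
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix dc:   <http://purl.org/dc/elements/1.1/> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix b3s:  <http://b3s.openlinksw.com/> .

  rdfs:label rdfs:subPropertyOf b3s:label .
  dc:title   rdfs:subPropertyOf b3s:label .
  foaf:name  rdfs:subPropertyOf b3s:label .
  foaf:nick  rdfs:subPropertyOf b3s:label .
', '', 'http://example.org/b3s-schema');

-- ... and declare that graph as the inference rule set named b3s
rdfs_rule_set ('b3s', 'http://example.org/b3s-schema');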

We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing.

sparql 
select ?g count(*) 
where 
  { 
    graph ?g 
      {
        ?s ?p ?o . 
        filter (bif:contains (?o, "'terrorist bombing'")) 
      }
  } 
group by ?g 
order by desc 2
;

Now for some Web 2.0 tagging of search results: the tag cloud of "computer".

sparql 
select ?lbl count (*) 
where 
  { 
    ?s ?p ?o . 
    ?o bif:contains "computer" . 
    ?s sioc:topic ?tg .
    optional 
      {
        ?tg rdfs:label ?lbl
      }
  }
group by ?lbl 
order by desc 2 
limit 40
;

This query will find the posters who talk the most about sex.

sparql 
select ?auth count (*) 
where 
  { 
    ?d dc:creator ?auth .
    ?d ?p ?o
    filter (bif:contains (?o, "sex")) 
  } 
group by ?auth
order by desc 2
;

Analytics

We look for people who are joined by having relatively uncommon interests but do not know each other.

sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 
where 
  {
    {
      select ?i count (*) as ?cnt 
      where 
        { ?p foaf:interest ?i } 
      group by ?i
    }
    filter ( ?cnt > 1 && ?cnt < 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter  (?p1 != ?p2 && 
             !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) && 
             !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  } 
order by ?cnt 
limit 50
;

The query takes a fairly long time, mostly spent counting interests over the 25M foaf:interest triples. It then takes people that share an interest and checks that neither claims to know the other. It then sorts the results rarest interest first. The query can be written more efficiently but is here just to show that database-wide scans of the population are possible ad hoc.

Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related-tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF.

create table 
tag_count (tcn_tag iri_id_8, 
           tcn_count int, 
           primary key (tcn_tag));
           
alter index 
tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table 
tag_coincidence (tc_t1 iri_id_8, 
                 tc_t2 iri_id_8, 
                 tc_count int, 
                 tc_t1_count int, 
                 tc_t2_count int, 
                 primary key (tc_t1, tc_t2));

alter index 
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

create index 
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));

How many times is each topic mentioned?

insert into tag_count 
  select * 
    from (sparql define output:valmode "LONG" 
                 select ?t count (*) as ?cnt 
                 where 
                   {
                     ?s sioc:topic ?t
                   } 
                 group by ?t) 
    xx option (quietcast);

Take all t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal ID of t1 is less than that of t2.

insert into tag_coincidence  (tc_t1, tc_t2, tc_count)
  select "t1", "t2", cnt 
    from 
      (select  "t1", "t2", count (*) as cnt 
         from 
           (sparql define output:valmode "LONG"
                   select ?t1 ?t2 
                     where 
                       {
                         ?s sioc:topic ?t1 . 
                         ?s sioc:topic ?t2 
                       }) tags
         where "t1" < "t2" 
         group by "t1", "t2") xx
    where isiri_id ("t1") and 
          isiri_id ("t2") 
    option (quietcast); 

Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster.

update tag_coincidence 
  set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
      tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);

Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing.
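
A hedged sketch of that select (the tag IRI is a placeholder; since only the permutation with the smaller internal ID in tc_t1 is stored, both columns are probed, which is what the tc2 index is for):

-- related tags for one given tag, most frequently co-occurring first
select top 10 id_to_iri (other_tag), joint_count 
  from (select tc_t2 as other_tag, tc_count as joint_count 
          from tag_coincidence 
         where tc_t1 = iri_to_id ('http://example.org/tag/computer', 0)   -- 0 = do not mint a new IRI ID
        union all 
        select tc_t1, tc_count 
          from tag_coincidence 
         where tc_t2 = iri_to_id ('http://example.org/tag/computer', 0)) t 
  order by joint_count desc;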

To show the URIs of the tags:

select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count 
  from tag_coincidence 
  order by tc_count desc;

Social Networks

We look at what interests people have

sparql 
select ?o ?cnt  
where 
  {
    {
      select ?o count (*) as ?cnt 
        where 
          {
            ?s foaf:interest ?o
          } 
        group by ?o
    } 
    filter (?cnt > 100) 
  } 
order by desc 2 
limit 100
;

Now the same for the Harry Potter fans

sparql 
select ?i2 count (*) 
where 
  { 
    ?p foaf:interest <http://www.livejournal.com/interests.bml?int=harry+potter> .
    ?p foaf:interest ?i2 
  } 
group by ?i2 
order by desc 2 
limit 20
;

We see whether foaf:knows relations are symmetrical. We return the top n people that others claim to know without being reciprocally known.

sparql 
select ?celeb count (*) 
where 
  { 
    ?claimant foaf:knows ?celeb . 
    filter (!bif:exists ((select (1) 
                          where 
                            {
                              ?celeb foaf:knows ?claimant 
                            }))) 
  } 
group by ?celeb 
order by desc 2 
limit 10
;

We look for a well connected person to start from.

sparql 
select ?p count (*) 
where 
  {
    ?p foaf:knows ?k 
  } 
group by ?p 
order by desc 2 
limit 50
;

We look for the most connected of the many online identities of Stefan Decker.

sparql 
select ?sd count (distinct ?xx) 
where 
  { 
    ?sd a foaf:Person . 
    ?sd ?name ?ns . 
    filter (bif:contains (?ns, "'Stefan Decker'")) . 
    ?sd foaf:knows ?xx 
  } 
group by ?sd 
order by desc 2
;

We count the transitive closure of Stefan Decker's connections

sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = <mailto:stefan.decker@deri.org>)
  }
;

Now we do the same while following owl:sameAs links.

sparql 
define input:same-as "yes"
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = <mailto:stefan.decker@deri.org>)
  }
;

Demo System

The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. The system demonstrated hosts these 12 servers on 2 machines, each with 2 x Xeon 5345, 16GB memory, and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each process ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search-engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation. The timing difference between the first and second run is easily an order of magnitude.

# PermaLink Comments [0]
09/30/2008 12:24 GMT-0500 Modified: 10/03/2008 06:20 GMT-0500
The DARQ Matter of Federation

Astronomers propose that the universe is held together, so to speak, by the gravity of invisible "dark matter" spread in interstellar and intergalactic space.

For the data web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in cyberspace.

To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.

DARQ is Bastian Quilitz's federated extension of the Jena ARQ SPARQL processor. It has existed for a while and was also presented at ESWC2008. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.

Bastian had split DBpedia among five Virtuoso servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.

Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.

Since we are convinced of the cause, let's talk about the means.

For DARQ as it now stands, there's probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one HTTP/1.1 message. So, if the query is "get me my friends living in cities of over a million people," there will be the fragment "get city where x lives" and later "ask if population of x greater than 1000000". If I have 100 friends, I send the 100 requests in a batch to each eligible server.
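
For example, rather than sending one population ASK per city, the batch could go out as a single query with the candidate bindings inlined; a rough sketch, where the population predicate and the city IRIs stand in for whatever the friends' data actually uses:

SELECT ?city 
WHERE 
  { 
    ?city  <http://dbpedia.org/property/populationTotal>  ?pop . 
    FILTER ( ?pop > 1000000 && 
             ( ?city = <http://dbpedia.org/resource/Berlin>   || 
               ?city = <http://dbpedia.org/resource/Vienna>   || 
               ?city = <http://dbpedia.org/resource/Helsinki> ) ) 
  }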

Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over JDBC so Bastian can try this if interested.

These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.

When federating SQL, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:

If a foaf:Person is found on a given server, this does not mean that the Person's geek code or email hash will be on the same server. Thus {?p name "Johnny" . ?p geekCode ?g . ?p emailHash ?h } does not necessarily denote a colocated join if many servers serve items of the vocabulary.

However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total, recall. For search-style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there is a server that gives geek codes but not names.

For Virtuoso, we find that a federated query's asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.

For description, we would take DARQ's description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.

We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We'll see.

Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso Sponger. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.

The emergence of voiD (Vocabulary of Interlinked Data) is a step in the direction of making federation a reality. There is a separate post about this.

# PermaLink Comments [0]
06/09/2008 10:02 GMT-0500 Modified: 06/11/2008 15:15 GMT-0500
DBpedia Benchmark Revisited

We ran the DBpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries.

Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box.

We ran the queries in two variants:

  • With graph specified in the SPARQL FROM clause, using the default indices.
  • With no graph specified anywhere, using an alternate indexing scheme.

The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times.

First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient.

So we added the index:

create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);

         Virtuoso v5 with       Virtuoso Cluster v6      Virtuoso Cluster v6
         gspo, ogps, pogs       with gspo, ogps          with gspo, ogps, pogs
cold     210 s                  136 s                    33.4 s
warm     0.600 s                4.01 s                   0.628 s

OK, so now let us do it without a graph being specified. For all platforms, we drop any existing indices, and --

create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g));
alter index R2 on R2 partition (s int (0hexffff00));

log_enable (2);
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;

drop table rdf_quad;
alter table r2 rename RDF_QUAD;
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));

The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient.

With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia.

If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.

         Virtuoso v5 with       Virtuoso Cluster v6 with
         gspo, ogps, pogs       spog, pogs, opgs, gpos
warm     0.595 s                0.617 s

The cold times were about the same as above, so not reproduced.

Graph or No Graph?

It is in the SPARQL spirit to specify a graph, and for pretty much any application there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.

On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant different setups of the database, simple as that.

The latter use case is not really within the SPARQL spec, so implementations may or may not support it. For example, Oracle or Vertica would not do this well, since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should be able to do it, maybe with some configuring, as shown above.

Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some "super-graph" object, a graph of all graphs. I don't think this is directly supported and besides most apps would not need it.

Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given.

Conclusions

The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat.

Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content.

This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long.

The moral of the story is to have a lot of RAM and a space-efficient data representation.

The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic.

Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.

Appendix: Query Text

For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM <http://dbpedia.org> from each query. The returned row counts are indicated below each query's text.

sparql SELECT ?p ?o FROM <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };

-- 1337 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM <http://dbpedia.org> WHERE {
  ?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

--  23910 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director FROM <http://dbpedia.org> 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage FROM <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
   <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?berlinLat + 0.03190235436 &&
     ?long       >=     ?berlinLong - 0.08679199218 &&
     ?lat        >=     ?berlinLat - 0.03190235436 && 
     ?long       <=     ?berlinLong + 0.08679199218) };

-- 56 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage FROM <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
   <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?nyLat + 0.3190235436 &&
     ?long       >=     ?nyLong - 0.8679199218 &&
     ?lat        >=     ?nyLat - 0.3190235436 && 
     ?long       <=     ?nyLong + 0.8679199218) };

-- 13 rows
# PermaLink Comments [0]
05/09/2008 15:33 GMT-0500 Modified: 05/12/2008 11:24 GMT-0500