Provenance and Reification in Virtuoso
These days, data provenance is a big topic across the board, ranging from the linked data web, to RDF in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front.
RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.
For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level.
If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph.
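As a sketch of how such an annotation might be recorded (the graph IRIs and the xx: vocabulary here are made up for illustration), the publisher of a source graph could be written into a dedicated annotations graph like this:
-- Hypothetical annotation of a source graph; all IRIs are illustrative
sparql
PREFIX xx: <http://example.com/annot#>
INSERT INTO GRAPH <http://example.com/annotations>
  { <http://example.com/graphs/catalogue-2009-08>
        xx:has_publisher <http://example.com/org/the-library> ;
        xx:loaded_at "2009-08-08T00:00:00"^^xsd:dateTime . } ;
A query can then join from the data triples in the source graph to these annotation triples, as in the example below.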
On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.
One could write queries like —
SELECT ?pub
WHERE
{
GRAPH ?g
{
?person foaf:knows ?contact
}
?contact foaf:name "Alice" .
?g xx:has_publisher ?pub
}
This would return the publishers of graphs that assert that somebody knows Alice.
Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is, however, very inefficient and is not supported by any specific optimization. Reification does not seem to get used very much, so there is no great pressure to optimize it specially.
For all its inefficiency, the RDF reification vocabulary remains usable as long as reification is a rarity. But if we have to say things about specific triples frequently (say, for more than 10% or so of the triples), then modifying the quad table becomes an option.
Virtuoso's RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G,S,P,O.
Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.
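As a rough sketch of the dependent-column variant (the column names are invented for illustration, and the exact ALTER TABLE syntax and loader integration should be checked against the documentation), one could do something like:
-- Hypothetical extra columns on the quad table; not part of the standard schema
ALTER TABLE DB.DBA.RDF_QUAD ADD INS_USER VARCHAR;
ALTER TABLE DB.DBA.RDF_QUAD ADD INS_TIME DATETIME;
-- Set the metadata for one quad, given its G, S, P, O
UPDATE DB.DBA.RDF_QUAD
   SET INS_USER = 'bulk_loader', INS_TIME = now ()
 WHERE G = iri_to_id ('http://example.com/graphs/catalogue-2009-08')
   AND S = iri_to_id ('http://example.com/person/1')
   AND P = iri_to_id ('http://xmlns.com/foaf/0.1/knows')
   AND O = iri_to_id ('http://example.com/person/2');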
In SPARQL, one could use an extension syntax like —
SELECT *
WHERE
{ ?person foaf:knows ?connection
OPTION ( time ?ts ) .
?connection foaf:name "Alice" .
FILTER ( ?ts > "2009-08-08"^^xsd:dateTime )
}
This would return everybody whose foaf:knows statement about Alice was recorded after 2009-08-08. It presupposes that the quad table has been extended with a datetime column.
The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause.
If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like —
SELECT *
WHERE
{ ?person foaf:knows ?contact
OPTION ( reify ?r ) .
?r xx:assertion_time ?ts .
?contact foaf:name "Alice" .
FILTER ( ?ts > "2008-08-08"^^xsd:dateTime )
}
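The side table itself might be declared along these lines (a hypothetical sketch; the table name and exact layout are illustrative):
-- One row per reified quad: G, S, P, O identify the quad, R is its reification IRI
CREATE TABLE quad_reif
  ( g IRI_ID_8,
    s IRI_ID_8,
    p IRI_ID_8,
    o ANY,
    r IRI_ID_8,
    PRIMARY KEY ( g, s, p, o )
  );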
We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don't think there would be huge demand. Who knows? You tell us.
09/01/2009 10:44 GMT | Modified: 09/01/2009 11:20 GMT
Web Science and Keynotes at WWW 2009 (#4 of 5)
(Fourth of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
There was quite a bit of talk about what web science could or ought to be. I will here comment a bit on the panels and keynotes, in no special order.
In the web science panel, Tim Berners-Lee said that the deliverable of the web science initiative could be a way of making sense of all the world's data once the web had transformed into a database capable of answering arbitrary queries.
Michael Brodie of Verizon said that one deliverable would be a well considered understanding of the issue of counter-terrorism and civil liberties: Everything, including terrorism, operates on the platform of the web. How do we understand an issue that is not one of privacy, intelligence, jurisprudence, or sociology, but of all these and more?
I would add to this that it is not only a matter of governments keeping and analyzing vast amounts of private data, but of basically anybody who wants to do this being able to do so, even if at a smaller scale. In a way, the data web brings formerly government-only capabilities to the public, and is thus a democratization of intelligence and analytics. The citizen blogger increased the accountability of the press; the citizen analyst may have a similar effect. This is trickier though. We remember Jefferson's words about vigilance and the price of freedom. But vigilance is harder today, not because information is not there but because there is so much of it, with diverse spins put on it.
Tim B-L said at another panel that it seemed as if the new capabilities, especially the web as a database, were coming just in time to help us cope with the problems confronting the planet. With this, plus having everybody online, we would have more information, more creativity, more of everything at our disposal.
I'd have to say that the web is dual use: The bulk of traffic may contribute to distraction more than to awareness, but then the same infrastructure and the social behaviors it supports may also create unprecedented value and in the best of cases also transparency. I have to think of "For whosoever hath, to him shall be given." [Matthew 13:12] This can mean many things; here I am talking about whoever hath a drive for knowledge.
The web is both equalizing and polarizing: The equality is in the access; the polarity in the use made thereof. For a huge amount of noise there will be some crystallization of value that could not have arisen otherwise. Developments have unexpected effects. I would not have anticipated that gaming should advance supercomputing, for example.
Wendy Hall gave a dinner speech about communities and conferences; how the original hypertext conferences, with lots of representation of the humanities, became the techie WWW conference series; and how now we have the pendulum swinging back to more diversity with the web science conferences. So it is with life. Aside from the facts that there are trends and pendulum effects, and that paths that cross usually cross again, it is very hard to say exactly how these things play out.
At the "20 years of web" panel, there was a round of questions on how different people had been surprised by the web. Surprises ranged from the web's actual scalability to its rapid adoption and the culture of "if I do my part, others will do theirs." On the minus side, the emergence of spam and phishing were mentioned as unexpected developments.
Questions of simplicity and complexity got a lot of attention, along with network effects. When things hit the right simplicity at the right place (e.g., HTML and HTTP, which hypertext-wise were nothing special), there is a tipping point.
No barrier of entry, not too much modeling, was repeated quite a bit, also in relation to semantic web and ontology design. There is a magic of emergent effects when the pieces are simple enough: Organic chemistry out of a couple of dozen elements; all the world's information online with a few tags of markup and a couple of protocol verbs. But then this is where the real complexity starts — one half of it in the transport, the other in the applications, yet a narrow interface between the two.
This then raises the question of content- and application-aware networks. The preponderance of opinion was for separation of powers — keep carriers and content apart.
Michael Brodie commented in the questions to the first panel that simplicity was greatly overrated, that the world was in fact very complex. It seems to me that any field of human endeavor develops enough complexity to fully occupy the cleverest minds who undertake said activity. The life-cycle between simplicity and complexity seems to be a universal feature. It is a bit like the Zen idea that "for the beginner, rivers are rivers and mountains are mountains; for the student these are imponderable mysteries of bewildering complexity and transcendent dimension; but for the master these are again rivers and mountains." One way of seeing this is that the master, in spite of the actual complexity and interrelatedness of all things, sees where these complexities are significant and where not, and knows to communicate concerning these as fits the situation.
There is no fixed formula for saying where complexities and simplicities fit; relevance of detail is forever contextual. For technological systems, we find that there emerge relatively simple interfaces on either side of which there is huge complexity: the x86 instruction set, TCP/IP, SQL, to name a few. These are lucky breaks; it is very hard to say beforehand where they will emerge. Object-oriented people would like to see such interfaces everywhere, which just leads to problems of modeling.
There was a keynote from Telefonica about infrastructure. We heard that power and cooling cost more than the equipment, that data centers ought to be scaled down from the football-stadium, 20-megawatt scale, and that systems must be designed for partitioning, to name a few topics. This is all well accepted. The new question is whether storage should go into the network infrastructure. We have blogged that the network will be the database, and it is no surprise that a telco should have the same idea, just with slightly different emphasis and wording. For Telefonica, this is about efficiency of bulk delivery; for us, it is more about virtualized, queryable dataspaces. Both will be distributed, but issues of separation of powers may keep the roles of network and storage separate.
In conclusion, the network being the database was much more visible and accepted this year than last. The linked data web was in Tim B-L's keynote as it was in the opening speech by the Prince of Asturias.
04/30/2009 12:00 GMT | Modified: 04/30/2009 12:11 GMT
Web Scale and Fault Tolerance
One concern about Virtuoso Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale Linked Data Web. We will see how large scale, continuous operation, and redundancy are related.
It has been said many times — when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers.
Approaches to Fault Tolerance
Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of write speed.
With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.
There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.
In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.
This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application's requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.
Basics of Partition Fail-Over
For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.
The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.
Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.
When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice go via two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled — but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.
For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts, but the first kind can wait. Since the system must essentially run itself, it will wait for some time for the failed server to come back up. During this window, all reads of the failed partition go to the surviving copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the surviving copy, but then the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.
If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.
As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.
Levels of Tolerance and Performance
The only way to make failures totally invisible is to have everything in duplicate, provisioned so that the system never runs at more than half of its total capacity. This is often neither economical nor necessary, which is why we can do better, using the spare capacity for more than standby.
Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.
The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage.
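As a back-of-the-envelope illustration of this routing rule (purely illustrative; the actual routing is internal to the cluster), the copy serving a read can be picked from the position of the ID within its 64K slice:
-- Each 64K-wide slice is stored in two full copies; reads for the lower 32K
-- of a slice go to the first copy, reads for the upper 32K to the second.
SELECT CASE WHEN mod (131100, 65536) < 32768 THEN 1 ELSE 2 END AS read_copy;
-- 131100 falls in the 128K-192K slice and in its lower 32K, so copy 1 serves it.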
Of course, if there is a failure, the working set will degrade, but if this does not happen often or for long, it can be quite tolerable. The alternative is buying twice as much RAM, which likely means twice as many servers. This workload is memory intensive, so servers should have the maximum memory they can take without going to parts so expensive that one could get a whole new server for the price of doubling memory.
Background Bulk Processing
When loading data, the system is online in principle, but query response can be quite bad. A large RDF load will take over most of the memory, so queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server's partition of the database is four times the size of RAM or greater. Whether the work is a bulk load or a bulk delete makes little difference.
But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving online activities also take some updates, these updates will be committed on both copies, while the bulk load is committed on the second copy only. This is fine as long as the online updates and the bulk-loaded data do not overlap. When the bulk load is done, the second copy of each partition will have the full up-to-date state, including changes that came in during the bulk load. The online activity can now be redirected to the second copies, and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.
Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.
This technique applies to all data intensive background tasks — calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.
Configurations of Redundancy
Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the two first failures are guaranteed not to interrupt operations; for the third, there is a 1/10 chance that it will.
We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers' RAM when in normal operation.
There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon's Dynamo does something similar at the database level. The analogies are not exact, though.
If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra logic, the slices formerly hosted by the failed server will have their surviving copies spread randomly over the remaining servers, so a second failure is likely to take some slice entirely offline. This scheme equalizes load better but is less resilient.
Maintenance and Continuity
Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.
Present Status
The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.
04/01/2009 10:18 GMT | Modified: 04/01/2009 11:18 GMT
Beyond Applications - Introducing the Planetary Datasphere (Part 2)
We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, for end-user experience, application development, and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.
Application Hosting
For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.
Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.
What of Proprietary Data and its Security?
Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one's own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes.
What of Individual Privacy on the Open Web?
The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user's department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.
I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer's capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.
Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.
It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner's social network. Defining such policies is typically a task for the application developer.
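As a sketch (the xx: policy vocabulary and the IRIs are made up for illustration), a calendar application might mark each graph it manages with the people allowed to read it, and a policy-aware query would then only touch permitted graphs:
sparql
PREFIX xx: <http://example.com/policy#>
SELECT ?meeting ?when
WHERE { ?g xx:readable_by <http://example.com/people/alice#me> .
        GRAPH ?g { ?meeting a xx:Meeting ;
                            xx:starts ?when } } ;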
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.
The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one's disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.
Federation vs. Centralization
On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this.
A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.
The cost of hosting a trillion triples will vary radically as a function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies as a function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five million joins per second, even over infrequently-accessed data.
Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world's spectrum of query needs.
What is the Cost of Schema-Last?
The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email; schema evolution is continuous. Still, while TCP/IP can transport anything, in practice it transports a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.
We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.
While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.
Historically, our industry has been driven by two phenomena:
-
New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.
-
Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.
To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.
Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It's in the air.
DataSphere Precursors
Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.
We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.
We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.
Databases and Servers
Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere.
This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards.
The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines as building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSDs get a bit cheaper, one could have SSD for the database and disk for backup.
With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.
For operating such systems, scaling on demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these things are complex and easily become intractable without in-depth understanding of the field.
The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.
Conclusions and Next Steps
The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.
Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.
The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed at Semantic Web conferences. They have to do with the complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess, but not impossible. The details that remain to be worked out are nothing semantic; they are hardcore database matters, concerning automatic provisioning and the like.
It is as if the Semantic Web people look with envy at the Web 2.0 side, where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements as we near production with our next generation of software.
03/25/2009 10:50 GMT | Modified: 03/25/2009 12:31 GMT
Beyond Applications - Introducing the Planetary Datasphere (Part 1)
This is the first in a short series of blog posts about what becomes possible when essentially unlimited linked data can be deployed on the open web and private intranets.
The term DataSphere comes from Dan Simmons' Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.
This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.
I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person's or organization's contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it.
General
There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.
These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.
The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.
In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.
The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network — horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.
The place of the DBMS in the stack changes
The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.
For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can't just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google's web page cache can't be relied on as a permanent repository of web content.
Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.
End User Benefits
For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user's web presence is their blog or web site, plus their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user's Data Space is the collection of all these things, now presented in a queriable form. The user's Data Space is the user's statement of presence, referencing the diverse contributions of the user on diverse sites.
The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain-specific silos. The end user's interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self-describing data, represented in domain-specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where they wish.
Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of "meta-applications," i.e., unified interfaces for managing many related applications/tasks.
Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.
Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of "mashing up" web service interfaces with procedural code, but of "meshing" data through declarative queries across application spaces.
Applications in the DataSphere
The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.
Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.
For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., DBpedia vs. Freebase identifiers)? The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. The benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.
The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.
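A minimal sketch of such a lookup in plain SPARQL (the identifier is just an example) would be:
-- Which graphs mention a given identifier, either as subject or as object
sparql
SELECT DISTINCT ?g
WHERE { GRAPH ?g { { <http://dbpedia.org/resource/Berlin> ?p ?o }
                   UNION
                   { ?s ?p <http://dbpedia.org/resource/Berlin> } } } ;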
Advertising
The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.
Content Providers, License, and Attribution
For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.
Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.
One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.
Next we will look at the service provider and infrastructure side of this.
03/24/2009 09:38 GMT | Modified: 03/24/2009 10:50 GMT
"E Pluribus Unum", or "Inversely Functional Identity", or "Smooshing Without the Stickiness" (re-updated)
What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable.
This is complex, so I will begin with the point; the interested reader may read on for the details and implications. Starting with the soon-to-be-released version 6, Virtuoso allows you to say that two things are the same if they share a uniquely identifying property. Examples of uniquely identifying properties would be a book's ISBN, or a person's social security number plus full name. In relational language this is a unique key; in RDF parlance, an inverse functional property.
In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties, or all identifiers will be normalized at load time according to some application rules. This is good if the rules are clear and understood, as they are in closed situations where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut.
In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set.
On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems.
To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as "same enough," but a query about criminal prosecution would obviously need to be more precise about sameness.
Our ontology is defined like this:
-- Populate a named graph with the triples you want to use in query time inferencing
ttlp ( '
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
foaf:mbox_sha1sum a owl:InverseFunctionalProperty .
foaf:name a owl:InverseFunctionalProperty .
',
'xx',
'b3sifp'
);
-- Declare that the graph contains an ontology for use in query time inferencing
rdfs_rule_set ( 'http://example.com/rules/b3sifp#',
'b3sifp'
);
Then use it:
sparql
DEFINE input:inference "http://example.com/rules/b3sifp#"
SELECT DISTINCT ?k ?f1 ?f2
WHERE { ?k foaf:name ?n .
?n bif:contains "'Kjetil Kjernsmo'" .
?k foaf:knows ?f1 .
?f1 foaf:knows ?f2
};
VARCHAR VARCHAR VARCHAR
______________________________________ _______________________________________________ ______________________________
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin
....
Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person's FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same.
Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both.
If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT.
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable — nothing like the blow-up of duplicating properties across instances.
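A minimal sketch of such a materialization (the target graph IRI is illustrative, and a real application would use more careful matching rules than a shared foaf:mbox_sha1sum):
-- Materialize identity links into their own graph, leaving the source data untouched
sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT INTO GRAPH <http://example.com/sameas>
  { ?a owl:sameAs ?b }
WHERE { GRAPH ?g1 { ?a foaf:mbox_sha1sum ?m } .
        GRAPH ?g2 { ?b foaf:mbox_sha1sum ?m } .
        FILTER ( ?a != ?b ) } ;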
The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others.
The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at.
The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty.
Enough talk. Time for an experiment. We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:
- Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person.
- Give every duplicate URI of a person all the properties of all the duplicates. If these are the same value, the data should not get much bigger, or so we think.
For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant.
The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse.
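For instance, a trivial illustration of the two functions just mentioned:
-- internal ID of the foaf:name IRI
SELECT IRI_TO_ID ('http://xmlns.com/foaf/0.1/name');
-- and back to the qualified name
SELECT ID_TO_IRI (IRI_TO_ID ('http://xmlns.com/foaf/0.1/name'));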
We count the triples that enter into the smoosh:
-- The name is checked with an existence test because otherwise we would count
-- the same triples several times over, due to the names occurring in many graphs
sparql
SELECT COUNT(*)
WHERE { { SELECT DISTINCT ?person
WHERE { ?person a foaf:Person }
} .
FILTER ( bif:exists ( SELECT (1)
WHERE { ?person foaf:name ?nn }
)
) .
?person ?p ?o
};
-- We get 3284674
We make a few tables for intermediate results.
-- For each distinct name, gather the properties and objects from
-- all subjects with this name
CREATE TABLE name_prop
( np_name ANY,
np_p IRI_ID_8,
np_o ANY,
PRIMARY KEY ( np_name,
np_p,
np_o
)
);
ALTER INDEX name_prop
ON name_prop
PARTITION ( np_name VARCHAR (-1, 0hexffff) );
-- Map from name to canonical IRI used for the name
CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY,
ni_s IRI_ID_8
);
ALTER INDEX name_iri
ON name_iri
PARTITION ( ni_name VARCHAR (-1, 0hexffff) );
-- Map from person IRI to canonical person IRI
CREATE TABLE pref_iri
( i IRI_ID_8,
pref IRI_ID_8,
PRIMARY KEY ( i )
);
ALTER INDEX pref_iri
ON pref_iri
PARTITION ( i INT (0hexffff00) );
-- a table for the materialization where all aliases get all properties of every other
CREATE TABLE smoosh_ct
( s IRI_ID_8,
p IRI_ID_8,
o ANY,
PRIMARY KEY ( s,
p,
o
)
);
ALTER INDEX smoosh_ct
ON smoosh_ct
PARTITION ( s INT (0hexffff00) );
-- disable transaction log and enable row auto-commit. This is necessary, otherwise
-- bulk operations are done transactionally and they will run out of rollback space.
LOG_ENABLE (2);
-- Gather all the properties of all persons with a name under that name.
-- INSERT SOFT means that duplicates are ignored
INSERT SOFT name_prop
SELECT "n", "p", "o"
FROM ( sparql
DEFINE output:valmode "LONG"
SELECT ?n ?p ?o
WHERE { ?x a foaf:Person .
?x foaf:name ?n .
?x ?p ?o
}
) xx ;
-- Now choose for each name the canonical IRI
INSERT INTO name_iri
SELECT np_name,
( SELECT MIN (s)
FROM rdf_quad
WHERE o = np_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name')
) AS mini
FROM name_prop
WHERE np_p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- For each person IRI, map to the canonical IRI of that person
INSERT SOFT pref_iri (i, pref)
SELECT s,
ni_s
FROM name_iri,
rdf_quad
WHERE o = ni_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- Make a graph where all persons have one iri with all the properties of all aliases
-- and where person-to-person refs are canonicalized
INSERT SOFT rdf_quad (g,s,p,o)
SELECT IRI_TO_ID ('psmoosh'),
ni_s,
np_p,
COALESCE ( ( SELECT pref
FROM pref_iri
WHERE i = np_o
),
np_o
)
FROM name_prop,
name_iri
WHERE ni_name = np_name
OPTION ( loop, quietcast ) ;
-- A little explanation: The properties of names are copied into rdf_quad with the name
-- replaced with its canonical IRI. If the object has a canonical IRI, this is used as
-- the object, else the object is unmodified. This is the COALESCE with the sub-query.
-- This takes a little time. To check on the progress, take another connection to the
-- server and do
STATUS ('cluster');
-- It will return something like
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i
-- buffers 549481 253929 d 8 w 0 pfs
-- Now finalize the state; this makes it permanent. Else the work will be lost on server
-- failure, since there was no transaction log
CL_EXEC ('checkpoint');
-- See what we got
sparql
SELECT COUNT (*)
FROM <psmoosh>
WHERE {?s ?p ?o};
-- This is 2253102
-- Now make the copy where all have the properties of all synonyms. This takes so much
-- space we do not insert it as RDF quads, but make a special table for it so that we can
-- run some statistics. This saves time.
INSERT SOFT smoosh_ct (s, p, o)
SELECT s, np_p, np_o
FROM name_prop,
rdf_quad
WHERE o = np_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- as above, INSERT SOFT so as to ignore duplicates
SELECT COUNT (*)
FROM smoosh_ct;
-- This is 167360324
-- Find out where the bloat comes from
SELECT TOP 20 COUNT (*),
ID_TO_IRI (p)
FROM smoosh_ct
GROUP BY p
ORDER BY 1 DESC;
The results are:
54728777 http://www.w3.org/2002/07/owl#sameAs
48543153 http://xmlns.com/foaf/0.1/knows
13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512 http://xmlns.com/foaf/0.1/interest
11415867 http://xmlns.com/foaf/0.1/nick
6683963 http://xmlns.com/foaf/0.1/weblog
6650093 http://xmlns.com/foaf/0.1/depiction
4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629 http://xmlns.com/foaf/0.1/homepage
1776555 http://xmlns.com/foaf/0.1/holdsAccount
1219525 http://xmlns.com/foaf/0.1/based_near
305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965 http://xmlns.com/foaf/0.1/name
155131 http://xmlns.com/foaf/0.1/dateOfBirth
153001 http://xmlns.com/foaf/0.1/img
111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930 http://xmlns.com/foaf/0.1/gender
48517 http://www.w3.org/2004/02/skos/core#subject
45697 http://www.w3.org/2000/01/rdf-schema#label
44860 http://purl.org/vocab/bio/0.1/olb
Now compare this with the predicate distribution of the smoosh with identities canonicalized:
sparql
SELECT COUNT (*) ?p
FROM <psmoosh>
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY 1 DESC
LIMIT 20;
Results are:
748311 http://xmlns.com/foaf/0.1/knows
548391 http://xmlns.com/foaf/0.1/interest
140531 http://www.w3.org/2000/01/rdf-schema#seeAlso
105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497 http://xmlns.com/foaf/0.1/name
48099 http://www.w3.org/2004/02/skos/core#subject
45179 http://xmlns.com/foaf/0.1/depiction
40229 http://www.w3.org/2000/01/rdf-schema#comment
38272 http://www.w3.org/2000/01/rdf-schema#label
37378 http://xmlns.com/foaf/0.1/nick
37186 http://dbpedia.org/property/abstract
34003 http://xmlns.com/foaf/0.1/img
26182 http://xmlns.com/foaf/0.1/homepage
23795 http://www.w3.org/2002/07/owl#sameAs
17651 http://xmlns.com/foaf/0.1/mbox_sha1sum
17430 http://xmlns.com/foaf/0.1/dateOfBirth
15586 http://xmlns.com/foaf/0.1/page
12869 http://dbpedia.org/property/reference
12497 http://xmlns.com/foaf/0.1/weblog
12329 http://blogs.yandex.ru/schema/foaf/school
If we drop the owl:sameAs triples from the count, the bloat is somewhat smaller, but the result is still tens of times larger than the canonicalized copy or the initial state.
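For example, the count without the owl:sameAs rows can be read straight off the materialization table (a small sketch using only the table and functions defined above; the expected value follows from the counts reported earlier):
SELECT COUNT (*)
FROM smoosh_ct
WHERE p <> IRI_TO_ID ('http://www.w3.org/2002/07/owl#sameAs');
-- 167360324 - 54728777 = 112631547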
Now, when we try using the psmoosh graph, we still get results that differ from those over the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. They exist:
sparql
SELECT COUNT (*)
WHERE { ?s foaf:knows ?thing .
FILTER ( !bif:exists ( SELECT (1)
WHERE { ?thing foaf:name ?nn }
)
)
};
-- 1393940
So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here.
Finally, we calculate the smoosh blow up factors. We do not include owl:sameAs triples in the counts.
select (167360324 - 54728777) / 3284674.0;
34.290022997716059
-- 2229307 is the psmoosh triple count (2253102) minus its owl:sameAs triples (23795)
select 2229307 / 3284674.0;
0.678699621332284
So, to get a smoosh that is still not really the equivalent of the original, multiply the original triple count by about 34 (if every alias gets the properties of every other) or by about 0.68 (if synonyms are collapsed to a canonical IRI).
Making the smooshes does not take very long: some minutes for the small one, and 33 minutes for filling the smoosh_ct table. Inserting the big one as RDF quads would take longer, maybe a couple of hours. The runs were not optimally tuned, so the performance numbers only serve to show that smooshing takes time, probably more than is allowable in an interactive situation, no matter how the process is optimized.
|
12/16/2008 14:14 GMT
|
Modified:
12/16/2008 15:01 GMT
|
Virtuoso Anytime: No Query Is Too Complex (updated)
A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL end points, should the linked data web gain serious mass and traffic.
As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we'll do if, for example, somebody decides to count all the triples in the set.
How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?
Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.
Here we are looking for the top 10 people whom people claim to know without being known in return, like this:
SQL> sparql
SELECT ?celeb,
COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
FILTER (!bif:exists ( SELECT (1)
WHERE { ?celeb foaf:knows ?claimant }
)
)
}
GROUP BY ?celeb
ORDER BY DESC 2
LIMIT 10;
celeb callret-1
VARCHAR VARCHAR
________________________________________ _________
http://twitter.com/BarackObama 252
http://twitter.com/brianshaler 183
http://twitter.com/newmediajim 101
http://twitter.com/HenryRollins 95
http://twitter.com/wilw 81
http://twitter.com/stevegarfield 78
http://twitter.com/cote 66
mailto:adam.westerski@deri.org 66
mailto:michal.zaremba@deri.org 66
http://twitter.com/dsifry 65
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete
results, query interrupted by result timeout.
Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages
SQL> sparql
SELECT ?celeb,
COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
FILTER (!bif:exists ( SELECT (1)
WHERE { ?celeb foaf:knows ?claimant }
)
)
}
GROUP BY ?celeb
ORDER BY DESC 2
LIMIT 10;
celeb callret-1
VARCHAR VARCHAR
________________________________________ _________
http://twitter.com/JasonCalacanis 496
http://twitter.com/Twitterrific 466
http://twitter.com/ev 442
http://twitter.com/BarackObama 356
http://twitter.com/laughingsquid 317
http://twitter.com/gruber 294
http://twitter.com/chrispirillo 259
http://twitter.com/ambermacarthur 224
http://twitter.com/t 219
http://twitter.com/johnedwards 188
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete
results, query interrupted by result timeout.
Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages
The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better. But the response time was the same.
If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation.
One can also use this as a basis for paid services. The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.
This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.
|
12/11/2008 16:13 GMT
|
Modified:
12/12/2008 10:29 GMT
|
An Example of RDF Scalability
We hear it to exhaustion: where is RDF scalability? We have been suggesting for a while that this is a solved question. Here I will give some concrete numbers to back this up.
The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task.
Below we show how we do 3.3 million random triple lookups per second on two 8 core commodity servers producing complete results, joining across partitions. On a single 4 core server, the figure is about 1 million lookups per second. With a single thread, it is about 250K lookups per second. This is the good case. But even our worse case is quite decent.
We took a simple SPARQL query, counting how many people say they reciprocally know each other. In the Billion Triples Challenge data set, there are 25M foaf:knows quads of which 92K are reciprocal. Reciprocal here means that when x knows y in some graph, y knows x in the same or any other graph.
SELECT COUNT (*)
WHERE {
?p1 foaf:knows ?p2 .
?p2 foaf:knows ?p1
}
There is no guarantee that the triple of x knows y is in the same partition as the triple y knows x. Thus the join is randomly distributed, n partitions to n partitions.
We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking. Since then, we have corrected this.
If run on a single thread, this query would be a loop over all the quads with a predicate of foaf:knows, and an inner loop looking for a quad with 3 of 4 fields given (SPO). If we have a partitioned situation, we have a loop over all the foaf:knows quads in each partition, and an inner lookup looking for the reciprocal foaf:knows quad in whatever partition it may be found.
We have implemented this with two different message patterns:
- Centralized: One process reads all the foaf:knows quads from all processes. Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad. Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end.
- Symmetrical: Each process reads the foaf:knows quads in its partition, and every 50K quads it sends a batch of checks to each process that could have the reciprocal foaf:knows quad. At the end, the counts are gathered from all partitions. There is some additional control traffic, but we do not go into its details here.
Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 Virtuoso instances. The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache.
Centralized: 35,543 msec, 728,634 sequential + random lookups per second
Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s 316% cpu ...
Symmetrical: 7706 msec, 3,360,740 sequential + random lookups per second
Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s 1137% cpu ...
The second line is the summary from the cluster status report for the duration of the query. The interesting numbers are the KB/s and the %CPU. The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%. The point to note is that the symmetrical approach takes 4x less real time with under half the data transfer rate. Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation.
These figures represent the best and worst cases of distributed JOINing. If we have a straight sequence of JOINs, with single pattern optionals and existences and the order in which results are produced is not significant (i.e., there is aggregation, existence test, or ORDER BY), the symmetrical pattern is applicable. On the other hand, if there are multiple triple pattern optionals, complex sub-queries, DISTINCTs in the middle of the query, or results have to be produced in the order of an index, then the centralized approach must be used at least part of the time.
Also, if we must make transitive closures, which can be thought of as an extension of a DISTINCT in a subquery, we must pass the data through a single point before moving the bindings to the next JOIN in the sequence. This happens for example in resolving owl:sameAs at run time. However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, DISTINCTs, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens-of-thousands of variable bindings into a single message exchange. And if there are not that many intermediate results, then single query execution time is not a problem anyhow.
For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the foaf:knows relations and then running the foaf:knows relations through the hash. If the hash size is right, a hash lookup is somewhat better than an index lookup. The problem is that when the hash join is not the right solution, it is an expensive mistake: the best case is good; the worst case is very bad. But if there is no index then hash join is better than nothing. One problem of hash joins is that they make temporary data structures which, if large, will skew the working set. One must be quite sure of the cardinality before it is safe to try a hash join. So we do not do hash joins with RDF, but we do use them sometimes with relational data.
These same methods apply to relational data just as well. This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.
|
11/27/2008 11:23 GMT
|
Modified:
12/01/2008 12:09 GMT
|
State of the Semantic Web, Part 1 - Sociology, Business, and Messaging (update 2)
I was in Vienna for the Linked Data Practitioners gathering this week. Danny Ayers asked me if I would blog about the State of the Semantic Web or write the This Week's Semantic Web column. I don't have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to Virtuoso will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft.
Sören Auer asked me to say a few things about relational to RDF mapping. I will cite some highlights from this, as they pertain to the general scene. There was an "open hacking" session Wednesday night featuring lightning talks. I will use some of these too as a starting point.
The messaging?
The SWEO (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became "Data Unleashed." Why not? Let's give this some context.
So, if we are holding a Semantic Web 101 session, where should we begin? I hazard to guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world.
Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with myopenlink.net and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our Billion Triples Challenge demo. We will make a webcast about this to make it all clearer.
Behold: The Data Web is about the world becoming a database; writing SPARQL queries or triples is incidental. You will write FOAF files by hand just as little as you now write SQL insert statements for filling in your account information on Myspace.
Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the RDBMS was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more ad hoc. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced.
Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity?
A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain.
However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic.
What is the business value of this?
The most reliable of human drives is the drive to make oneself known. This drives everything, from any social scene to business communications to politics. Today, when you want to proclaim that you exist, you do so first on the Web. The Web did not become the prevalent medium because businesses loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web.
Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data.
Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with linked open data. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question.
We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times.
But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global, information-driven economy?
The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information.
But let's get back to more specific issues
At the lightning talks in Vienna, one participant said, "Man's enemy is not the lion that eats men, it's his own brother. Semantic Web's enemy is the XML Web services stack that ate its lunch." There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze.
SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the ad hoc component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit.
"I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it's too complex, the developer who says it is not what he's used to, and so on?"
This is an education question. One of the findings of SWEO's enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF-based technologies offer to make these more open-standards based, and open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human-resources assets are: experts can't be found even when they are in the next department, and reports and analysis get buried in wikis, spreadsheets, and emails.
Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise knowledge bases from HR, CRM, office automation, wikis, etc. The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.
|
10/24/2008 10:19 GMT
|
Modified:
10/27/2008 11:27 GMT
|
OpenLink Software's Virtuoso Submission to the Billion Triples Challenge
Introduction
We use Virtuoso 6 Cluster Edition to demonstrate the following:
- Text and structured information based lookups
- Analytics queries
- Analysis of co-occurrence of features like interests and tags.
- Dealing with identity of multiple IRIs (owl:sameAs)
The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension.
The demo queries can also be run directly against the SPARQL end point.
The demo is being worked on at the time of submission and may be shown online by appointment.
Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008, the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and OpenCyc ontologies.
Also examples involving owl:sameAs are being added, likewise with similarity metrics and search hit scores.
The Data
The database consists of the Billion Triples Challenge data sets and some additions like UMBEL. Also, the Freebase extract is newer than the challenge original.
The triple count is 1115 million.
In the case of web harvested resources, the data is loaded in one graph per resource.
In the case of larger data sets like DBpedia or the US Census, all triples of that provenance share a data-set-specific graph.
All string literals are additionally indexed in a full text index. No stop words are used.
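As a hedged aside, registering all string objects for text indexing is done with a single rule; the call below is the usual way to do this in Virtuoso, but treat the exact invocation as an assumption to check against the documentation:
-- index all string objects of all graphs in the full text index
DB.DBA.RDF_OBJ_FT_RULE_ADD (null, null, 'All');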
Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices.
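To make the scheme concrete, here is a hedged sketch of the corresponding declarations; the table and index names mirror the Virtuoso documentation, but the exact DDL there should be taken as authoritative, not this sketch:
-- primary key SPOG, plus three bitmap indices ending in S
create table DB.DBA.RDF_QUAD (G IRI_ID_8, S IRI_ID_8, P IRI_ID_8, O any,
                              primary key (S, P, O, G));
create bitmap index RDF_QUAD_GPOS on DB.DBA.RDF_QUAD (G, P, O, S);
create bitmap index RDF_QUAD_POGS on DB.DBA.RDF_QUAD (P, O, G, S);
create bitmap index RDF_QUAD_OPGS on DB.DBA.RDF_QUAD (O, P, G, S);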
The Queries
The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist on one hand of well-known SQL features, like aggregation with grouping and existence and value subqueries, and on the other of RDF-specific features. The latter include run-time RDFS and OWL inferencing support and backward-chaining subclasses and transitivity.
Simple Lookups
sparql
select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o))
where
{
?s ?p ?o .
filter (bif:contains (?o, "'semantic web'"))
}
limit 10
;
This looks up triples with "semantic web" in the object and makes a search hit summary of the literal, highlighting the search terms.
sparql
select ?tp count(*)
where
{
?s ?p2 ?o2 .
?o2 a ?tp .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?tp
order by desc 2
limit 40
;
This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.
What are these things called?
sparql
select ?lbl count(*)
where
{
?s ?p2 ?o2 .
?o2 rdfs:label ?lbl .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?lbl
order by desc 2
;
Many of these things do not have an rdfs:label. Let us use a more general concept of label which groups dc:title, foaf:name, and other name-like properties together. The subproperties are resolved at run time; there is no materialization.
sparql
define input:inference 'b3s'
select ?lbl count(*)
where
{
?s ?p2 ?o2 .
?o2 b3s:label ?lbl .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?lbl
order by desc 2
;
We can list sources by the topics they contain.
Below we look for graphs that mention terrorist bombing.
sparql
select ?g count(*)
where
{
graph ?g
{
?s ?p ?o .
filter (bif:contains (?o, "'terrorist bombing'"))
}
}
group by ?g
order by desc 2
;
Now some Web 2.0 tagging of search results: the tag cloud of "computer".
sparql
select ?lbl count (*)
where
{
?s ?p ?o .
?o bif:contains "computer" .
?s sioc:topic ?tg .
optional
{
?tg rdfs:label ?lbl
}
}
group by ?lbl
order by desc 2
limit 40
;
This query will find the posters who talk the most about sex.
sparql
select ?auth count (*)
where
{
?d dc:creator ?auth .
?d ?p ?o
filter (bif:contains (?o, "sex"))
}
group by ?auth
order by desc 2
;
Analytics
We look for people who are joined by having relatively uncommon interests but do not know each other.
sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2
where
{
{
select ?i count (*) as ?cnt
where
{ ?p foaf:interest ?i }
group by ?i
}
filter ( ?cnt > 1 && ?cnt < 10) .
?p1 foaf:interest ?i .
?p2 foaf:interest ?i .
filter (?p1 != ?p2 &&
!bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &&
!bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
?p1 foaf:nick ?n1 .
?p2 foaf:nick ?n2 .
}
order by ?cnt
limit 50
;
The query takes a fairly long time, mostly spent counting the interests over the 25M interest triples. It then takes people that share an interest and checks that neither claims to know the other. It then sorts the results, rarest interest first. The query can be written more efficiently, but it is here just to show that database-wide scans of the population are possible ad hoc.
Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF.
create table
tag_count (tcn_tag iri_id_8,
tcn_count int,
primary key (tcn_tag));
alter index
tag_count on tag_count partition (tcn_tag int (0hexffff00));
create table
tag_coincidence (tc_t1 iri_id_8,
tc_t2 iri_id_8,
tc_count int,
tc_t1_count int,
tc_t2_count int,
primary key (tc_t1, tc_t2));
alter index
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));
create index
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
How many times is each topic mentioned?
insert into tag_count
select *
from (sparql define output:valmode "LONG"
select ?t count (*) as ?cnt
where
{
?s sioc:topic ?t
}
group by ?t)
xx option (quietcast);
Take all t1, t2 where t1 and t2 are tags of the same subject, store only the permutation where the internal id of t1 < that of t2.
insert into tag_coincidence (tc_t1, tc_t2, tc_count)
select "t1", "t2", cnt
from
(select "t1", "t2", count (*) as cnt
from
(sparql define output:valmode "LONG"
select ?t1 ?t2
where
{
?s sioc:topic ?t1 .
?s sioc:topic ?t2
}) tags
where "t1" < "t2"
group by "t1", "t2") xx
where isiri_id ("t1") and
isiri_id ("t2")
option (quietcast);
Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster.
update tag_coincidence
set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.
A single select will return a Technorati-style related tags listing.
To show the URIs of the tags:
select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count
from tag_coincidence
order by tc_count desc;
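Finally, a hedged sketch of the related-tags lookup itself (the tag IRI is hypothetical). Since only the permutation with the smaller internal ID first is stored, a complete answer would also probe the tc2 index with the roles of tc_t1 and tc_t2 swapped:
select top 10 id_to_iri (tc_t2), tc_count, tc_t2_count
from tag_coincidence
where tc_t1 = iri_to_id ('http://dbpedia.org/resource/Semantic_Web')
order by tc_count desc;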
Social Networks
We look at what interests people have
sparql
select ?o ?cnt
where
{
{
select ?o count (*) as ?cnt
where
{
?s foaf:interest ?o
}
group by ?o
}
filter (?cnt > 100)
}
order by desc 2
limit 100
;
Now the same for the Harry Potter fans
sparql
select ?i2 count (*)
where
{
?p foaf:interest <http://www.livejournal.com/interests.bml?int=harry+potter> .
?p foaf:interest ?i2
}
group by ?i2
order by desc 2
limit 20
;
We see whether knows relations are symmetrical. We return the top n people that others claim to know without being reciprocally known.
sparql
select ?celeb, count (*)
where
{
?claimant foaf:knows ?celeb .
filter (!bif:exists ((select (1)
where
{
?celeb foaf:knows ?claimant
})))
}
group by ?celeb
order by desc 2
limit 10
;
We look for a well connected person to start from.
sparql
select ?p count (*)
where
{
?p foaf:knows ?k
}
group by ?p
order by desc 2
limit 50
;
We look for the most connected of the many online identities of Stefan Decker.
sparql
select ?sd count (distinct ?xx)
where
{
?sd a foaf:Person .
?sd ?name ?ns .
filter (bif:contains (?ns, "'Stefan Decker'")) .
?sd foaf:knows ?xx
}
group by ?sd
order by desc 2
;
We count the transitive closure of Stefan Decker's connections
sparql
select count (*)
where
{
{
select *
where
{
?s foaf:knows ?o
}
}
option (transitive, t_distinct, t_in(?s), t_out(?o)) .
filter (?s = <mailto:stefan.decker@deri.org>)
}
;
Now we do the same while following owl:sameAs links.
sparql
define input:same-as "yes"
select count (*)
where
{
{
select *
where
{
?s foaf:knows ?o
}
}
option (transitive, t_distinct, t_in(?s), t_out(?o)) .
filter (?s = <mailto:stefan.decker@deri.org>)
}
;
Demo System
The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions,
each served by a distinct server process. The system demonstrated hosts these 12 servers on 2
machines, each with 2 xXeon 5345 and 16GB memory and 4 SATA disks. For scaling, the processes
and corresponding partitions can be spread over a larger number of machines. If each ran on its
own server with 16GB RAM, the whole data set could be served from memory. This is desirable for
search engine or fast analytics applications. Most of the demonstrated queries run in memory on
second invocation. The timing difference between first and second run is easily an order of
magnitude.
|
09/30/2008 15:39 GMT
|
Modified:
10/03/2008 06:20 GMT
|