Beyond Applications - Introducing the Planetary Datasphere (Part 2)
We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.
Application Hosting
For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.
Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.
What of Proprietary Data and its Security?
Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one's own data. To take an analogy, if you have a bunker with closed-circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For sites with complex existing RDBMS security, the best approach is to map the RDBMS to RDF on the fly, so that all requests always run through the RDBMS. This implicitly preserves any policy- or label-based security schemes.
What of Individual Privacy on the Open Web?
The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user's department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.
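As a minimal sketch of what such a policy rewrite amounts to, assume hypothetical payroll and employee tables; the application issues the first query, and a policy-enforcing DBMS effectively runs the second. The table and column names are placeholders; the point is only that the extra condition is injected by the DBMS, not written by the application.
-- What the application asks for
select salary from payroll where emp_id = 1234;
-- What the DBMS effectively executes once the policy is applied:
-- the requesting user may only see rows for his or her own department.
select p.salary
from payroll p
where p.emp_id = 1234
  and p.dept_id = (select e.dept_id
                   from employee e
                   where e.login = current_user);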
I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer's capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.
Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.
It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner's social network. Defining such policies is typically a task for the application developer.
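To make this concrete, assume each meeting (or photo album) is stored in its own graph; a graph-level policy check might then amount to something like the ask below, evaluated before the graph is allowed to participate in answering a query. The graph IRI, the attendee property, and the OpenID are all placeholders, and in practice the policy layer would add such a check behind the scenes.
sparql
ask
from <http://example.com/calendars/alice/meeting-42>
where
  {
    ?meeting <http://example.com/schema#attendee> <http://openid.example.org/bob>
  }
;
If the ask succeeds, the requester's OpenID figures in the attendee list and the graph is visible to the query; otherwise the whole graph is simply left out. The grant attaches to the graph as a unit, not to individual triples.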
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a document is returned to the user, the Linked Data Web must enforce it whenever a query references a piece of data, even if that data is only an intermediate result never directly shown to the user.
The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one's disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.
Federation vs. Centralization
On the open web, there is the question of federation vs. centralization. If an application is seen as an interface to a vocabulary, it becomes largely agnostic on this question. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. For technical reasons, a single database cluster will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds of billions of triples. It seems to make sense to keep data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight, and applications more or less float on top of all of this.
A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.
The cost of hosting a trillion triples will vary radically as a function of the promised throughput. This may result in pricing per service level, a bit like ISP pricing varies as a function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five million joins per second, including over infrequently-accessed data.
Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world's spectrum of query needs.
What is the Cost of Schema-Last?
The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot declare that it will transport only email; the consequence is continuous schema evolution. Still, while TCP/IP can transport anything, in practice it transports a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.
We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.
While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.
Historically, our industry has been driven by two phenomena:
-
New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.
-
Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.
To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.
Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It's in the air.
DataSphere Precursors
Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.
We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.
We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.
Databases and Servers
Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere.
This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards.
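To put the earlier trillion-triple figure in rough perspective using only these numbers: at about 32G per billion triples, a trillion triples needs on the order of 32 TB of RAM, i.e., some 32 of the 1 TB commodity clusters mentioned above, or very loosely $3M in hardware if each comes in under $100K. These are back-of-the-envelope figures, not a quote.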
The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSD's get a bit cheaper, one could have SSD for the database and disk for backup.
With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.
For operating such systems, scaling on demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these matters are complex and easily become intractable without in-depth understanding of the field.
The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. Much cheaper than SAN's and Infiniband fabrics.
Conclusions and Next Steps
The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.
Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.
The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed in Semantic Web conferences. These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic, they are hardcore database, concerning automatic provisioning and such matters.
It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements, as we near production with our next generation of software.
03/25/2009 10:50 GMT-0500 | Modified: 03/25/2009 12:31 GMT-0500
OpenLink Software's Virtuoso Submission to the Billion Triples Challenge
Introduction
We use Virtuoso 6 Cluster Edition to demonstrate the following:
- Text and structured information based lookups
- Analytics queries
- Analysis of co-occurrence of features like interests and tags
- Dealing with the identity of multiple IRIs (owl:sameAs)
The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension.
The demo queries can also be run directly against the SPARQL end point.
The demo is still being worked on at the time of submission and may be shown online by appointment. Automatic annotation of the data based on named-entity extraction is also in progress; by the time of ISWC 2008, the set of sample queries will be enhanced with queries over the extracted named entities and their relationships in the UMBEL and OpenCyc ontologies. Examples involving owl:sameAs, similarity metrics, and search hit scores are likewise being added.
The Data
The database consists of the Billion Triples Challenge data sets plus some additions like UMBEL. Also, the Freebase extract is newer than the challenge original. The triple count is 1115 million.
In the case of web-harvested resources, the data is loaded with one graph per resource. In the case of larger data sets like DBpedia or the US Census, all triples from the same source share a data-set-specific graph.
All string literals are additionally indexed in a full text index. No stop words are used.
Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices.
The Queries
The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist on one hand of well-known SQL features, like aggregation with grouping and existence and value subqueries, and on the other of RDF-specific features. The latter include run-time RDFS and OWL inferencing, with backward-chaining subclasses and transitivity.
Simple Lookups
sparql
select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o))
where
{
?s ?p ?o .
filter (bif:contains (?o, "'semantic web'"))
}
limit 10
;
This looks up triples with "semantic web" in the object and makes a search-hit summary of the literal, highlighting the search terms.
sparql
select ?tp count(*)
where
{
?s ?p2 ?o2 .
?o2 a ?tp .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?tp
order by desc 2
limit 40
;
This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.
What are these things called?
sparql
select ?lbl count(*)
where
{
?s ?p2 ?o2 .
?o2 rdfs:label ?lbl .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?lbl
order by desc 2
;
Many of these things do not have an rdfs:label. Let us use a more general concept of label, which groups dc:title, foaf:name, and other name-like properties together. The subproperties are resolved at run time; there is no materialization.
sparql
define input:inference 'b3s'
select ?lbl count(*)
where
{
?s ?p2 ?o2 .
?o2 b3s:label ?lbl .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?lbl
order by desc 2
;
We can list sources by the topics they contain.
Below we look for graphs that mention terrorist bombing.
sparql
select ?g count(*)
where
{
graph ?g
{
?s ?p ?o .
filter (bif:contains (?o, "'terrorist bombing'"))
}
}
group by ?g
order by desc 2
;
Now for some Web 2.0 tagging of search results: the tag cloud of "computer".
sparql
select ?lbl count (*)
where
{
?s ?p ?o .
?o bif:contains "computer" .
?s sioc:topic ?tg .
optional
{
?tg rdfs:label ?lbl
}
}
group by ?lbl
order by desc 2
limit 40
;
This query will find the posters who talk the most about sex.
sparql
select ?auth count (*)
where
{
?d dc:creator ?auth .
?d ?p ?o
filter (bif:contains (?o, "sex"))
}
group by ?auth
order by desc 2
;
Analytics
We look for people who are joined by having relatively uncommon interests but do not know each other.
sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2
where
{
{
select ?i count (*) as ?cnt
where
{ ?p foaf:interest ?i }
group by ?i
}
filter ( ?cnt > 1 && ?cnt < 10) .
?p1 foaf:interest ?i .
?p2 foaf:interest ?i .
filter (?p1 != ?p2 &&
!bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &&
!bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
?p1 foaf:nick ?n1 .
?p2 foaf:nick ?n2 .
}
order by ?cnt
limit 50
;
The query takes a fairly long time, mostly spent counting the interests over 25M foaf:interest triples. It then takes people that share an interest and checks that neither claims to know the other. It then sorts the results rarest interest first. The query could be written more efficiently, but it is shown here to demonstrate that ad hoc, database-wide scans of the population are possible.
Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related-tags line at the bottom of a search result page, and it showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently maintained in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF.
create table
tag_count (tcn_tag iri_id_8,
tcn_count int,
primary key (tcn_tag));
alter index
tag_count on tag_count partition (tcn_tag int (0hexffff00));
create table
tag_coincidence (tc_t1 iri_id_8,
tc_t2 iri_id_8,
tc_count int,
tc_t1_count int,
tc_t2_count int,
primary key (tc_t1, tc_t2));
alter index
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));
create index
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
How many times is each topic mentioned?
insert into tag_count
select *
from (sparql define output:valmode "LONG"
select ?t count (*) as ?cnt
where
{
?s sioc:topic ?t
}
group by ?t)
xx option (quietcast);
Take all t1, t2 where t1 and t2 are tags of the same subject, store only the permutation where the internal id of t1 < that of t2.
insert into tag_coincidence (tc_t1, tc_t2, tc_count)
select "t1", "t2", cnt
from
(select "t1", "t2", count (*) as cnt
from
(sparql define output:valmode "LONG"
select ?t1 ?t2
where
{
?s sioc:topic ?t1 .
?s sioc:topic ?t2
}) tags
where "t1" < "t2"
group by "t1", "t2") xx
where isiri_id ("t1") and
isiri_id ("t2")
option (quietcast);
Now put the individual occurrence counts into the same table with the co-occurrence. This
denormalization makes the related tags lookup faster.
update tag_coincidence
set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.
A single select will return a Technorati-style related tags listing.
To show the URIs of the tags:
select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count
from tag_coincidence
order by tc_count desc;
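For completeness, a hedged sketch of the related-tags lookup itself, for one given tag. The tag IRI is a placeholder, and iri_to_id is assumed to be available as the inverse of id_to_iri:
-- Related tags for one tag, most frequently co-occurring first.
select top 10
       id_to_iri (case when tc_t1 = iri_to_id ('http://example.com/tag/computer')
                       then tc_t2 else tc_t1 end) as related_tag,
       tc_count
from tag_coincidence
where tc_t1 = iri_to_id ('http://example.com/tag/computer')
   or tc_t2 = iri_to_id ('http://example.com/tag/computer')
order by tc_count desc;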
Social Networks
We look at what interests people have
sparql
select ?o ?cnt
where
{
{
select ?o count (*) as ?cnt
where
{
?s foaf:interest ?o
}
group by ?o
}
filter (?cnt > 100)
}
order by desc 2
limit 100
;
Now the same for the Harry Potter fans
sparql
select ?i2 count (*)
where
{
?p foaf:interest <http://www.livejournal.com/interests.bml?int=harry+potter> .
?p foaf:interest ?i2
}
group by ?i2
order by desc 2
limit 20
;
We see whether foaf:knows relations are symmetrical. We return the top n people whom others claim to know without being known in return.
sparql
select ?celeb, count (*)
where
{
?claimant foaf:knows ?celeb .
filter (!bif:exists ((select (1)
where
{
?celeb foaf:knows ?claimant
})))
}
group by ?celeb
order by desc 2
limit 10
;
We look for a well connected person to start from.
sparql
select ?p count (*)
where
{
?p foaf:knows ?k
}
group by ?p
order by desc 2
limit 50
;
We look for the most connected of the many online identities of Stefan Decker.
sparql
select ?sd count (distinct ?xx)
where
{
?sd a foaf:Person .
?sd ?name ?ns .
filter (bif:contains (?ns, "'Stefan Decker'")) .
?sd foaf:knows ?xx
}
group by ?sd
order by desc 2
;
We count the transitive closure of Stefan Decker's connections
sparql
select count (*)
where
{
{
select *
where
{
?s foaf:knows ?o
}
}
option (transitive, t_distinct, t_in(?s), t_out(?o)) .
filter (?s = <mailto:stefan.decker@deri.org>)
}
;
Now we do the same while following owl:sameAs links.
sparql
define input:same-as "yes"
select count (*)
where
{
{
select *
where
{
?s foaf:knows ?o
}
}
option (transitive, t_distinct, t_in(?s), t_out(?o)) .
filter (?s = <mailto:stefan.decker@deri.org>)
}
;
Demo System
The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. The system demonstrated hosts these 12 servers on 2 machines, each with 2 x Xeon 5345, 16GB memory, and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each process ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation; the timing difference between first and second run is easily an order of magnitude.
09/30/2008 12:24 GMT-0500 | Modified: 10/03/2008 06:20 GMT-0500
SPARQL Extensions for Subqueries
Last time I said we had extended SPARQL for sub-queries. As a preview of the new functionality, let us look at a query from TPC-H.
Below is the Virtuoso SPARQL version of Q2.
sparql
define sql:signal-void-variables 1
prefix tpcd: <http://www.openlinksw.com/schemas/tpcd#>
prefix oplsioc: <http://www.openlinksw.com/schemas/oplsioc#>
prefix sioc: <http://rdfs.org/sioc/ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
select
?supp+>tpcd:acctbal,
?supp+>tpcd:name,
?supp+>tpcd:has_nation+>tpcd:name as ?nation_name,
?part+>tpcd:partkey,
?part+>tpcd:mfgr,
?supp+>tpcd:address,
?supp+>tpcd:phone,
?supp+>tpcd:comment
from <http://example.com/tpcd>
where {
?ps a tpcd:partsupp ; tpcd:has_supplier ?supp ; tpcd:has_part ?part .
?supp+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' .
?part tpcd:size 15 .
?ps tpcd:supplycost ?minsc .
{ select ?p min(?ps+>tpcd:supplycost) as ?minsc
where {
?ps a tpcd:partsupp ; tpcd:has_part ?p ; tpcd:has_supplier ?ms .
?ms+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' .
}
}
filter (?part+>tpcd:type like '%BRASS') }
order by
desc (?supp+>tpcd:acctbal)
?supp+>tpcd:has_nation+>tpcd:name
?supp+>tpcd:name
?part+>tpcd:partkey ;
Note the pattern
{ ?ms+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' }, which is shorthand for
{ ?ms tpcd:has_nation ?t1 . ?t1 tpcd:has_region ?t2 . ?t2 tpcd:name 'EUROPE' }
Also note a sub-query is used for determining the lowest supply cost for a part.
The SQL text of the query can be found in the TPC-H benchmark specification, reproduced below:
select s_acctbal, s_name, n_name,
p_partkey, p_mfgr, s_address,
s_phone, s_comment
from part, supplier, partsupp, nation, region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 15
and p_type like '%BRASS'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
and ps_supplycost = (
select min(ps_supplycost)
from partsupp, supplier, nation, region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE')
order by
s_acctbal desc, n_name, s_name, p_partkey;
For brevity we have omitted the declarations for mapping the TPC-H schema to its RDF equivalent. The mapping is straightforward, with each column mapping to a predicate and each table to a class.
This is now part of the next Virtuoso Open Source cut, due around next week.
As of this writing, we are going through TPC-H query by query, testing with mappings against both Virtuoso and Oracle databases.
We have also been busy measuring Virtuoso 6. Even after switching from 32-bit to 64-bit IDs for IRIs and objects, the new databases are about half the size of the same Virtuoso 5.0.2 databases. This does not include any stream compression like gzip for disk pages. The load and query speeds are higher because of the better working set; with everything in memory, they are about even with 5.0.2. So now on an 8G box, we load 1067 million LUBM triples at 39.7 Kt/s instead of 29 Kt/s with 5.0.2. Right now we are experimenting with clusters at Amazon EC2. We'll write about that in a bit.
01/16/2008 10:47 GMT-0500 | Modified: 04/14/2008 14:02 GMT-0500
Social Web RDF Store Benchmark
Elaborating on my previous post, as food for thought for an RDF store benchmarking activity under the W3C, I present the following rough sketch. At the end of this post, I propose some common business questions that should be answered by a social web aggregator.
The problem with these is that it is not really possible to ask interesting questions over a large database without involving some sort of counting and grouping. I feel that we simply cannot make a representative benchmark without these, quite regardless of the fact that SPARQL in its present form does not have these features. Hence I have simply stated the questions and left any implementation open. If this seems like an interesting direction, the nascent W3C benchmarking XG (experimental group) can refine the business questions, relative query frequencies, exact data set composition, etc.
Social Web RDF Benchmark
by Orri Erling
Goals
This benchmark models the use of RDF for representing and analyzing the use of social software by user communities. The benchmark consists of a scalable synthetic data set, a feed of updates to the data set, and a query mix. The data set reflects the common characteristics of the social web, with realistic distribution of connections, user-contributed content, commenting, tagging, and other social web activities. The data set is expressed in the FOAF and SIOC vocabularies. The query mix is divided between relatively short, dashboard- or search-engine-style lookups, and longer-running analytics queries.
The system being modeled is an aggregator of social web content; we could liken it to an RDF-based Technorati with some extra features.
Users can publish their favorite queries or mesh-ups as logical views served by the system. In this manner, queries come to depend on other queries, somewhat like SQL VIEWs can reference each other.
There is a small qualification data set that can be tested against the queries to validate that the system under test (SUT) produces the correct results.
The benchmark is scaled by number of users. To facilitate comparison, some predefined scales are offered, i.e., 100K, 300K, 1M, 3M, 10M users. Each simulated user both produces and consumes content. The level of activity of users is unevenly divided.
There are two work mixes — the browsing mix, which consists of a mix of lookups and contributing content, and the analytics mix, which consists of long-running queries for tracking the state of the network. For each 100 browsing mixes, one analytics mix is performed.
A benchmark run is at least 1h real-time in duration. The metric is calculated by the number of browsing mixes completed during the test window. This simulates 10% of the users being online at any one time, thus for a scale of 1M users, 100K browsing mixes will be simultaneously proceeding.
The test driver submits the work via HTTP. What load balancing or degree of parallel serving of the requests is used is left up to the SUT.
The metric is expressed as queries per second, taking the total number of queries executed by completed browsing mixes and dividing this by the real time of the measurement window. The metric is called qpsSW, for queries per second, socialweb. The cost metric is $/qpsSW, calculated by the costing rules of the TPC. If compute-on-demand infrastructure is used, the costing will be $/qpsSW/day.
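As a purely illustrative example of the arithmetic: if the completed browsing mixes execute a total of 3,600,000 queries during a one-hour measurement window, the metric is 3,600,000 / 3,600 s = 1,000 qpsSW; if the priced configuration totals $100,000, the cost metric would be $100/qpsSW.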
The test sponsor is the party contributing the result. The contribution consists of the metric and of a full disclosure report (FDR), written following a template given in the benchmark specification. The disclosure requirements follow the TPC practices, including publishing any configuration scripts, data definition language statements, timing for warm-up and test window, times for individual queries etc. All details of the hardware and software are disclosed.
Test Support Software
The software consists of the data generator and of a test driver. The test driver calls functions supplied by the test sponsor for performing the diverse operations in the test. Source code for any modifications of the test driver is to be published as part of the FDR.
Rules for SUT
Any hardware/software combination — including single machines, clusters, clusters rented from computer providers like Amazon EC2 — is eligible.
The SUT must produce correct answers for the validation queries against the validation data set.
The implementation of the queries is not restricted. These can be any SPARQL or other queries, application server based logic, stored procedures or other, in any language, provided full source code is provided in the FDR.
The data set is provided as serialized RDF. The means of storage are left up to the SUT. The basic intention is to use a triple store of some form, but the specific indexing, use of property tables, materialized views, and so forth, is left up to the test sponsor. All tuning and configuration is to be published in the FDR.
Simulated Workload
For each operation of each mix, the specification shall present:
-
The logical intent of the operation, the business question, e.g., What is the hot topic among my friends?
-
The question or update expressed in terms of the data in the data set.
-
Sample text of a query answering the question or pseudo-code for deriving the answer.
-
Result set layout, if applicable.
The relative frequencies of the queries are given in the query mix summary.
Browsing Mix
The browsing mix consists of the following operations:
Updates
For one new social contact, there are 10 posts and 20 comments.
Queries
-
What are the 10 most recent posts by somebody among my friends or their friends? This would be a typical dashboard item (a query sketch follows after this list).
-
What are the authoritative bloggers on topic x? This is a moderately complex ad-hoc query. Take posts tagged with the topic, count links to them, take the blogs containing them, show the 10 most cited blogs with the most recent posts with the tag. This would be typical of a stored query, like a parameterizable report.
-
How do I contact person x? Calculate the chain of common acquaintances best for reaching person x. For practicality, we do not do a full walk of anything, but just take the distinct persons within 2 steps of the user and within 2 steps of x, and look at the intersection.
-
Who are the people like me? Find the top 10 people ranked by count of tags in common in the person's tag cloud. The tag cloud is the set of interests and the set of tags in blog posts of the person.
-
Who reacts to or talks about me? Count of replies to material by the user, grouped by the commenting user and the site of the comment, top 20, sorted by count descending.
-
Who are my fans that I do not know? Same as above, excluding people within 2 steps.
-
Who are my competitors? Most prolific posters on topics of my interest that do not cite me.
-
Where is the action? On forums where I participate, what are the top 5 threads, as measured by posts in the last day. Show count of posts in the last day and the day before that.
-
How do I get there? Who are the people active around both topic x and y? This is defined by a person having participated during the last year in forums of x as well as of y. Forums are tagged by topics. The most active users are first. The ranking is proportional to the sum of the number of posts in x and y.
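As an illustration of the first browsing-mix query above, here is a hedged SPARQL sketch in the style of the Billion Triples demo queries. The person IRI is a placeholder, and the exact post and date properties would depend on the data set:
sparql
select ?post ?date
where
  {
    <http://example.com/people/me> foaf:knows ?f .
    { ?post sioc:has_creator ?f }
    union
    { ?f foaf:knows ?ff . ?post sioc:has_creator ?ff } .
    ?post dc:date ?date
  }
order by desc (?date)
limit 10
;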
Analytic Mix
These queries are typical questions about the state of the conversation space as a whole and can for example be published as a weekly summary page.
-
The fastest propagating idea - What is the topic with the most users who have joined in the last day? A user is considered to have joined if the user was not discussing this in the past 10 days.
-
Prime movers - What users start conversations? A conversation is the set of material in reply to or citing a post. The reply distance can be arbitrarily long; the citing distance is a direct link to the original post or a reply thereto. The number and extent of conversations contribute towards the score.
-
Geography - Over the last 10 days, for each geographic area, show the top 50 tags. The location is the location of the poster.
-
Social hubs - For each community, get the top 5 people who are central to it in terms of number of links to other members of the same community and in terms of being linked from posts. A community is the set of forums that have a specific topic.
11/21/2007 08:07 GMT-0500 | Modified: 04/25/2008 16:29 GMT-0500
Recent Virtuoso Developments
We have been working extensively on virtual database refinements. There are many SQL cost model adjustments to better model distributed queries, and we now support direct access to Oracle and Informix statistics system tables. Thus, when you attach a table from one or the other, you automatically get up-to-date statistics. This helps Virtuoso optimize distributed queries. The documentation has also been updated accordingly, with a new section on distributed query optimization.
On the applications side, we have been keeping up with the SIOC RDF ontology developments. All ODS applications now make their data available as SIOC graphs for download and SPARQL query access.
What is most exciting however is our advance in mapping relational data into RDF. We now have a mapping language that makes arbitrary legacy data in Virtuoso or elsewhere in the relational world RDF query-able. We will put out a white paper on this in a few days.
Also, we have some innovations in mind for optimizing the physical storage of RDF triples. We keep experimenting, now with our sights set on the high end of triple storage, towards billion-triple data sets. We are experimenting with a new, more space-efficient index structure for better working set behavior. Next week will yield the first results.
01/09/2007 01:35 GMT-0500 | Modified: 04/16/2008 16:53 GMT-0500
Ideas on RDF Store Benchmarking
This post presents some ideas and use cases for RDF store benchmarking.
Use Cases
- Basic triple storage and retrieval. The LUBM benchmark captures many aspects of this.
- Recursive rule application. The simpler cases of this are things like transitive closure.
- Mapping of relational data to RDF. Since relational benchmarks are well established, as in the TPC benchmarks, the schemas and test data generation can come from there. The problem is that the TPC D/H/R benchmarks consist almost exclusively of aggregation and grouping, which SPARQL does not have.
Benchmarking Triple Stores
An RDF benchmark suite should meet the following criteria:
- Have a single scale factor.
- Produce a single metric, queries per unit of time, for example. The metric should be concisely expressible, for example 10 qpsR at 100M, options 1, 2, 3. Due to the heterogeneous nature of the systems under test, the result's short form likely needs to specify the metric, scale and options included in the test.
- Have optional parts, such as different degrees of inferencing and maybe language extensions such as full text, as this is a likely component of any social software.
- Have a specification for a full disclosure report, TPC style, even though we can skip the auditing part in the interest of making it easy for vendors to publish results and be listed.
- Have a subject domain where real data are readily available and which is broadly understood by the community. For example, SIOC data about on-line communities seems appropriate. Typical degree of connectedness, number of triples per person, etc., can be measured from real files.
- Have a diverse enough workload. This should include initial bulk load of data, some adding of triples during the run and continuous query load.
The query load should illustrate the following types of operations:
- Basic lookups, such as would be made for filling in a person's home page in a social networks app. List data of user plus names and emails of friends. Relatively short joins, unions, and optionals.
- Graph operations like shortest path from individual to individual in a social network.
- Selecting data with drill-down, as in faceted browsing. For example, start with articles having tag t, see the distinct tags of articles with tag t, select another tag t2 to see the distinct tags of articles with both t and t2, and so forth (a sketch of one drill-down step follows after this list).
- Retrieving all closely related nodes, as in composing a SIOC snapshot over a person's post in different communities, the recent activity report for a forum etc. These will be construct or describe queries. The coverage of describe is unclear, hence construct may be better.
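A hedged sketch of one faceted drill-down step, with placeholder tag IRIs: given articles tagged t1, list the distinct co-occurring tags from which the user picks the next restriction.
sparql
select distinct ?t2
where
  {
    ?article sioc:topic <http://example.com/tag/t1> .
    ?article sioc:topic ?t2 .
    filter (?t2 != <http://example.com/tag/t1>)
  }
;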
If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to TPC C. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.
Full Disclosure Report
The report contains basic TPC-like items such as:
- Metric qps/scale/options
- Software used, DBMS, RDF toolkit if separate
- Hardware. Number, clock and type of CPUs per machine, number of machines in cluster, RAM per machine, disks per machine, manufacturer, price of hardware/software
These can go into a summary spreadsheet that is just like the TPC ones.
Additionally, the full report should include:
- Configuration files for DBMS, web server, other components.
- Parameters for test driver, i.e., number of clicks and how many concurrent clients. The tester determines the degree of parallelism that gets the best throughput and should indicate this in the report. Making a graph of throughput as a function of concurrent clients is a lot of work and maybe not necessary here.
- Duration in real time. Since for any large database with a few G of working set the warm up time is easily 30 minutes, the warm up time should be mentioned but not included in the metric. The measured interval should not be less than 1h in duration and should reflect a "steady state," as defined in the TPC rules.
- Source code of server side application logic. This can be inference rules, stored procedures, dynamic web pages or any other server side software-like thing that exists or is modified for the purpose of the test.
- Specification of test driver. If there is a commonly used test driver, its type, parameters and version. If the test driver is custom, reference to its source code.
- Database sizes. For a preallocated database of n G, how much was free after the initial load, how much after the test run? How many bytes per triple.
- CPU/IO. This may not always be readily measurable but is interesting still. Maybe a realistic spec is listing the sum of CPU minutes across all server machines and server processes. For IO, maybe the system totals from iostat before and after the full run, including load and warm-up. If the DBMS and RDF toolkits are separate, it is interesting to know the division of CPU time between them.
Test Drivers
OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads. For example, 10000 users with 100 threads, each user with its own state, so that they carry out their respective usage patterns independently, getting served as soon as the server is available, still having no more than m requests going at any time. The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart etc. This can be modified to browse a social network database and produce the desired query mix. This generates HTTP requests, hence would work against a SPARQL end point or any set of dynamic web pages.
The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.
This can be packaged as a separate open source download once the test spec is agreed upon.
For generating test data, a modification of the LUBM generator is probably the most convenient choice.
Benchmarking Relational to RDF Mapping
This area is somewhat more complex than triple storage.
At least the following factors enter into the evaluation:
- Degree of SPARQL compliance. For example, can one have a variable as predicate? Are there limits on optionals and unions?
- Are the data being queried split over multiple RDBMS and joined between them?
- Type of use case. Is this about navigational lookups or about statistics? OLTP or OLAP? It would be the former, as SPARQL does not really have aggregation. Still, many of the interesting queries are about comparing large data sets.
The rationale for mapping relational data to RDF is often data integration. Even in simple cases like the OpenLink Data Spaces applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.
A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.
A real world case is OpenLink's ongoing work for mapping WordPress, Mediawiki, phpBB, Drupal, and possibly other popular web applications into SIOC.
Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.
Another "enterprise style" scenario might be to take the TPC C and TPC D databases — after all both have products, customers and orders — and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.
Considering the times and the audience, the WordPress/Mediawiki scenario might be culturally more interesting and more fun to demo.
The test has two aspects: Throughput and coverage. I think these should be measured separately.
The throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."
Then there are various pathological queries that work especially poorly with mapping. For example, if the types of the subjects are not given, if the predicate is known only at run time, or if the graph is not given, we get a union of everything joined with another union of everything, where many of the joins between the terms of the different unions are identically empty, but the software may not know this.
In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".
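A hedged sketch of the kind of query meant, with a placeholder namespace standing in for the product ontology; neither the subject's class nor the predicate is fixed, so a relational mapping has to consider every candidate table:
sparql
select ?p ?o
where
  {
    ?s rdfs:label "gizmo" .
    ?s ?p ?o .
    filter (regex (str (?p), '^http://example.com/product-ontology#'))
  }
;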
It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.
11/21/2006 09:22 GMT-0500 | Modified: 04/16/2008 16:53 GMT-0500
More RDF scalability tests
We have lately been busy with RDF scalability. We work with the 8000 university LUBM data set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h.
The resulting database is 75G, or 74 bytes per triple, which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly-filled pages. See Advances in Virtuoso RDF Triple Storage for an in-depth treatment of the subject.
The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range over the whole key. But before we go to so much trouble, we'll look at dropping a couple of critical sections from index random access. Also, some kernel parameters may be adjustable, like the spin count before calling the scheduler when trying to acquire an occupied mutex. Still, we should not waste too much time on platform specifics. We'll see.
We just updated the Virtuoso Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.
We are also now applying the relational to RDF mapping discussed in Declarative SQL Schema to RDF Ontology Mapping to the ODS applications.
There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.
We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the cost model that there is now. A more systematic review of the cost model for SPARQL will come when we get to the queries.
We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the SQL engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.
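As a sketch of what evaluating subsumption at run time means, with placeholder class IRIs: a query for instances of a superclass can be answered as a union over its subclasses, instead of storing the entailed rdf:type triples.
sparql
select ?x
where
  {
    { ?x a <http://example.com/onto#Professor> }
    union
    { ?x a <http://example.com/onto#AssociateProfessor> }
    union
    { ?x a <http://example.com/onto#AssistantProfessor> }
  }
;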
11/01/2006 15:36 GMT-0500 | Modified: 04/16/2008 16:53 GMT-0500
Virtuoso and ODS Update
We have released an update of Virtuoso Open Source Edition and the OpenLink Data Spaces suite.
This marks the coming of age of our RDF and SPARQL efforts. We have the new SQL cost model with SPARQL awareness, we have applications which present much of their data as SIOC, FOAF, ATOM OWL and other formats.
We continue refining these technologies. Our next roadmap item is mapping relational data into RDF and offering SPARQL access to relational data without data duplication. Expect a white paper about this soon.
08/10/2006 07:55 GMT-0500 | Modified: 04/16/2008 16:53 GMT-0500