"E Pluribus Unum", or "Inversely Functional Identity", or "Smooshing Without the Stickiness" (re-updated)
[
Orri Erling
]
What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable.
This is complex, so I will begin with the point; the interested may read on for the details and implications. Starting with the soon-to-be-released version 6, Virtuoso allows you to say that two things are the same if they share a uniquely identifying property. Examples of uniquely identifying properties would be a book's ISBN, or a person's social security number plus full name. In relational language this is a unique key, and in RDF parlance, an inverse functional property.
In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut.
In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set.
On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems.
To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as "same enough," but a query about criminal prosecution would obviously need to be more precise about sameness.
Our ontology is defined like this:
-- Populate a named graph with the triples you want to use in query time inferencing
ttlp ( '
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
foaf:mbox_sha1sum a owl:InverseFunctionalProperty .
foaf:name a owl:InverseFunctionalProperty .
',
'xx',
'b3sifp'
);
-- Declare that the graph contains an ontology for use in query time inferencing
rdfs_rule_set ( 'http://example.com/rules/b3sifp#',
'b3sifp'
);
Then use it:
sparql
DEFINE input:inference "http://example.com/rules/b3sifp#"
SELECT DISTINCT ?k ?f1 ?f2
WHERE { ?k foaf:name ?n .
?n bif:contains "'Kjetil Kjernsmo'" .
?k foaf:knows ?f1 .
?f1 foaf:knows ?f2
};
VARCHAR VARCHAR VARCHAR
______________________________________ _______________________________________________ ______________________________
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin
....
Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person's FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same.
Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both.
If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT.
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable — nothing like the blow-up of duplicating properties across instances.
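To make this concrete, here is a minimal sketch (not part of the original scripts) of materializing such owl:sameAs links into a separate graph; the graph IRI is hypothetical, and the rule used here (same foaf:name implies same person) is just the simple one used elsewhere in this post:
-- Sketch: record sameness as owl:sameAs triples in a dedicated graph,
-- leaving the source data untouched. The graph IRI is illustrative.
sparql
INSERT INTO GRAPH <http://example.com/sameas>
  { ?a owl:sameAs ?b }
WHERE { ?a a foaf:Person ; foaf:name ?n .
        ?b a foaf:Person ; foaf:name ?n .
        FILTER ( ?a != ?b )
      };
The application can then include or exclude this graph per query, which keeps the sameness rules swappable.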
The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others.
The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at.
The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty.
Enough talk. Time for an experiment. We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:
- Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person.
- Give every duplicate URI of a person all the properties of all the duplicates. If these are the same value, the data should not get much bigger, or so we think.
For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits, but they should not be statistically significant.
The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse.
We count the triples that enter into the smoosh:
-- the foaf:name is checked with an existence test because otherwise we would get
-- several times more results, the names occurring in many graphs
sparql
SELECT COUNT(*)
WHERE { { SELECT DISTINCT ?person
WHERE { ?person a foaf:Person }
} .
FILTER ( bif:exists ( SELECT (1)
WHERE { ?person foaf:name ?nn }
)
) .
?person ?p ?o
};
-- We get 3284674
We make a few tables for intermediate results.
-- For each distinct name, gather the properties and objects from
-- all subjects with this name
CREATE TABLE name_prop
( np_name ANY,
np_p IRI_ID_8,
np_o ANY,
PRIMARY KEY ( np_name,
np_p,
np_o
)
);
ALTER INDEX name_prop
ON name_prop
PARTITION ( np_name VARCHAR (-1, 0hexffff) );
-- Map from name to canonical IRI used for the name
CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY,
ni_s IRI_ID_8
);
ALTER INDEX name_iri
ON name_iri
PARTITION ( ni_name VARCHAR (-1, 0hexffff) );
-- Map from person IRI to canonical person IRI
CREATE TABLE pref_iri
( i IRI_ID_8,
pref IRI_ID_8,
PRIMARY KEY ( i )
);
ALTER INDEX pref_iri
ON pref_iri
PARTITION ( i INT (0hexffff00) );
-- a table for the materialization where all aliases get all properties of every other
CREATE TABLE smoosh_ct
( s IRI_ID_8,
p IRI_ID_8,
o ANY,
PRIMARY KEY ( s,
p,
o
)
);
ALTER INDEX smoosh_ct
ON smoosh_ct
PARTITION ( s INT (0hexffff00) );
-- disable transaction log and enable row auto-commit. This is necessary, otherwise
-- bulk operations are done transactionally and they will run out of rollback space.
LOG_ENABLE (2);
-- Gather all the properties of all persons with a name under that name.
-- INSERT SOFT means that duplicates are ignored
INSERT SOFT name_prop
SELECT "n", "p", "o"
FROM ( sparql
DEFINE output:valmode "LONG"
SELECT ?n ?p ?o
WHERE { ?x a foaf:Person .
?x foaf:name ?n .
?x ?p ?o
}
) xx ;
-- Now choose for each name the canonical IRI
INSERT INTO name_iri
SELECT np_name,
( SELECT MIN (s)
FROM rdf_quad
WHERE o = np_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name')
) AS mini
FROM name_prop
WHERE np_p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- For each person IRI, map to the canonical IRI of that person
INSERT SOFT pref_iri (i, pref)
SELECT s,
ni_s
FROM name_iri,
rdf_quad
WHERE o = ni_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- Make a graph where all persons have one iri with all the properties of all aliases
-- and where person-to-person refs are canonicalized
INSERT SOFT rdf_quad (g,s,p,o)
SELECT IRI_TO_ID ('psmoosh'),
ni_s,
np_p,
COALESCE ( ( SELECT pref
FROM pref_iri
WHERE i = np_o
),
np_o
)
FROM name_prop,
name_iri
WHERE ni_name = np_name
OPTION ( loop, quietcast ) ;
-- A little explanation: The properties of names are copied into rdf_quad with the name
-- replaced with its canonical IRI. If the object has a canonical IRI, this is used as
-- the object, else the object is unmodified. This is the COALESCE with the sub-query.
-- This takes a little time. To check on the progress, take another connection to the
-- server and do
STATUS ('cluster');
-- It will return something like
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i
-- buffers 549481 253929 d 8 w 0 pfs
-- Now finalize the state; this makes it permanent. Else the work will be lost on server
-- failure, since there was no transaction log
CL_EXEC ('checkpoint');
-- See what we got
sparql
SELECT COUNT (*)
FROM <psmoosh>
WHERE {?s ?p ?o};
-- This is 2253102
-- Now make the copy where all have the properties of all synonyms. This takes so much
-- space we do not insert it as RDF quads, but make a special table for it so that we can
-- run some statistics. This saves time.
INSERT SOFT smoosh_ct (s, p, o)
SELECT s, np_p, np_o
FROM name_prop,
rdf_quad
WHERE o = np_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- as above, INSERT SOFT so as to ignore duplicates
SELECT COUNT (*)
FROM smoosh_ct;
-- This is 167360324
-- Find out where the bloat comes from
SELECT TOP 20 COUNT (*),
ID_TO_IRI (p)
FROM smoosh_ct
GROUP BY p
ORDER BY 1 DESC;
The results are:
54728777 http://www.w3.org/2002/07/owl#sameAs
48543153 http://xmlns.com/foaf/0.1/knows
13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512 http://xmlns.com/foaf/0.1/interest
11415867 http://xmlns.com/foaf/0.1/nick
6683963 http://xmlns.com/foaf/0.1/weblog
6650093 http://xmlns.com/foaf/0.1/depiction
4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629 http://xmlns.com/foaf/0.1/homepage
1776555 http://xmlns.com/foaf/0.1/holdsAccount
1219525 http://xmlns.com/foaf/0.1/based_near
305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965 http://xmlns.com/foaf/0.1/name
155131 http://xmlns.com/foaf/0.1/dateOfBirth
153001 http://xmlns.com/foaf/0.1/img
111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930 http://xmlns.com/foaf/0.1/gender
48517 http://www.w3.org/2004/02/skos/core#subject
45697 http://www.w3.org/2000/01/rdf-schema#label
44860 http://purl.org/vocab/bio/0.1/olb
Now compare with the predicate distribution of the smoosh with identities canonicalized
sparql
SELECT COUNT (*) ?p
FROM <psmoosh>
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY 1 DESC
LIMIT 20;
Results are:
748311 http://xmlns.com/foaf/0.1/knows
548391 http://xmlns.com/foaf/0.1/interest
140531 http://www.w3.org/2000/01/rdf-schema#seeAlso
105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497 http://xmlns.com/foaf/0.1/name
48099 http://www.w3.org/2004/02/skos/core#subject
45179 http://xmlns.com/foaf/0.1/depiction
40229 http://www.w3.org/2000/01/rdf-schema#comment
38272 http://www.w3.org/2000/01/rdf-schema#label
37378 http://xmlns.com/foaf/0.1/nick
37186 http://dbpedia.org/property/abstract
34003 http://xmlns.com/foaf/0.1/img
26182 http://xmlns.com/foaf/0.1/homepage
23795 http://www.w3.org/2002/07/owl#sameAs
17651 http://xmlns.com/foaf/0.1/mbox_sha1sum
17430 http://xmlns.com/foaf/0.1/dateOfBirth
15586 http://xmlns.com/foaf/0.1/page
12869 http://dbpedia.org/property/reference
12497 http://xmlns.com/foaf/0.1/weblog
12329 http://blogs.yandex.ru/schema/foaf/school
We can drop the owl:sameAs triples from the count, so the bloat is a bit less by that but it still is tens of times larger than the canonicalized copy or the initial state.
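One way to get that count, sketched here rather than taken from the original script, is to filter the predicate in SQL:
-- count the materialized triples without the owl:sameAs ones (sketch)
SELECT COUNT (*)
  FROM smoosh_ct
 WHERE p <> iri_to_id ('http://www.w3.org/2002/07/owl#sameAs');
-- expected from the figures above: 167360324 - 54728777 = 112631547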
Now, when we try using the psmoosh graph, we still get results that differ from those over the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. They do exist:
sparql
SELECT COUNT (*)
WHERE { ?s foaf:knows ?thing .
FILTER ( !bif:exists ( SELECT (1)
WHERE { ?thing foaf:name ?nn }
)
)
};
-- 1393940
So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here.
Finally, we calculate the smoosh blow up factors. We do not include owl:sameAs triples in the counts.
select (167360324 - 54728777) / 3284674.0;
34.290022997716059
select 2229307 / 3284674.0;
= 0.678699621332284
So, to get a smoosh that is not even really equivalent to the original, multiply the original triple count by roughly 34 if synonyms are not collapsed, or by roughly 0.68 if they are.
Making the smooshes does not take very long: some minutes for the small one, and 33 minutes to fill the smoosh_ct table. Inserting the big one as RDF quads would take longer, maybe a couple of hours. The runs were not optimally tuned, so the numbers serve only to show that smooshing takes time; probably more time than is allowable in an interactive situation, no matter how the process is optimized.
|
12/16/2008 14:14 GMT
|
Modified:
12/16/2008 15:01 GMT
|
XTech Talks covering Linked Data
[
Kingsley Uyi Idehen
]
Courtesy of a post by Chris Bizer to the LOD community mailing list, here is a list of Linked Data oriented talks at the upcoming XTech 2008 event (also see the XTech 2008 Schedule, which is Linked Data friendly). Of course, I am posting this to my Blog Data Space with the sole purpose of adding data to the rapidly growing Giant Global Graph of Linked Data, basically adding to my collection of live Linked Data utility demos :-)
Here is the list:
- Linked Data Deployment (Daniel Lewis, OpenLink Software)
- The Programmes Ontology (Tom Scott, BBC and all)
- SemWebbing the London Gazette (Jeni Tennison, The Stationery Office)
- Searching, publishing and remixing a Web of Semantic Data (Richard Cyganiak, DERI Galway)
- Building a Semantic Web Search Engine: Challenges and Solutions (Aidan Hogan, DERI Galway)
- 'That's not what you said yesterday!' - evolving your Web API (Ian Davis, Talis)
- Representing, indexing and mining scientific data using XML and RDF: Golem and CrystalEye (Andrew Walkingshaw, University of Cambridge)
For the time challenged (i.e., those unable to view this post using its permalink / URI as a data source via the OpenLink RDF Browser, Zitgist Data Viewer, DISCO Hyperdata Browser, or Tabulator), the benefits of this post are as follows:
- automatic URI generation for all linked items in this post
- automatic propagation of tags to del.icio.us, Technorati, and PingTheSemanticWeb
- automatic association of formal meanings to my Tags using the MOAT Ontology
- automatic collation and generation of statistical data about my tags using the SCOT Ontology (*missing link is a callout to SCOT Tag Ontology folks to sort the project's home page URL at the very least*)
- explicit typing of my Tags as SKOS Concepts.
Put differently, I cost-effectively contribute to the GGG across all Web interaction dimensions (1.0, 2.0, 3.0) :-)
|
05/02/2008 14:53 GMT
|
Modified:
05/05/2008 17:07 GMT
|
Linked Data and Information Architecture
[
Orri Erling
]
We had a workshop on Linked Open Data (LOD) last week in Beijing. You can see the papers in the program. The event was a success with plenty of good talks and animated conversation. I will not go into every paper here but will comment a little on the conversation and draw some technology requirements going forward.
Tim Berners-Lee showed a read-write version of Tabulator. This raises the question of updating on the Data Web. The consensus was that one could assert what one wanted in one's own space but that others' spaces would be read-only. What spaces one considered relevant would be the user's or developer's business, as in the document web.
It seems to me that a significant use case of LOD is an open-web situation where the user picks a broad read-only "data wallpaper" or backdrop of assertions, and then uses this combined with a much smaller, local, writable data set. This is certainly the case when editing data for publishing, as in Tim's demo. This will also be the case when developing mesh-ups combining multiple distinct data sets bound together by sets of SameAs assertions, for example. Questions like, "What is the minimum subset of n data sets needed for deriving the result?" will be common. This will also be the case in applications using proprietary data combined with open data.
This means that databases will have to deal with queries that specify large lists of included graphs, all graphs in the store or all graphs with an exclusion list. All this is quite possible but again should be considered when architecting systems for an open linked data web.
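As an illustration of the query shapes this implies (a sketch only; the graph IRIs are hypothetical), a query over a chosen "data wallpaper" plus a local workspace might list its included graphs explicitly:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label
FROM <http://example.com/backdrop/dbpedia>
FROM <http://example.com/backdrop/geonames>
FROM <http://example.com/local/workspace>
WHERE { ?s rdfs:label ?label }
The exclusion-list case can be expressed by binding the graph to a variable and filtering out the unwanted graphs, which is exactly the kind of pattern an engine must plan efficiently over very many graphs.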
"There is data but what can we really do with it? How far can we trust it, and what can we confidently decide based on it?"
As an answer to this question, Zitgist has compiled the UMBEL taxonomy using SKOS. This draws on Wikipedia, Open CYC, Wordnet, and YAGO, hence the acronym WOWY. UMBEL is both a taxonomy and a set of instance data, containing a large set of named entities, including persons, organizations, geopolitical entities, and so forth. By extracting references to this set of named entities from documents and correlating this to the taxonomy, one gets a good idea of what a document (or part thereof) is about.
Kingsley presented this in the Zitgist demo. This is our answer to the criticism about DBpedia having errors in classification. DBpedia, as a bootstrap stage, is about giving names to all things. Subsequent efforts like UMBEL are about refining the relationships.
"Should there be a global URI dictionary?"
There was a talk by Paolo Bouquet about the Entity Name System, a sort of data DNS, with the purpose of associating some description and rough classification to URIs. This would allow discovering URIs for reuse. I'd say that this is good if it can cut down on the SameAs proliferation and if this can be widely distributed and replicated for resilience, à la DNS. On the other hand, it was pointed out that this was not quite in the LOD spirit, where parties would mint their own dereferenceable URIs, in their own domains. We'll see.
"What to do when identity expires?"
Giovanni of Sindice said that a document should be removed from search if it was no longer available. Kingsley pointed out that resilience of reference requires some way to recover data. The data web cannot be less resilient than the document web, and there is a point to having access to history. He recommended hooking up with the Internet Archive, since they make long term persistence their business. In this way, if an application depends on data, and the URIs on which it depends are no longer dereferenceable, or provide content from a new owner of the domain, those who need the old version can still get it and host it themselves.
It is increasingly clear that OWL SameAs is both the blessing and bane of linked data. We can easily have tens of URIs for the same thing, especially with people. Still, these should be considered the same.
Returning every synonym in a query answer hardly makes sense but accepting them as input seems almost necessary. This is what we do with Virtuoso's SameAs support. Even so, this can easily double query times even when there are no synonyms.
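As a rough sketch of what this looks like in Virtuoso (the person URI below is hypothetical; the pragma is Virtuoso-specific), enabling owl:sameAs expansion at query time lets any known synonym URI be used as input:
sparql
DEFINE input:same-as "yes"
SELECT ?mbox
WHERE { <http://example.org/people#alice> foaf:mbox ?mbox };
If some graph asserts that <http://example.org/people#alice> is owl:sameAs another URI, mailboxes stated against either identifier are returned, without the synonyms themselves flooding the result set.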
Be that as it may, SameAs is here to stay; just consider the mapping of DBpedia to Geonames, for example.
Also, making aberrant SameAs statements can completely poison a data set and lead to absurd query results. Hence choosing which SameAs assertions from which source will be considered seems necessary. In an open web scenario, this leads inevitably to multi-graph queries that can be complex to write with regular SPARQL. By extension, it seems that a good query would also include the graphs actually used for deriving each result row. This is of course possible but has some implications on how databases should be organized.
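In plain SPARQL, surfacing the contributing graph per result row is already expressible, so a hedged sketch of such a provenance-carrying query (IRIs illustrative) is simply:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name ?g
WHERE { GRAPH ?g { ?person foaf:name ?name } }
The open question is less the syntax than making this cheap when the store holds very many graphs.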
Yves Raymond gave a talk about deriving identity between Musicbrainz and Jamendo. I see the issue as a core question of linked data in general. The algorithm Yves presented started with attribute value similarities and then followed related entities. Artists would be the same if they had similar names and similar names of albums with similar song titles, for example. We can find the same basic question in any analysis, for example, looking at how news reporting differs between media, supposing there is adequate entity extraction.
There is basic graph diffing in RDFSync, for example. But here we are expanding the context significantly. We will traverse references to some depth, allow similarity matches, SameAs, and so forth. Having presumed identity of two URIs, we can then look at the difference in their environment to produce a human readable summary. This could then be evaluated for purposes of analysis or of combining content.
At first sight, these algorithms seem well parallelizable, as long as all threads have access to all data. For scaling, this probably means a message-bound distributed algorithm. This is something to look into for the next stage of linked data.
Some inference is needed, but if everybody has their own choice of data sets to query, then everybody would also have their own entailed triples. This will make for an explosion of entailed graphs if forward chaining is used. Forward chaining is very nice because it keeps queries simple and easy to optimize. With Virtuoso, we still favor backward chaining since we expect a great diversity of graph combinations and near infinite volume in the open web scenario. With private repositories of slowly changing data put together for a special application, the situation is different.
In conclusion, we have a real LOD movement with actual momentum and a good idea of what to do next. The next step is promoting this to the broader community, starting with Linked Data Planet in New York in June.
|
04/29/2008 12:08 GMT
|
Modified:
04/29/2008 17:18 GMT
|
My 5 Favorite Things about Linked Data on the Web
[
Kingsley Uyi Idehen
]
- End to Buzzword Blur - how buzzwords are used to obscure comprehension of core concepts. Let SKOS, MOAT, SCOT reign!
- End of Data Silos - you don't own me, my data, my data's mobility (import/export), or accessibility (by reference) just because I signed up for Yet Another Software as Service (ySaaS)
- End of Misinformation - Sins of omission will no longer go unpunished; the era of self-induced amnesia due to competitive concerns is over, and co-opetition shall reign (Ray Noorda always envisioned this reality)
- Serendipitous information and data discovery gets cheaper by the second - you're only a link away from a universe of relevant and accessible data
- Rise of Quality - Contrary to historic precedent (due to all of the above), well-engineered solutions will no longer be sure indicators of commercial failure
BTW - Benjamin Nowack penned an interesting post titled: Semantic Web Aliases, that covers a variety of labels used to describe the Semantic Web. The great thing about this post is that it provides yet another demonstration-in-the-making for the virtues of Linked Data :-)
Labels are harmless when their sole purpose is the creation of routes of comprehension for concepts. Unfortunately, labels aren't always constructed with concept comprehension in mind; most of the time they are artificial inflectors and deflectors servicing marketing communications goals.
Anyway, irrespective of actual intent, I've endowed all of the labels from Bengee's post with URIs as my contribution to this important disambiguation effort re. the Semantic Web:
As per usual, this post is best appreciated when processed via a Linked Data aware user agent.
|
03/05/2008 04:49 GMT
|
Modified:
03/09/2008 11:48 GMT
|
Additional OpenLink Data Spaces Features
[
Kingsley Uyi Idehen
]
Daniel Lewis has published another post about OpenLink Data Spaces (ODS) functionality titled: A few new features in OpenLink Data Spaces, which exposes additional features (some hot out of the oven).
OpenLink Data Spaces (ODS) now officially supports:
Which means that OpenLink Data Spaces support all of the main standards being discussed in the DataPortability Interest Group!
APML Example:
All users of ODS automatically get a dynamically created APML file, for example: APML profile for Kingsley Idehen
The URI for an APML profile is: http://myopenlink.net/dataspace/<ods-username>/apml.xml
Meaning of a Tag Example:
All users of ODS automatically have tag cloud information embedded inside their SIOC file, for example: SIOC for Kingsley Idehen on the Myopenlink.net installation of ODS.
But even better, MOAT has been implemented in the ODS Tagging System. This has been demonstrated in a recent test blog post by my colleague Mitko Iliev, the blog post comes up on the tag search: http://myopenlink.net/dataspace/imitko/weblog/Mitko%27s%20Weblog/tag/paris
Which can be put through the OpenLink Data Browser:
OAuth Example:
OAuth Tokens and Secrets can be created for any ODS application. To do this:
- you can log in to the MyOpenlink.net beta service, the Live Demo ODS installation, an EC2 instance, or your local installation
- then go to ‘Settings’
- and then you will see ‘OAuth Keys’
- you will then be able to choose the applications that you have instantiated and generate the token and secret for that app.
Related Document (Human) Links
Remember (as per my most recent post about ODS), ODS is about unobtrusive fusion of Web 1.0, 2.0, and 3.0+ usage and interaction patterns. Thanks to a lot of recent standardization in the Semantic Web realm (e.g., SPARQL), we now employ the MOAT, SKOS, and SCOT ontologies as vehicles for Structured Tagging.
Structured Tagging?
This is how we take a key Web 2.0 feature (think 2D in a sense) and bend it over to create a Linked Data Web (Web 3.0) experience unobtrusively (see earlier posts re. Dimensions of Web). Thus, nobody has to change how they tag or where they tag; just expose ODS to the URLs of your Web 2.0 tagged content and it will produce URIs (Structured Data Object Identifiers) and a linked data graph for your Tags Data Space (nee Tag Cloud). ODS will construct a graph which exposes tag-subject association, tag concept alignment / intended meaning, and tag frequencies, which ultimately deliver "relative disambiguation" of intended Tag Meaning (i.e., you can easily discern the tagger's meaning via the Tag's actual Data Space, which is associated with the tagger). In a nutshell, the dynamics of relevance matching, ranking, and the like change immensely, without futile, endless debates about matters such as:
What's the Linked Data value proposition?
What's the Linked Data business model?
What's the Semantic Web Killer application?
We can just get on with demonstrating Linked Data value using what exists on the Web today. This is the approach we are deliberately taking with ODS.
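Returning to the structured tagging described above, here is a hypothetical sketch (not from the post) of the kind of query such a tag graph supports, assuming posts point at their tags via sioc:topic and each tag is exposed as a skos:Concept with a skos:prefLabel:
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?tagLabel (COUNT(DISTINCT ?post) AS ?freq)
WHERE { ?post sioc:topic ?tag .
        ?tag skos:prefLabel ?tagLabel }
GROUP BY ?tagLabel
ORDER BY DESC(?freq)
This is the kind of tag-frequency and tag-meaning information the ODS graph makes queryable.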
Related Items
Tip: This post is best viewed via an RDF aware User Agent (e.g. a Browser or Data Viewer). I say this because the permalink of this post is a URI in a Linked Data Space (My Blog) comprised of more data than meets the eye (i.e. what you see when you read this post via a Document Web Browser) :-)
|
02/09/2008 17:54 GMT
|
Modified:
02/11/2008 11:38 GMT
|
Enterprise 0.0, Linked Data, and Semantic Data Web
[
Kingsley Uyi Idehen
]
Last week we officially released Virtuoso 5.0.1 (in Commercial and Open Source Editions). The press release provided us with an official mechanism and timestamp for the current Virtuoso feature set.
A vital component of the new Virtuoso release is the finalization of our SQL to RDF mapping functionality -- enabling the declarative mapping of SQL Data to RDF. Additional technical insight covering other new features (delivered and pending) is provided by Orri Erling, as part of a series of post-Banff posts.
Why is SQL to RDF Mapping a Big Deal?
A majority of the world's data (especially in the enterprise realm) resides in SQL Databases. In addition, Open Access to the data residing in said databases remains the biggest challenge to enterprises for the following reasons:
- SQL Data Sources are inherently heterogeneous because they are acquired with business applications that are in many cases inextricably bound to a particular DBMS engine
- Data is predictably dirty
- DBMS vendors ultimately hold the data captive and have traditionally resisted data access standards such as ODBC (*trust me they have, just look at the unprecedented bad press associated with ODBC, the only truly platform-independent data access API. Then look at how this bad press arose..*)
Enterprises have known from the beginning of modern corporate times that data access, discovery, and manipulation capabilities are inextricably linked to the "Real-time Enterprise" nirvana (hence my use of 0.0 before this becomes 3.0).
In my experience, as someone who has operated in the data access and data integration realms since the late '80s, I've painfully observed enterprises pursue, but fail to attain, full control over enterprise data (the prized asset of any organization) such that data-, information-, and knowledge-workers are just a click away from commencing coherent, platform- and database-independent data drill-downs and/or discovery that transcend intranet, internet, and extranet boundaries -- serendipitous interaction with relevant data, without compromise!
Okay, situation analysis done; we move on.
At our most recent (12th June) monthly Semantic Web Gathering, I unveiled to TimBL and a host of other attendees a simple, but powerful, demonstration of how Linked Data, as an aspect of the Semantic Data Web, can be applied to enterprise data integration challenges.
Actual SQL to RDF Mapping Demo / Experiment
Hypothesis
A SQL Schema can be effectively mapped declaratively to RDF such that SQL Rows morph into RDF Instance Data (Entity Sets) based on the Concepts & Properties defined in a Concrete Conceptual Data Model oriented Data Dictionary ( RDF Schema and/or OWL Ontology). In addition, the solution must demonstrate how "Linked Data in the Web" is completely different from "Data on the Web" or "Linked Data on the Web" (btw - Tom Heath eloquently unleashed this point in his recent podcast interview with Talis).
Apparatus
An Ontology - in this case we simply derived the Northwind Ontology from the XML Schema based CSDL ( Conceptual Schema Definition Language) used by Microsoft's public Astoria demo (specifically the Northwind Data Services demo).
SQL Database Schema - Northwind (comes bundled with ACCESS, SQL Server, and Virtuoso) comprised of tables such as: Customer, Employee, Product, Category, Supplier, Shipper etc.
OpenLink Virtuoso - SQL DBMS Engine (although this could have been any ODBC or JDBC accessible Database), SQL-RDF Metaschema Language, HTTP URL-rewriter, WebDAV Engine, and DBMS hosted XSLT processor
Client Tools - iSPARQL Query Builder, RDF Browser (which could also have been Tabulator or DISCO or a standard Web Browser)
Experiment / Demo
- Declaratively map the Northwind SQL Schema to RDF using the Virtuoso Meta Schema Language (see: Virtuoso PL based Northwind_SQL_RDF script)
- Start browsing the data by clicking on the URIs that represent the RDF Data Model Entities resulting from the SQL to RDF Mapping
Observations
-
Via a single Data Link click I was able to obtain specific information about the Customer represented by the URI "ALFKI" (act of URI Dereferencing as you would an Object ID in an Object or Object-Relational Database)
-
Via a
Dynamic Data Page I was able to explore all the entity relationships or specific entity data (i.e Exploratory or Entity specific dereferencing) in the Northwind Data Space
-
I was able to perform similar exploration (as per item 2) using our OpenLink Browser.
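For concreteness, below is a minimal sketch of what querying the mapped Northwind data looks like once the RDF view is in place. The graph IRI, the northwind: prefix, and the class/property names are hypothetical illustrations only; the actual terms come from the Northwind Ontology derived from the CSDL, so treat this as a sketch rather than the exact query used in the demo.
PREFIX northwind: <http://demo.openlinksw.com/schemas/northwind#>
SELECT ?customer ?company
FROM <http://demo.openlinksw.com/Northwind>
WHERE {
  ?customer a northwind:Customer ;
            northwind:companyName ?company ;
            northwind:country "Germany" .
}
ORDER BY ?company
Dereferencing a single entity, as in the "ALFKI" observation above, corresponds to a DESCRIBE over its URI (again, the entity URI shown here is illustrative):
DESCRIBE <http://demo.openlinksw.com/Northwind/Customers/ALFKI>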
Conclusions
The vision of data, information, or knowledge at your fingertips is nigh! Thanks to the infrastructure provided by the Semantic Data Web (URIs, RDF Data Model, variety of RDF Serialization Formats[1][2][3], and Shared Data Dictionaries / Schemas / Ontologies [1][2][3][4][5]) it's now possible to Virtualize enterprise data from the Physical Storage Level, through the Logical Data Management Levels (Relational), up to a Concrete Conceptual Model (Graph) without operating system, development environment or framework, or database engine lock-in.
Next Steps
Next, we produce a shared ontology for the CRM and Business Reporting domains. I hope this experiment clarifies how achievable this is by converting XML Schemas to RDF Data Dictionaries (RDF Schemas or Ontologies). Stay tuned :-)
Also watch TimBL amplify and articulate Linked Data value in a recent interview.
Other Related Matters
To deliver a mechanism that facilitates the crystallization of this reality is a contribution of boundless magnitude (as we shall all see in due course). Thus, it is easy to understand why even "Her Majesty", the Queen of England, simply had to get in on the act and appoint TimBL to the Order of Merit :-)
Note: All of the demos above now work with IE & Safari (a "remember what Virtuoso is" epiphany) by simply putting Virtuoso's DBMS-hosted XSLT engine to use :-) This also applies to my earlier collection of demos from the Hello Data Web and other Data Web & Linked Data related demo-style posts.
|
06/14/2007 15:28 GMT
|
Modified:
02/04/2008 23:19 GMT
|
SPARQL, Ajax, Tagging, Folksonomies, Shared Ontologies and Semantic Web
[
Kingsley Uyi Idehen
]
A quick dump that demonstrates how I integrate tags and links from del.icio.us with links from my local bookmark database via one of my public Data Spaces (this demo uses the kidehen Data Space).
SPARQL (query language for the Semantic Web) basically enables me to query a collection of typed links (predicates/properties/attributes) in my Data Space (ODS based of course) without breaking my existing local bookmarks database or the one I maintain at del.icio.us.
I am also demonstrating how Web 2.0 concepts such as Tagging mesh nicely with the more formal concept of Topics in the Semantic Web realm. The key to all of this is the ability to generate RDF Data Model Instance Data based on Shared Ontologies such as SIOC (from DERI's SIOC Project) and SKOS (again showing that Ontologies and Folksonomies are complementary).
This demo also shows that Ajax works well in the Semantic Web realm (or web dimension of interaction 3.0), especially when you have a toolkit with Data Aware controls (for SQL, RDF, and XML) such as OAT (OpenLink Ajax Toolkit). For instance, we've successfully used this to build a Visual Query Building Tool for SPARQL (alpha) that takes a lot of the pain out of constructing SPARQL Queries (there is much more to come on this front re. handling of DISTINCT, FILTER, ORDER BY, etc.).
For now, take a look at the SPARQL Query dump generated by this SIOC & SKOS SPARQL QBE Canvas Screenshot.
You can cut and paste the queries that follow into the Query Builder or use the screenshot to build your variation of this query sample. Alternatively, you can simply click on *This* SPARQL Protocol URL to see the query results in a basic HTML Table. And one last thing, you can grab the SPARQL Query File saved into my ODS-Briefcase (the WebDAV repository aspect of my Data Space).
Note the following SPARQL Protocol Endpoints:
-
MyOpenLink Data Space
-
Experimental Data Space SPARQL Query Builder (you need to register at http://myopenlink.net:8890/ods to use this version)
-
Live Demo Server
-
Demo Server SPARQL Query Builder (use: demo for both username and pwd when prompted)
My beautified version of the SPARQL generated by the QBE (you can cut and paste it into the "Advanced Query" section of the QBE) is presented below:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dct: <http://purl.org/dc/elements/1.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?forum_name ?owner ?post ?title ?link ?url ?tag
FROM <http://myopenlink.net/dataspace>
WHERE {
?forum a sioc:Forum;
sioc:type "bookmark";
sioc:id ?forum_name;
sioc:has_member ?owner.
?owner sioc:id "kidehen".
?forum sioc:container_of ?post .
?post dct:title ?title .
optional { ?post sioc:link ?link }
optional { ?post sioc:links_to ?url }
optional { ?post sioc:topic ?topic.
?topic a skos:Concept;
skos:prefLabel ?tag}.
}
Unmodified dump from the QBE (this will be beautified automatically in due course by the QBE):
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dct: <http://purl.org/dc/elements/1.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?var8 ?var9 ?var13 ?var14 ?var24 ?var27 ?var29 ?var54 ?var56
WHERE
{
graph ?graph {
?var8 rdf:type sioc:Forum .
?var8 sioc:container_of ?var9 .
?var8 sioc:type "bookmark" .
?var8 sioc:id ?var54 .
?var8 sioc:has_member ?var56 .
?var9 rdf:type sioc:Post .
OPTIONAL {?var9 dc:title ?var13} .
OPTIONAL {?var9 sioc:links_to ?var14} .
OPTIONAL {?var9 sioc:link ?var29} .
?var9 sioc:has_creator ?var37 .
OPTIONAL {?var9 sioc:topic ?var24} .
?var24 rdf:type skos:Concept .
OPTIONAL {?var24 skos:prefLabel ?var27} .
?var56 rdf:type sioc:User .
?var56 sioc:id "kidehen" .
}
}
Current missing items re. Visual QBE for SPARQL are:
-
Ability to Save properly to WebDAV so that I can then expose various saved SPARQL Queries (.rq file) from my Data Space via URIs
-
Handling of DISTINCT and FILTERs (note: OPTIONAL is handled via dotted predicate-links); a hand-edited example follows after this list
- General tidying up re. click event handling etc.
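Until the QBE handles these directly, DISTINCT and FILTER can simply be added by hand to the generated text. Below is a rough illustration that reuses the graph and prefixes from the beautified query above; the tag value "sparql" is just an example, not a term from the demo data.
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dct: <http://purl.org/dc/elements/1.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?post ?title ?tag
FROM <http://myopenlink.net/dataspace>
WHERE {
  ?forum a sioc:Forum ;
         sioc:type "bookmark" ;
         sioc:container_of ?post .
  ?post dct:title ?title ;
        sioc:topic ?topic .
  ?topic a skos:Concept ;
         skos:prefLabel ?tag .
  # FILTER added by hand until the QBE generates it
  FILTER ( regex(str(?tag), "sparql", "i") )
}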
Note:
You can even open up your own account (using our Live Demo or Live Experiment Data Space servers) which enables you to repeat this demo by doing the following (post registration/sign-up):
- Export some bookmarks from your local browser to the usual HTML bookmarks dump file
- Create an ODS-Bookmarks Instance using your new ODS account
- Use the ODS-Bookmark Instance to import your local bookmarks from the HTML dump file
- Repeat the same import sequence using the ODS-Bookmark Instance, but this time pick the del.icio.us option
- Build your query (change 'kidehen' to your ODS-user-name)
- That's it! You now have a Semantic Web presence in the form of a Data Space for your local and del.icio.us-hosted bookmarks, with tags integrated
Quick Query Builder Tip:
You will need to import the following (using the Import Button in the Ontologies & Schemas side-bar):
-
http://www.w3.org/1999/02/22-rdf-syntax-ns# (RDF)
-
http://rdfs.org/sioc/ns# (SIOC)
-
http://purl.org/dc/elements/1.1/ (Dublin Core)
-
http://www.w3.org/2004/02/skos/core# (SKOS)
Browser Support: The SPARQL QBE is SVG-based and currently works fine with the following browsers: Firefox 1.5/2.0, Camino (the Cocoa variant of Firefox for Mac OS X), WebKit (Safari pre-release / advanced sibling), and Opera 9.x. We are evaluating the use of the Adobe SVG plugin re. IE 6/7 support.
Of course this should be a screencast, but I am in the middle of a plethora of things right now :-)
|
12/07/2006 17:35 GMT
|
Modified:
12/13/2006 15:09 GMT
|
Virtuoso's SQL Schema to RDF Ontology Mapping Language (1.0)
[
Kingsley Uyi Idehen
]
A new technical white paper about our declarative language for SQL Schema to RDF Ontology Mapping has just been published.
What is this?
A declarative language adapted from SPARQL's graph pattern language (N3/Turtle) for mapping SQL Data to RDF Ontologies. We currently refer to this as a Graph Pattern based RDF VIEW Definition Language.
Why is it important?
It provides an effective mechanism for exposing existing SQL Data as virtual RDF Data Sets (Graphs), obviating the data duplication associated with generating physical RDF Graphs from SQL Data en route to persistence in a dedicated Triple Store.
Enterprise applications (traditional and web based) and most Web Applications (Web 1.0 and Web 2.0) sit atop relational databases, implying that SQL/RDF model and data integration is an essential element in the comprehension and adoption of the burgeoning "Data Web" (Semantic Web - Layer 1).
In a nutshell, this is a quick route for non-disruptive exposure of existing SQL Data to SPARQL-supporting RDF Tools and Development Environments.
How does it work?
RDF Side
- Locate one or more Ontologies (e.g. FOAF, SIOC, AtomOWL, SKOS) that effectively define the Concepts (Classes) and Terms (Predicates) to be exposed via your RDF Graph
- Using Virtuoso's RDF View Definition Language, declare an International Resource Identifier (IRI, or URI) for your Graph. Example:
CREATE GRAPH IRI("http://myopenlink.net/dataspace")
- Then create Classes (Concepts), Class Properties/Predicates (Memb), and Class Instances (Inst) for the new Graph. Example:
CREATE IRI CLASS odsWeblog:feed_iri "http://myopenlink.net/dataspace/kidehen/weblog/MyFeeds" (
in memb varchar not null, in inst varchar not null)
SQL Side
- If Virtuoso isn't your SQL Data Store, identify the ODBC or JDBC SQL data source(s) containing the SQL data to be mapped to RDF, and then link the relevant tables into Virtuoso's Virtual DBMS Layer
- Then use the RDF View Definition Language's graph pattern feature to generate the SQL to RDF Mapping Template for your Graph, as shown in this ODS Weblog -> AtomOWL Mapping example.
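Once the mapping is declared, the virtual Graph can be queried with SPARQL like any physical Graph; because the Graph is virtual, the triples are produced from the underlying SQL rows at query time rather than copied into a dedicated Triple Store. A minimal sketch, assuming the Graph IRI declared above and AtomOWL-style terms (the atom: prefix URI and the class/property names are illustrative assumptions, not necessarily the exact terms used in the ODS Weblog -> AtomOWL mapping):
# the prefix URI below is an illustrative placeholder
PREFIX atom: <http://atomowl.org/ontologies/atom#>
SELECT ?entry ?title
FROM <http://myopenlink.net/dataspace>
WHERE {
  ?entry a atom:Entry ;
         atom:title ?title .
}
LIMIT 10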
|
10/18/2006 18:18 GMT
|
Modified:
11/17/2006 18:24 GMT
|