v1.0 (Virtuoso 5.0) October 2007
Many past commentators on the Semantic Web have argued that its maturation has been slowed by the classic "chicken-and-egg" problem: a critical mass of RDF data is needed to stimulate the development of Semantic Web applications, yet without these applications that body of RDF data will not be created. In response to this need, a new class of tools emerged, so-called /RDFizers/, for transforming existing data into RDF.
Whether or not these concerns remain valid (indeed, many would argue that the Semantic Web is growing rapidly), RDFizers are crucial enablers for driving the transition from the traditional Document-Web to the emerging Semantic Data-Web.
One such RDFizer is the "Sponger". Introduced in _Virtuoso Universal Server 5.0_, the Sponger provides an as yet unrivalled set of tools for converting non-RDF data into RDF, packaged in an easily extensible framework, with tight integration to the Virtuoso RDF Quad Store. This whitepaper provides an in-depth description of these facilities.
Other facets of Virtuoso's Semantic Web related feature set are explored in the accompanying white papers "_RDF Views of SQL Data_" and "_Deploying RDF Linked Data via Virtuoso Universal Server_".
Change History
1.0 Initial draft (Kingsley Idehen / Carl Blakeley, 10 Oct 07)
Virtuoso 5.0 introduced the /Sponger/, built-in RDF middleware for transforming non-RDF data into RDF "on the fly". Its goal is to take non-RDF Web data sources as input, e.g. (X)HTML Web pages, (X)HTML Web pages hosting microformats, and even Web services such as those from Google, Del.icio.us, Flickr etc., and create RDF as output. The implication of this facility is that you can use non-RDF data sources as Semantic Web data sources. Architecturally, it comprises a number of Sponger Cartridges, each of which consists of Metadata Extractor and RDF Schema/Ontology Mapper components. Metadata extracted from non-RDF resources is used as the basis for generating structured data by mapping it to a suitable ontology.
The Sponger is highly customizable. Custom cartridges can be developed using any language supported by the Virtuoso Server Extensions API, enabling RDF instance data generation from resource types not covered by the default Sponger Cartridge collection, which is bundled as a Virtuoso VAD package (rdf_cartridges_dav.vad).

Figure 1: Virtuoso metadata extraction & RDF structured data generation
The Sponger delivers middleware that accelerates the bootstrapping of the Semantic Data Web by generating RDF Linked Data from non-RDF data sources, unobtrusively. This "Swiss army knife" for on-the-fly Linked Data generation provides a bridge between the traditional Document Web and the Semantic Data Web ("Data Web").
Sponging data from non-RDF Web sources and converting it to RDF exposes the data in a canonical form for querying and inference, and enables fast and easy linked data Mesh-ups as an enhancement of current Web 2.0 oriented Mash-ups. The key difference is that Mesh-ups are constructed from _Structured Data_ while Mash-ups are constructed from semi-structured or unstructured data sources.
Products that offer the kind of RDF extraction and instance data generation demonstrated by the Sponger are also commonly referred to as "RDFizers".
The Sponger can be invoked via the following mechanisms:
Virtuoso extends the SPARQL Query Language such that it is possible to download RDF resources from a given IRI, parse them, and then store the resulting triples in a graph, with all three operations performed during the SPARQL query-execution process. The IRI/URI of the graph used to store the triples is usually equal to the URL from which the resources are downloaded; consequently the feature is known as "IRI/URI dereferencing". If a SPARQL query instructs the SPARQL processor to retrieve the target graph into local storage, then the SPARQL sponger will be invoked.
The SPARQL extensions for IRI dereferencing are described below. Essentially these enable downloading and local storage of selected triples either from one or more named graphs, or based on a proximity search from a starting URI for entities matching the select criteria and also related by the specified predicates, up to a given depth. For full details please refer to the OpenLink Virtuoso Reference Manual, section "IRI Dereferencing".
Virtuoso extends the syntax of the SPARQL "FROM" and "FROM NAMED" clauses. It allows an additional list of options at the end of both clauses: option ( get:param1 value1, get:param2 value2, ... ), where the names of the allowed parameters are:
*Example:*
SELECT ?id
FROM NAMED <http://myhost/user1.ttl>
OPTION (get:soft "soft", get:method "GET")
FROM NAMED <http://myhost/user2.ttl>
OPTION (get:soft "soft", get:method "GET")
WHERE { GRAPH ?g { ?id a ?o } };
If a get:... parameter repeats for every FROM clause, it can be written as a global pragma; so the above query can be rewritten as:
DEFINE get:method "GET"
DEFINE get:soft "soft"
SELECT ?id
FROM NAMED <http://myhost/user1.ttl>
FROM NAMED <http://myhost/user2.ttl>
WHERE { GRAPH ?g { ?id a ?o } };
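The rewrite from per-clause options to global pragmas is mechanical, so a client that generates queries can apply it automatically. The Python sketch below illustrates the idea; the function name build_sponging_query and its arguments are hypothetical and not part of Virtuoso, and the sketch assumes every option applies to every FROM NAMED clause.

```python
def build_sponging_query(select_vars, graphs, where, get_options):
    """Return a SPARQL query whose shared get: options are written once,
    as DEFINE pragmas, instead of being repeated after every FROM NAMED."""
    defines = [f'DEFINE get:{name} "{value}"' for name, value in get_options.items()]
    froms = [f"FROM NAMED <{iri}>" for iri in graphs]
    return "\n".join(defines + [f"SELECT {select_vars}"] + froms + [f"WHERE {where}"])


query = build_sponging_query(
    "?id",
    ["http://myhost/user1.ttl", "http://myhost/user2.ttl"],
    "{ GRAPH ?g { ?id a ?o } }",
    {"method": "GET", "soft": "soft"},
)
print(query)
```

The printed text has the same shape as the pragma-based query above, without the trailing semicolon used in the SQL interface.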
In addition to the "define get:..." SPARQL extensions for IRI dereferencing in FROM clauses, Virtuoso supports dereferencing SPARQL IRIs used in the WHERE clause (graph patterns) of a SPARQL query via a set of "define input:grab-..." pragmas.
Consider an RDF resource which describes a member of a contact list, /user1/, and also contains statements about other users, /user2/ and /user3/, known to him. Resource /user3/ in turn contains statements about /user4/, and so on. If all the data relating to these users were loaded into Virtuoso's RDF database, the query to retrieve the details of all the users could be quite simple, e.g.:
SELECT ?id ?fullname ?email
WHERE { GRAPH ?g { ?id a <Person> ; <FullName> ?fullname ; <Email> ?email . }}
But what if some or all of these resources were not present in Virtuoso's quad store? The highly distributed nature of the Semantic Data Web makes it likely that these interlinked resources would be spread across several data spaces. Virtuoso's 'input:grab-...' extensions to SPARQL enable IRI dereferencing in such a way that all appropriate resources are loaded, i.e. "sponged", during query execution, even if some of the resources are not known beforehand. For any particular resource matched (and, if necessary, downloaded) by the query, it is possible to download related resources via one or more designated predicate paths to a specifiable depth, i.e. a number of 'hops', distance, or degrees of separation (in effect, computing Transitive Closures in SPARQL).
Using Virtuoso's 'input:grab-' pragmas to enable sponging, the above query might be recast to:
DEFINE input:grab-var "?more"
DEFINE input:grab-depth 10
DEFINE input:grab-limit 100
DEFINE input:grab-base-iri "http://myhost/"
SELECT ?id ?fullname ?email
WHERE {
GRAPH ?g {
?id a <Person> ;
<FullName> ?fullname ;
<Email> ?email .
OPTIONAL { ?id <SeeAlso> ?more }
}
};
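The interplay between a traversal depth and a resource limit can be pictured with a small sketch. The Python below is not Virtuoso's implementation: the function grab, the links dictionary and the user names are hypothetical, and it simply assumes that depth counts hops from the starting resources and that limit caps the number of resources loaded. The precise semantics of input:grab-depth and input:grab-limit should be taken from the Reference Manual.

```python
from collections import deque


def grab(start_iris, links, depth, limit):
    """Breadth-first walk over `links` (IRI -> list of related IRIs).

    Links are not followed from resources that are `depth` hops away
    from a starting IRI, and at most `limit` resources are loaded.
    """
    loaded = []
    seen = set(start_iris)
    queue = deque((iri, 0) for iri in start_iris)
    while queue and len(loaded) < limit:
        iri, hops = queue.popleft()
        loaded.append(iri)  # stands in for "download and store this resource"
        if hops == depth:
            continue
        for related in links.get(iri, []):
            if related not in seen:
                seen.add(related)
                queue.append((related, hops + 1))
    return loaded


# Hypothetical "see also" links between the contact-list resources.
links = {"user1": ["user2", "user3"], "user3": ["user4"], "user4": ["user5"]}
print(grab(["user1"], links, depth=2, limit=100))
print(grab(["user1"], links, depth=10, limit=2))
```

In this toy model a small depth stops the walk before distant resources are reached, while a small limit stops it once enough resources have been loaded, whichever comes first.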
A more advanced example, showing a designated predicate traversal path via the input:grab-seealso extension, is:
DEFINE input:grab-iri <http://dbpedia.org/resource/Munich>
DEFINE input:grab-depth 10
DEFINE input:grab-seealso <http://dbpedia.org/property/hasPhotoCollection>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT *
WHERE {<http://dbpedia.org/resource/Munich> foaf:depiction ?o}
A summary of the input:grab pragmas is given below. Again, for full details please refer to the Virtuoso Reference Manual.
Sponger functionality is also exposed via Virtuoso's "/proxy/rdf/" endpoint, as an in-built REST style Web service available in any Virtuoso standard installation. This web service takes a target URL and either returns the content "as is" or tries to transform (by sponging) to RDF. Thus, the proxy service can be used as a 'pipe' for RDF browsers to browse non-RDF sources.
The RDF proxy service takes the following URL parameters:
*Example:*
The URL below can be pasted into a traditional (X)HTML oriented document-web browser:
http://demo.openlinksw.com/proxy/rdf/?url=http://www.w3c.org/People/Connolly/&force=rdf
Notice that the URL of the data source (http://www.w3c.org/People/Connolly) is passed to the proxy in the query string, together with any Sponger options (force=rdf).
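When proxy URLs are generated by a program rather than typed by hand, it is safer to percent-encode the source URL so that any query string of its own is not confused with the proxy's parameters. The following Python sketch shows one way to do this with the standard library; the helper name proxy_url is hypothetical, and only the url and force parameters shown in the example above are assumed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def proxy_url(proxy_base, source_url, **options):
    """Build a /proxy/rdf/ request URL for `source_url` plus Sponger options."""
    params = {"url": source_url, **options}
    return proxy_base + "?" + urlencode(params)


request_url = proxy_url(
    "http://demo.openlinksw.com/proxy/rdf/",
    "http://www.w3c.org/People/Connolly/",
    force="rdf",
)
print(request_url)
# Decoding the query string recovers the original parameter values.
print(parse_qs(urlsplit(request_url).query))
```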
OpenLink currently provides two RDF client applications bundled as part of the OpenLink AJAX Toolkit: an RDF Browser and an interactive SPARQL query builder, iSPARQL. Both utilise sponging.
The OpenLink RDF Browser uses the /proxy/rdf/ service by default, running in 'soft' sponge mode.
iSPARQL uses the /sparql service and allows the user more control over sponging through five possible settings:
These settings are translated to IRI dereferencing pragmas on the server as follows:
| iSPARQL sponging setting | /sparql endpoint: "should-sponge" query parameter value | SPARQL processor directives |
| Get Local Data Only | N/A | N/A |
| Get Remote Data When Missing Locally | soft | define get:soft "soft" |
| Get All Remote Data | grab-all | define input:grab-all "yes" define input:grab-depth 5 define input:grab-limit 100 |
| Get All Remote Data & Related Data | grab-seealso | define input:grab-all "yes" define input:grab-depth 5 define input:grab-limit 200 define input:grab-seealso <http://www.w3.org/2000/01/rdf-schema#seeAlso> define input:grab-seealso <http://xmlns.com/foaf/0.1/seeAlso> define input:grab-seealso <http://www.w3.org/2000/01/rdf-schema#isDefinedBy> define input:grab-seealso <http://rdfs.org/sioc/ns#links_to> |
| Get Everything | grab-everything | define input:grab-all "yes" define input:grab-intermediate "yes" define input:grab-depth 5 define input:grab-limit 500 define input:grab-seealso <http://www.w3.org/2000/01/rdf-schema#seeAlso> define input:grab-seealso <http://xmlns.com/foaf/0.1/seeAlso> |
ODS-Briefcase is a component of OpenLink Data Spaces (ODS), a new generation distributed collaborative application platform for creating Semantic Web presence via Data Spaces derived from weblogs, wikis, feed aggregators, photo galleries, shared bookmarks, discussion forums and more. It is also a high level interface to the Virtuoso WebDAV repository.
ODS-Briefcase offers file-sharing functionality that includes the following features:
When resources or documents are put into the ODS Briefcase and are made publicly readable (via a Unix-style +r permission or ACL setting) and the resource in question is of a supported content type, metadata is automatically extracted at file upload time.
/*Note*: ODS-Briefcase automatically extracts metadata from a wide array of file formats./
The extracted metadata is available in two forms, pure WebDAV and RDF (with RDF/XML or N3/Turtle serialization options), and can optionally be synchronized with the underlying Virtuoso Quad Store.
All publicly readable resources in WebDAV have their owner, creation time, update time, size and tags published, plus associated content type dependent metadata. This WebDAV metadata is also available in RDF form as a SPARQL queryable graph accessible via the SPARQL protocol endpoint using the WebDAV location as the RDF data set URI (graph or data source URI).
You can also use a special RDF_Sink folder to automate the process of uploading RDF resource files into the Virtuoso Quad Store via WebDAV or raw HTTP. The properties of the special folder control whether sponging (RDFization) occurs. By default, this feature is enabled across all Virtuoso and ODS installations (with an ODS-Briefcase Data Space instance enabled).
Raw HTTP Example using CURL:
Username: demo
Password: demo
Source File: wine.rdf
Destination Folder: http://demo.openlinksw.com/DAV/home/demo/rdf_sink/
Content Type: application/rdf+xml
$ curl -v -T wine.rdf -H content-type:application/rdf+xml http://demo.openlinksw.com/DAV/home/demo/rdf_sink/ -u demo:demo
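For clients that cannot shell out to curl, an equivalent upload request can be prepared programmatically. The Python sketch below only builds the request and does not send it; the helper name build_upload_request is hypothetical, and the use of HTTP Basic authentication is an assumption (the server may be configured for a different scheme). As with curl's -T option and a URL ending in "/", the file name is appended to the folder URL.

```python
import base64
import urllib.request


def build_upload_request(folder_url, filename, payload, user, password):
    """Prepare (but do not send) an HTTP PUT of an RDF/XML file into a folder."""
    request = urllib.request.Request(folder_url + filename, data=payload, method="PUT")
    request.add_header("Content-Type", "application/rdf+xml")
    # Assumption: the server accepts HTTP Basic credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request


request = build_upload_request(
    "http://demo.openlinksw.com/DAV/home/demo/rdf_sink/",
    "wine.rdf",
    b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>",
    "demo",
    "demo",
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would perform the actual upload.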
Finally, you can also get RDF data into Virtuoso's Quad Store via WebDAV using the Virtuoso Web Crawler utility (configurable via the Virtuoso Conductor UI). This feature also provides the ability to enable or disable Sponging, as depicted below in Figure 2.
As the Sponger and ODS-Briefcase both extract structured data, what is the relationship between these two facilities?
The principal difference between the two is that the Sponger is an /RDF data crawler & generator/, whereas Briefcase's structured data extractor is a WebDAV resource filter. The Briefcase structured data extractor is aimed at providing RDF data from WebDAV resources. Thus, if none of the available Sponger cartridges is able to extract metadata and produce RDF structured data, the Sponger calls upon the Briefcase extractor as the last resort in the RDF structured data generation pipeline.

Sponger cartridges are invoked through a cartridge hook which provides a Virtuoso PL entry point to the packaged functionality. Should you wish to utilize the Sponger from your own Virtuoso PL procedures, you can do so by calling these hook routines directly. Full details of the hook function prototype and how to define your own cartridges are presented later in this document.
The generated RDF-based structured data (RDF) can be consumed in a number of ways, depending on whether or not the data is persisted in Virtuoso's RDF Quad Store.
If the data is persisted, it can be queried through the Virtuoso SPARQL endpoint associated with any Virtuoso instance: /sparql. The RDF is exposed in a graph typically identified using a URL matching the source resource URL from which the RDF data was generated. Naturally, any SQL query can also access this, since SPARQL can be freely intermixed with SQL via Virtuoso's SPASQL (SPARQL inside SQL) functionality. RDF data is also accessible through Virtuoso's implementation of the URIQA protocol.
If not persisted, as is the case with the RDF Proxy Service, the data can be consumed by an RDF aware Web client, e.g. an RDF browser such as the OpenLink RDF Browser.
When an RDF aware client requests data from a network accessible resource via the Sponger, the following events occur:
Depending on the file or format type detected at ingest, the Sponger applies the appropriate metadata extractor. Detection occurs at the time of content negotiation instigated by the retrieval user agent. The normal metadata extraction pipeline processing is as follows:
RDF generation is done on the fly either using built-in XSLT processors, or in the case of GRDDL, the associated XSLT (exposed via Profile URIs) and local or remote XSLT processors. The RDF generation performed by the Mapping Pipeline is based on an internal mapping table which associates the source data's type with schemas and ontologies. This mapping will vary depending on whether you are using Virtuoso with or without the ODS layer. If the ODS application layer (meaning the ODS-Framework and the ODS-Briefcase Data Space application at the very least) is present, the Sponger performs additional mapping using SIOC, SKOS, FOAF, AtomOWL, Annotea bookmarks, Annotea annotations, EXIF, and other ontologies depending on the source data.
The number of ontologies handled by the Sponger is being increased constantly. To identify which ontologies are supported, view the Conductor's RDF Cartridges configuration panel as described later. For details of how to determine the full ontology set supported by Briefcase, refer to Appendix A.
ODS has its own built-in cartridges for the SIOC ontology, which it uses as a data space "glue" ontology. SIOC provides a generic data model of containers, items, item types, and associations between items. The actual classes defined by SIOC include: User, UserGroup, Role, Site, Forum and Post. A separate SIOC types module (sioc-t) extends the SIOC Core ontology by defining additional superclasses, subclasses and subproperties to the original SIOC terms. Subclasses include: AddressBook, BookmarkFolder, Briefcase, EventCalendar, ImageGallery, Wiki, Weblog and BlogPost, plus many others. Superclasses include: Container (a generic container of Items) and Space (Data Spaces). Within this generic model, SIOC permits the use of other ontologies (FOAF etc.) in describing attributes of SIOC entities that provide sound conceptual partitioning of data spaces that expose RDF Linked Data. Thus, it's safe to say that SIOC delivers a generic wrapper, or "glue", ontology for integrating structured RDF data from a myriad of heterogeneous web accessible data sources.
All the data containers (briefcases, blogs, wikis, discussions etc.) maintained by the various ODS application realms (Data Spaces) describe and expose their data as SIOC instance data. The ODS SIOC Reference Guide details the SIOC mappings for each ODS application component (ODS-Framework, ODS-Weblog, ODS-Briefcase, ODS-Feed-Manager, ODS-Wiki, ODS-Mail, ODS-Calendar, ODS-Bookmark-Manager, ODS-Gallery, ODS-Polls, ODS-Addressbook, ODS-Discussion and ODS-Community). Example SPARQL queries for interacting with the SIOC instance data are also shown. In the context of the Sponger, the SIOC mappings used by ODS-Briefcase are some of the most powerful aspects of ODS as a whole (i.e. delivering a platform independent and web architecture based variant of Mac OS X's Spotlight functionality).
When the Proxy Service is invoked by a user agent, the Sponger caches the imported data in temporary Virtuoso storage. The cache's invalidation rules conform to those of traditional Web browsers. The data expiration time is determined based on subsequent data fetches of the same resource. The first data retrieval records the 'expires' header. On subsequent fetches, the current time is compared to the expiration time stored in the local cache. If HTTP 'expires' header data isn't returned by the source data server, the Sponger will derive its own expiration time by evaluating the 'date' header and 'last-modified' HTTP headers. The cache can be forcefully cleared using the SPARQL extensions get:soft "replace" or get:soft "replacing", as described earlier in the section "SPARQL Extensions for IRI Dereferencing".
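The expiration logic described above can be sketched in a few lines of Python. This is an illustration only, not the Sponger's code: the function names are hypothetical, and the rule used when no 'expires' header is present (treat the resource as fresh for 10% of the interval between 'last-modified' and 'date') is an assumed heuristic of the kind commonly used by Web caches, since the exact formula is not given here.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def expiry_time(headers, heuristic_fraction=0.1):
    """Return the expiration time implied by a response's HTTP headers, or None."""
    if "Expires" in headers:
        return parsedate_to_datetime(headers["Expires"])
    if "Date" in headers and "Last-Modified" in headers:
        served = parsedate_to_datetime(headers["Date"])
        modified = parsedate_to_datetime(headers["Last-Modified"])
        # Assumed heuristic: stay fresh for a fraction of the document's age.
        return served + (served - modified) * heuristic_fraction
    return None


def is_fresh(headers, now):
    """True if a cached copy recorded with `headers` may still be served at `now`."""
    expires = expiry_time(headers)
    return expires is not None and now < expires


# Headers recorded on the first fetch; no 'Expires' header was returned.
headers = {
    "Date": "Wed, 10 Oct 2007 12:00:00 GMT",
    "Last-Modified": "Sun, 30 Sep 2007 12:00:00 GMT",
}
fetched = datetime(2007, 10, 10, 12, 0, tzinfo=timezone.utc)
print(is_fresh(headers, fetched + timedelta(hours=12)))
print(is_fresh(headers, fetched + timedelta(days=2)))
```

On each subsequent fetch the current time is compared with the stored expiration time, and the resource is sponged again only when the cached copy is no longer fresh.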
As described earlier and illustrated below, the Sponger comprises cartridges, each of which consists of metadata extractors and ontology mappers.
A cartridge is invoked through its cartridge hook, a Virtuoso PL procedure entry point and binding to the cartridge's metadata extractor and ontology mapper.
Metadata extractors perform the initial data extraction operations against data sources that include: (X)HTML documents, XML based syndication formats (RSS, Atom, OPML, OCS etc.), binary files, REST style Web services and Microformats (non GRDDL, GRDDL, eRDF, and RDFa). Each metadata extractor is aligned to at least one ontology mapper.
Metadata extractors are built using Virtuoso PL, C/C++, Java or any other external language supported by Virtuoso's Server Extension API. Virtuoso's own metadata extractors are written in Virtuoso PL. Third party extractors can be harnessed through the external language support, examples being XMP and Spotlight (both C/C++ based), Aperture (Java based), and SIMILE RDFizers (also Java based).
Sponger ontology mappers perform the task of generating RDF instance data from extracted metadata (non-RDF) using ontologies associated with a given data source type. They are typically XSLT (using GRDDL or an in-built Virtuoso mapping scheme) or Virtuoso PL based. Virtuoso comes preconfigured with a large range of ontology mappers contained in one or more Sponger cartridges. Nevertheless, you are free to create and add your own cartridges, ontology mappers, or metadata extractors.
Figure 3: Sponger architecture
Below is an extract from the stylesheet /DAV/VAD/rdf_cartridges/xslt/flickr2rdf.xsl, used for extracting metadata from Flickr images. Here, the template combines RDF metadata extraction and ontology mapping based on the FOAF and Dublin Core ontologies.
<xsl:template match="owner">
<rdf:Description rdf:nodeID="person">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/#Person" />
<xsl:if test="@realname != ''">
<foaf:name><xsl:value-of select="@realname"/></foaf:name>
</xsl:if>
<foaf:nick><xsl:value-of select="@username"/></foaf:nick>
</rdf:Description>
</xsl:template>
<xsl:template match="photo">
<rdf:Description rdf:about="{$baseUri}">
<rdf:type rdf:resource="http://www.w3.org/2003/12/exif/ns/IFD"/>
<xsl:variable name="lic" select="@license"/>
<dc:creator rdf:nodeID="person" />
...
Once a Sponger cartridge has been developed it must be plugged into the SPARQL engine by registering it in the Cartridge Registry, i.e. by adding a record in the table DB.DBA.SYS_RDF_CARTRIDGES, either manually via DML, or more easily through Conductor (Virtuoso's browser-based administration console), which provides a UI for adding your own cartridges. Sponger configuration using Conductor is described in detail later. For the moment, we'll focus on outlining the broad architecture of the Sponger.
The SYS_RDF_CARTRIDGES table definition is as follows:
create table DB.DBA.SYS_RDF_CARTRIDGES (
  RC_ID integer identity,         -- cartridge ID, designates the order of execution
  RC_PATTERN varchar,             -- a REGEX pattern to match URL or MIME type
  RC_TYPE varchar default 'MIME', -- which property of the current resource to match: MIME or URL are supported at present
  RC_HOOK varchar,                -- fully qualified PL function name, e.g. DB.DBA.MY_CARTRIDGE_FUNCTION
  RC_KEY long varchar,            -- API specific key to use
  RC_DESCRIPTION long varchar,    -- cartridge description, free text
  RC_ENABLED integer default 1,   -- a flag (0 or 1) to exclude or include the given cartridge in the processing chain
  primary key (RC_TYPE, RC_PATTERN)
);
The Virtuoso SPARQL processor supports IRI dereferencing via the Sponger. Thus, if the SPARQL query contains references to non-default graph URIs the Sponger goes out (via HTTP) to grab the RDF data sources exposed by the data source URIs and then places them into local storage (as Default or Named Graphs depending on the SPARQL query). Since SPARQL is RDF based, it can only process RDF-based structured data, serialized using RDF/XML, Turtle or N3 formats. As a result, when the SPARQL processor encounters a non-RDF data source, a call to the Sponger is triggered. The Sponger then locates the appropriate cartridge for the data source type in question, resulting in the production of SPARQL-palatable RDF instance data. If none of the registered cartridges are capable of handling the received content type, the Sponger will attempt to obtain RDF instance data via the in-built WebDAV metadata extractor.
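The way the registry drives cartridge selection can be modelled with a short Python sketch. It is a simplified illustration under stated assumptions, not the Sponger's actual dispatch code: the Cartridge class mirrors a few columns of the registry table, hooks are reduced to two-argument callables, and it is assumed that a hook returning 0 causes the next matching cartridge to be tried before the WebDAV metadata extractor is used as the fallback.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cartridge:
    rc_id: int                          # order of execution
    rc_pattern: str                     # regex matched against the MIME type or URL
    rc_type: str                        # "MIME" or "URL"
    rc_hook: Callable[[str, str], int]  # returns 1 on success, 0 otherwise
    rc_enabled: int = 1


def dispatch(url, mime_type, content, cartridges, fallback) -> Optional[str]:
    """Try enabled cartridges in RC_ID order; fall back if none succeeds."""
    for cart in sorted(cartridges, key=lambda c: c.rc_id):
        if not cart.rc_enabled:
            continue
        target = mime_type if cart.rc_type == "MIME" else url
        if re.search(cart.rc_pattern, target) and cart.rc_hook(url, content):
            return f"cartridge {cart.rc_id}"
    return "fallback" if fallback(url, content) else None


# Hypothetical registry: a text cartridge and a URL-matched cartridge that fails.
registry = [
    Cartridge(2000, r"(text/plain)", "MIME", lambda url, body: 1),
    Cartridge(10, r"flickr\.com", "URL", lambda url, body: 0),
]
print(dispatch("http://example.org/notes.txt", "text/plain", "hello", registry, lambda u, b: 1))
```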
Sponger cartridges are invoked during the aforementioned pipeline as follows:
When the SPARQL processor dereferences a URI, it plays the role of an HTTP user agent (client) that makes a content type specific request to an HTTP server via the HTTP request's Accept headers. The following then occurs:

Figure 4: Sponger cartridge invocation flowchart
The Virtuoso Conductor provides a graphical UI for most Virtuoso administration tasks, including interfaces for managing Sponger Cartridges.
The VAD (Virtuoso Application Distribution) package rdf_cartridges_dav bundles a variety of pre-built cartridges for generating RDF instance data from a large range of popular Web resources and file types. Appendix B provides full details of the VAD's contents. The cartridges installed by the VAD can be viewed and configured through Conductor's /RDF Cartridges/ pane, shown below.

Figure 5: Conductor's RDF Cartridges pane
Earlier we outlined the structured data generation pipeline in which the search sequence for possible sources of metadata is controlled by the RDF cartridge ordering. This ordering can be configured through the Conductor UI, as shown. The order in which cartridges are tried is reflected in the 'Seq#' values.
Among the various entry fields are fields for the cartridge hook function and the URL/MIME-type pattern, corresponding to the RC_HOOK and RC_PATTERN columns of the SYS_RDF_CARTRIDGES table.

Figure 6: Flickr cartridge configuration settings
The RDF Cartridges VAD package includes a number of XSLT templates, all located in the folder /DAV/VAD/rdf_cartridges/xslt/. All the available templates can be viewed through Virtuoso's WebDAV browser, as illustrated below.

Figure 7: RDF Cartridges VAD package - XSLT templates
Some of the XSLT templates contained in /DAV/VAD/rdf_cartridges/xslt/ are GRDDL filters.
The GRDDL filters can be configured through the GRDDL Mappings panel in Conductor, shown below.
The URI for stylesheets stored in a Virtuoso WebDAV repository takes the form virt://WS.WS.SYS_DAV_RES.RES_FULL_PATH.RES_CONTENT:

Figure 8: RDF Cartridges VAD package - GRDDL filters
The Sponger is fully extensible by virtue of its pluggable Cartridge architecture. New data formats can be sponged by creating new cartridges. While OpenLink is actively adding cartridges for new data sources, you are obviously free to develop your own custom cartridges. To this end, details of the cartridge hook and example cartridge implementations are presented below.
Every Virtuoso PL hook function used to plug a custom Sponger cartridge into the Virtuoso SPARQL engine must have a parameter list with the following parameters (the names of the parameters are not important, but their order and presence are):
*in graph_iri varchar:* the IRI of the graph currently being retrieved
*in new_origin_uri varchar:* the URL of the document retrieved
*in destination varchar:* the destination graph IRI
*inout content any:* the content of the document retrieved by the Sponger
*inout async_queue any:* if the PingService initialization parameter has been configured in the [SPARQL] section of the virtuoso.ini file, this is a pre-allocated asynchronous queue to be used to call the PingTheSemanticWeb notification service
*inout ping_service any:* the URL of a ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini configuration file. This argument could be used to notify the PingTheSemanticWeb notification service
*inout api_key any:* a string value specific to a given cartridge, contained in the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table. The value can be a single string or a serialized array of strings providing cartridge specific data.
In our first example (which is available in the form of an on-line tutorial) we implement a basic cartridge, which maps the MIME type text/plain to an imaginary ontology that extends the class Document from FOAF with the properties 'txt:UniqueWords' and 'txt:Chars', where the prefix 'txt:' is specified as 'urn:txt:v0.0:'.
use DB;
create procedure DB.DBA.RDF_LOAD_TXT_META
(
in graph_iri varchar,
in new_origin_uri varchar,
in dest varchar,
inout ret_body any,
inout aq any,
inout ps any,
inout ser_key any
)
{
declare words, chars int;
declare vtb, arr, subj, ses, str any;
-- if any error we just say nothing can be done
declare exit handler for sqlstate '*'
{
return 0;
};
subj := coalesce (dest, new_origin_uri);
vtb := vt_batch ();
chars := length (ret_body);
-- using the text index procedures we get a list of words
vt_batch_feed (vtb, ret_body, 1);
arr := vt_batch_strings_array (vtb);
-- the list has 'word' and positions array, so we must divide by 2
words := length (arr) / 2;
ses := string_output ();
-- we compose a N3 literal
http (sprintf ('<%s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .\n', subj), ses);
http (sprintf ('<%s> <urn:txt:v0.0:UniqueWords> "%d" .\n', subj, words), ses);
http (sprintf ('<%s> <urn:txt:v0.0:Chars> "%d" .\n', subj, chars), ses);
str := string_output_string (ses);
-- we push the N3 text into the local store
DB.DBA.TTLP (str, new_origin_uri, subj);
return 1;
};
delete from DB.DBA.SYS_RDF_CARTRIDGES where RC_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';
insert soft DB.DBA.SYS_RDF_CARTRIDGES (RC_PATTERN, RC_TYPE, RC_HOOK, RC_KEY, RC_DESCRIPTION) values ('(text/plain)', 'MIME', 'DB.DBA.RDF_LOAD_TXT_META', null, 'Text Files (demo)');
-- here we set the order to some large number so we don't break existing cartridges
update DB.DBA.SYS_RDF_CARTRIDGES set RC_ID = 2000 where RC_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';
To test the cartridge you can use the /sparql endpoint with the option 'Retrieve remote RDF data for all missing source graphs' to execute:
select * from <URL-of-a-txt-file> where { ?s ?p ?o }
Notice in this example the use of DB.DBA.TTLP( ) to load the extracted structured data into the Virtuoso Quad Store. This RDF data import function parses TTL (TURTLE or N3) and inserts the triples into the table DB.DBA.RDF_QUAD, one of the key tables underpinning the Quad Store. For further details of Virtuoso's RDF and SPARQL API, please refer to the OpenLink Virtuoso Reference Manual.
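To make the shape of the generated data concrete, the following Python sketch mimics the mapping step of the text cartridge outside Virtuoso. It is only an approximation: the function name text_to_n3 is hypothetical, and a simple regular expression stands in for Virtuoso's text-index word splitting, so word counts may differ from those produced by vt_batch_feed( ).

```python
import re


def text_to_n3(subject, body):
    """Describe a plain-text document as three N3 statements."""
    # Assumption: a "word" is a run of word characters, compared case-insensitively.
    unique_words = len(set(re.findall(r"\w+", body.lower())))
    chars = len(body)
    return (
        f"<{subject}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
        f"<http://xmlns.com/foaf/0.1/Document> .\n"
        f'<{subject}> <urn:txt:v0.0:UniqueWords> "{unique_words}" .\n'
        f'<{subject}> <urn:txt:v0.0:Chars> "{chars}" .\n'
    )


print(text_to_n3("http://example.org/notes.txt", "the cat and the hat"))
```

Text of this form is what the cartridge passes to DB.DBA.TTLP( ) for loading into the Quad Store.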
The next example shows the Virtuoso/PL procedure RDF_LOAD_FLICKR_IMG at the heart of Virtuoso's Flickr Sponger cartridge:
--no_c_escapes-
create procedure DB.DBA.RDF_LOAD_FLICKR_IMG (
in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
inout _ret_body any, inout aq any, inout ps any, inout _key any,
inout opts any)
{
declare xd, xt, url, tmp, api_key, img_id, hdr, exif any;
declare exit handler for sqlstate '*'
{
return 0;
};
tmp := sprintf_inverse (new_origin_uri,
'http://farm%s.static.flickr.com/%s/%s_%s.%s', 0);
img_id := tmp[2];
api_key := _key;
--cfg_item_value (virtuoso_ini_path (), 'SPARQL', 'FlickrAPIkey');
if (tmp is null or length (tmp) <> 5 or not isstring (api_key))
return 0;
url := sprintf
('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s', img_id, api_key);
tmp := http_get (url, hdr);
if (hdr[0] not like 'HTTP/1._ 200 %')
signal ('22023', trim(hdr[0], '\r\n'), 'RDFXX');
xd := xtree_doc (tmp);
exif := xtree_doc ('<rsp/>');
{
declare exit handler for sqlstate '*' { goto ende; };
url := sprintf ('http://api.flickr.com/services/rest/?method=flickr.photos.getExif&photo_id=%s&api_key=%s', img_id, api_key);
tmp := http_get (url, hdr);
if (hdr[0] like 'HTTP/1._ 200 %')
exif := xtree_doc (tmp);
ende:;
}
xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl', xd,
vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif));
xd := serialize_to_UTF8_xml (xt);
DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
return 1;
}
Here the http_get( ) function calls the Flickr REST API to retrieve an XML description of the specified image, which is then parsed into an in-memory XML parse tree (an XML entity) by xtree_doc( ). Using the xslt( ) function with the stylesheet flickr2rdf.xsl, the XML entity is transformed into RDF/XML, which is in turn parsed by RDF_LOAD_RDFXML( ) and the extracted triples loaded into the Virtuoso Quad Store.
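The first step of the procedure, recovering the photo identifier from the image URL with sprintf_inverse( ), may be easier to follow with an equivalent pattern match. The Python sketch below approximates the format string with a regular expression; the regular expression, the helper name and the sample URL are illustrative assumptions rather than part of the cartridge.

```python
import re

# Approximates 'http://farm%s.static.flickr.com/%s/%s_%s.%s' from the procedure.
FLICKR_STATIC = re.compile(
    r"^http://farm([^.]+)\.static\.flickr\.com/([^/]+)/([^_]+)_([^.]+)\.(.+)$"
)


def parse_flickr_static_url(url):
    """Return the five URL components as a list, or None if the URL does not match."""
    match = FLICKR_STATIC.match(url)
    return list(match.groups()) if match else None


parts = parse_flickr_static_url("http://farm1.static.flickr.com/123/456789_abcdef.jpg")
print(parts)
print("photo id:", parts[2])
```

The third component (index 2) is the photo identifier, which is why the procedure reads tmp[2] and gives up when the URL does not yield exactly five components.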
In order to allow the Sponger to update the local RDF quad store with triples constituting the sponged structured data, the role "SPARQL_UPDATE" must be granted to the account "SPARQL". This should normally be the case. If not, you must manually grant this permission. As with most Virtuoso DBA tasks, the Conductor provides the simplest means of doing this.
The Sponger supports pluggable "Custom Resolver" cartridges in order to support the dereferencing of other forms of URIs besides HTTP URLs, such as URN schemes. The handle-based DOI naming scheme, the URN naming scheme, and the URN-based LSID scheme are examples of schemes served by custom resolvers.
By supporting alternate resolvers the range of data sources which can be linked into the Semantic Data-Web is extended enormously. The LSID resolver enables URN-based resources to be accessible as linked data. Similarly, the DOI resolver permits the huge collection of DOI-based data sources to be linked into the Web of Linked Data (Data Web).
An example SPARQL query dereferencing a URN-based URI is shown below:
http://demo.openlinksw.com/sparql?default-graph-uri=urn:lsid:ubio.org:namebank:11815&should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html&debug=on
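A request of this kind can also be assembled programmatically, which takes care of escaping the query text. The Python sketch below uses only the standard library; the helper name sparql_sponge_url is hypothetical, and only the parameters visible in the URL above (default-graph-uri, should-sponge, query and format) are assumed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sparql_sponge_url(endpoint, graph_uri, query, should_sponge="soft", result_format="text/html"):
    """Build a /sparql request URL that asks the endpoint to sponge `graph_uri`."""
    params = {
        "default-graph-uri": graph_uri,
        "should-sponge": should_sponge,
        "query": query,
        "format": result_format,
    }
    return endpoint + "?" + urlencode(params)


url = sparql_sponge_url(
    "http://demo.openlinksw.com/sparql",
    "urn:lsid:ubio.org:namebank:11815",
    "SELECT * WHERE {?s ?p ?o}",
)
print(url)
# Decoding the query string recovers the original parameter values.
print(parse_qs(urlsplit(url).query))
```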
As one would expect, the RDF Proxy Service also recognizes URNs. For example:
http://demo.openlinksw.com/proxy/rdf/?url=urn:lsid:ubio.org:namebank:11815&force=rdf
The file http://www.ivan-herman.net/foaf.html contains a short profile of the W3C Semantic Web Activity Lead Ivan Herman. This XHTML file contains RDF embedded as RDFa. Running the file through the Sponger via Virtuoso's RDF proxy service extracts the embedded FOAF data as pure RDF, as can be seen by pasting the URL
http://demo.openlinksw.com/proxy/rdf/?url=http://www.ivan-herman.net/foaf.html&force=rdf
into a Web browser and then viewing the resulting page source. Though this example demonstrates the action of the /proxy/rdf/ service quite transparently, it is a basic and unwieldy way to view sponged RDF data. OpenLink's RDF Browser provides a more polished means to the same end. Indeed, the RDF Browser makes use of the same proxy service.
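The proxy service can also be called from a script. The following Python sketch builds the proxy URL from the two parameters shown above (url and force) and, if called, fetches the returned RDF. The function names are hypothetical, and the fetch assumes that the demo server is reachable and still exposes /proxy/rdf/ at this address.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_proxy_url(proxy_base: str, source_url: str) -> str:
    """Return the RDF proxy URL that sponges source_url and forces RDF output."""
    return proxy_base + "?" + urlencode({"url": source_url, "force": "rdf"})


def fetch_sponged_rdf(proxy_base: str, source_url: str, timeout: float = 30.0) -> str:
    """Fetch the RDF produced by the proxy (performs a network request)."""
    with urlopen(build_proxy_url(proxy_base, source_url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


proxy_url = build_proxy_url(
    "http://demo.openlinksw.com/proxy/rdf/",
    "http://www.ivan-herman.net/foaf.html",
)
```

Only build_proxy_url runs when the block is loaded; fetch_sponged_rdf is left for the reader to call against a live server.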
As an alternative to using the RDF proxy service, we can sponge directly from within the SPARQL processor. After logging into Virtuoso's Conductor interface, the following query can be issued from the Interactive SQL (iSQL) panel:
sparql
define get:uri "http://www.ivan-herman.net/foaf.html"
define get:soft "soft"
select * from <http://mygraph> where {?s ?p ?o}
Here the sparql keyword invokes the SPARQL processor from the SQL interface and the RDF data sponged from page http://www.ivan-herman.net/foaf.html is loaded into the local RDF quad store as graph http://mygraph .
The new graph can then be queried using the basic SPARQL client normally available in a default Virtuoso installation at http://localhost:8890/sparql/. e.g.:
select * from <http://mygraph> where {?s ?p ?o}
(A much richer interactive SPARQL query builder, iSPARQL, is available as part of the OpenLink AJAX Toolkit (OAT), together with the OpenLink RDF Browser.)
The Virtuoso/PL code for a simple custom cartridge, DB.DBA.RDF_LOAD_TXT_META, was presented earlier. Included in the code was the SQL required to register the cartridge in the Cartridge Registry. Paste the whole of this code into Conductor's iSQL interface and execute it to define and register the cartridge.
Create a simple text document with a .txt extension. This must now be made Web accessible. A simple way to do this is to expose it as a WebDAV resource using Virtuoso's built-in WebDAV support. Log in to Virtuoso's ODS Briefcase application, navigate to your Public folder and upload your text document, ensuring that the file extension is .txt, the MIME type is set to text/plain and the file permissions are rw-r--r--. If, for the purposes of this example, you logged into a local default Virtuoso instance as user 'dba' and uploaded a file named 'ODS_sponger_test.txt', the file would be Web accessible via the URL http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt.
To sponge the document using the RDF_LOAD_TXT_META cartridge, use the basic SPARQL client available at http://localhost:8890/sparql to execute the query
select * from <http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt>
where {?s ?p ?o}
with the option 'Retrieve remote RDF data for all missing source graphs' set. The returned result set should look something like:
| s | p | o |
| http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://xmlns.com/foaf/0.1/Document |
| http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt | urn:txt:v0.0:UniqueWords | 7 |
| http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt | urn:txt:v0.0:Chars | 44 |
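As a rough illustration of how a result set of this shape could arise, the Python sketch below produces three comparable triples for a piece of text. It is not the cartridge's Virtuoso/PL code. The counting rules are assumptions (words are runs of word characters compared case-insensitively, and characters are counted with len), so the figures for a given file may differ from what the cartridge reports.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_DOCUMENT = "http://xmlns.com/foaf/0.1/Document"


def text_metadata_triples(doc_uri: str, text: str):
    """Return (subject, predicate, object) tuples describing a text document."""
    # Assumed rule: a word is a run of word characters, compared case-insensitively.
    words = re.findall(r"\w+", text.lower())
    return [
        (doc_uri, RDF_TYPE, FOAF_DOCUMENT),
        (doc_uri, "urn:txt:v0.0:UniqueWords", len(set(words))),
        (doc_uri, "urn:txt:v0.0:Chars", len(text)),
    ]


triples = text_metadata_triples(
    "http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt",
    "the cat sat on the mat",
)
```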
The full range of ontologies and mappings supported by the ODS-Briefcase metadata extractor is reflected in the contents of the Virtuoso directory DAV/VAD/oDrive/schemas/ (e.g. for a local Virtuoso instance, this would be http://localhost:8890/DAV/VAD/oDrive/schemas/).
The schema directory is browsed easily using the Conductor WebDAV Browser.

The schema files packaged in Briefcase cover both standard and custom ontologies. The standard ontologies include FOAF, OpenDocument, RSS, XBEL, Apple Spotlight and vCard. Others are proprietary OpenLink ontologies for describing file types and content.
Below is a partial listing of one of these files, Office.rdf, which defines the proprietary Office ontology used by Virtuoso for mapping Microsoft Office documents to RDF structured data.
<?xml version="1.0" encoding="UTF-8"?>
<!--
 -
 -  $Id: Office.rdf,v 1.4 2007/05/10 08:51:53 ddimitrov Exp $
 -
 -  This file is part of the OpenLink Software Virtuoso Open-Source (VOS)
 -  project.
 -
 -  Copyright (C) 1998-2007 OpenLink Software
 -
 ...
 -
-->
<rdf:RDF
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl ="http://www.w3.org/2002/07/owl#"
    xmlns:virtrdf="http://www.openlinksw.com/schemas/virtrdf#"
    xml:base="http://www.openlinksw.com/schemas/Office#">
  <owl:Ontology rdf:about="http://www.openlinksw.com/schemas/Office#">
    <rdfs:label>Microsoft Office document</rdfs:label>
    <rdfs:comment>The Microsoft Office format general attributes.</rdfs:comment>
    <virtrdf:catName>Office Documents (Microsoft)</virtrdf:catName>
    <virtrdf:version>1.00</virtrdf:version>
  </owl:Ontology>
  <rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#TypeDescr">
    <rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
    <virtrdf:cardinality>single</virtrdf:cardinality>
    <virtrdf:label>Document Type</virtrdf:label>
    <virtrdf:catName>Document Type</virtrdf:catName>
  </rdf:Property>
  <rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Author">
    <rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
    <virtrdf:cardinality>single</virtrdf:cardinality>
    <virtrdf:label>Author</virtrdf:label>
    <virtrdf:defaultValue>No name</virtrdf:defaultValue>
    <virtrdf:catName>Author</virtrdf:catName>
  </rdf:Property>
  <rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#LastAuthor">
    <rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
    <virtrdf:cardinality>single</virtrdf:cardinality>
    <virtrdf:label>Last Author</virtrdf:label>
    <virtrdf:defaultValue>No name</virtrdf:defaultValue>
    <virtrdf:catName>Last Author</virtrdf:catName>
  </rdf:Property>
  <rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Company">
    <rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
    <virtrdf:cardinality>single</virtrdf:cardinality>
    <virtrdf:label>Company</virtrdf:label>
    <virtrdf:defaultValue>No name</virtrdf:defaultValue>
    <virtrdf:catName>Company</virtrdf:catName>
  </rdf:Property>
  <rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Words">
    <rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#integer"/>
    <virtrdf:cardinality>single</virtrdf:cardinality>
    <virtrdf:label>Word Count</virtrdf:label>
  </rdf:Property>
  ...
  <rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Created">
    <rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
    <virtrdf:cardinality>single</virtrdf:cardinality>
    <virtrdf:label>Date Created</virtrdf:label>
    <virtrdf:catName>Created</virtrdf:catName>
  </rdf:Property>
</rdf:RDF>
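A schema file of this kind is ordinary RDF/XML and can be inspected with any XML parser. The Python sketch below, which is not part of Virtuoso, lists each property's local name, its virtrdf:label and the local name of its range. It runs against an abridged copy of two of the properties above and looks for rdfs:Range with the capitalisation used in the listing. The function name and the sample constant are illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
VIRTRDF = "http://www.openlinksw.com/schemas/virtrdf#"

# Abridged sample based on the Office.rdf listing (two properties only).
SAMPLE = """<rdf:RDF
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:virtrdf="http://www.openlinksw.com/schemas/virtrdf#">
  <rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Author">
    <rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
    <virtrdf:label>Author</virtrdf:label>
  </rdf:Property>
  <rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Words">
    <rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#integer"/>
    <virtrdf:label>Word Count</virtrdf:label>
  </rdf:Property>
</rdf:RDF>"""


def list_properties(rdf_xml: str):
    """Return (local name, label, range local name) for each rdf:Property."""
    root = ET.fromstring(rdf_xml)
    rows = []
    for prop in root.findall(f"{{{RDF}}}Property"):
        about = prop.get(f"{{{RDF}}}about", "")
        label = prop.findtext(f"{{{VIRTRDF}}}label", default="")
        range_el = prop.find(f"{{{RDFS}}}Range")
        range_uri = range_el.get(f"{{{RDF}}}resource", "") if range_el is not None else ""
        # Keep only the fragment after '#' for readability.
        rows.append((about.rsplit("#", 1)[-1], label, range_uri.rsplit("#", 1)[-1]))
    return rows


properties = list_properties(SAMPLE)
```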
Virtuoso supplies cartridges for extracting RDF data from certain popular Web resources and file types in the form of the VAD package rdf_cartridges_dav. If not already present, it can be installed using Conductor or the VAD_INSTALL function. Please refer to the Virtuoso Reference Manual for detailed information on VADs and VAD management.
Details of each of the cartridges contained in the RDF Cartridges VAD are given below.
A cartridge for mapping HTTP request and response messages to the HTTP vocabulary expressed in RDF, as defined by http://www.w3.org/2006/http.rdfs .
This cartridge is disabled by default. If it is enabled, it must be first in the order of execution. The cartridge hook function always returns 0, allowing other cartridges to return additional RDF instance data.
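The effect of a hook that always returns 0 can be pictured with a small simulation. The Python sketch below is a model for illustration, not Sponger code: it runs hooks in order and, as a simplifying assumption, treats any non-zero return value as ending the chain. All function names and the example triples are hypothetical, and the Sponger's real return-code handling may be richer than this.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
Hook = Callable[[str, List[Triple]], int]


def run_cartridges(hooks: List[Hook], url: str) -> List[Triple]:
    """Call each hook in order; stop after the first one that returns non-zero."""
    triples: List[Triple] = []
    for hook in hooks:
        if hook(url, triples) != 0:
            break
    return triples


def http_in_rdf_hook(url: str, triples: List[Triple]) -> int:
    # Contributes data but returns 0, so later cartridges still run.
    triples.append((url, "urn:example:httpStatus", "200"))
    return 0


def html_hook(url: str, triples: List[Triple]) -> int:
    # Contributes data and returns 1, which ends the chain in this model.
    triples.append((url, "urn:example:title", "Example page"))
    return 1


def late_hook(url: str, triples: List[Triple]) -> int:
    # Never reached in the run below, because html_hook returned 1.
    triples.append((url, "urn:example:unreached", "x"))
    return 1


result = run_cartridges([http_in_rdf_hook, html_hook, late_hook], "http://example.org/")
```

In this model the result holds the triples from the first two hooks only, which mirrors why a cartridge that returns 0 must run first if its output is to be combined with that of the others.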
This is a composite cartridge for discovering in HTML pages metadata embedded in a variety of forms. The cartridge looks for RDF data in the order listed below:
A Sponger cartridge for Flickr images, using the Flickr REST API. To function properly it must have a configured key. The Flickr cartridge generates RDF instance data using: CC license, Dublin Core, Dublin Core Metadata Terms, GeoURL, FOAF and EXIF (ontology definition: http://www.w3.org/2003/12/exif/ns/ ).
A cartridge for Amazon articles using the Amazon REST API. It needs an Amazon API key in order to be functional.
Uses eBay's REST API to generate RDF from eBay articles. A key and user name must be configured for it to work.
OpenOffice documents contain metadata which can be extracted using UNZIP, so this cartridge needs the Virtuoso UNZIP plugin to be configured on the server. (Each OpenOffice file is actually a collection of XML documents stored in a ZIP archive.)
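The ZIP-based layout can be demonstrated outside Virtuoso. The Python sketch below reads the meta.xml member of an OpenDocument package with the standard zipfile module and pulls out a few common fields. It illustrates the file format only and does not reproduce how the cartridge or the UNZIP plugin works. The function names are hypothetical, and the sample package is a synthetic stand-in built in memory rather than a complete, valid document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}

FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "initial_creator": "meta:initial-creator",
    "created": "meta:creation-date",
}


def read_odf_metadata(source) -> dict:
    """Return selected metadata fields from the meta.xml of an ODF package."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    meta = root.find("office:meta", NS)
    if meta is None:
        return {}
    found = {}
    for key, path in FIELDS.items():
        value = meta.findtext(path, namespaces=NS)
        if value is not None:
            found[key] = value
    return found


def build_sample_package() -> io.BytesIO:
    """Build a tiny in-memory ZIP holding only a meta.xml (illustrative only)."""
    meta_xml = (
        '<office:document-meta xmlns:office="%(office)s" xmlns:dc="%(dc)s" '
        'xmlns:meta="%(meta)s"><office:meta>'
        "<dc:title>Sponger notes</dc:title>"
        "<meta:initial-creator>A. Author</meta:initial-creator>"
        "</office:meta></office:document-meta>" % NS
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", meta_xml)
    buffer.seek(0)
    return buffer


metadata = read_odf_metadata(build_sample_package())
```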
Transforms Yahoo traffic data to RDF.
Transforms iCalendar files to RDF as per http://www.w3.org/2002/12/cal/ical# .
Unknown binary content, PDF files and MS PowerPoint files can be transformed to RDF using the Aperture framework. This cartridge needs Virtuoso with Java hosting support, the Aperture framework and the MetaExtractor.class installed on the host system in order to work. For details of how to configure the Aperture framework see Appendix C.
To set up Virtuoso to host and run the Aperture framework, follow these steps:
JavaClasspath = lib/sesame-2.0-alpha-3.jar:lib/openrdf-util-crazy-debug.jar:lib/htmlparser-1.6.jar:lib/activation-1.0.2-upd2.jar:lib/bcmail-jdk14-132.jar:lib/poi-scratchpad-3.0-alpha2-20060616.jar:lib/openrdf-model-2.0-alpha-3.jar:lib/jacob-1.10-pre4.jar:lib/bcprov-jdk14-132.jar:lib/demork-2.0.jar:lib/commons-codec.jar:lib/fontbox-0.1.0-dev.jar:lib/pdfbox-0.7.3.jar:lib/applewrapper-0.1.jar:lib/junit-3.8.1.jar:lib/winlaf-0.5.1.jar:lib/aperture-test-2006.1-alpha-3.jar:lib/openrdf-util-fixed-locking.jar:lib/commons-logging-1.1.jar:lib/mail-1.4.jar:lib/aperture-2006.1-alpha-3.jar:lib/poi-3.0-alpha2-20060616.jar:lib/ical4j-cvs20061019.jar:lib/openrdf-util-2.0-alpha-3.jar:lib/rio-2.0-alpha-3.jar:lib/poi-contrib-3.0-alpha2-20060616.jar:lib/aperture-examples-2006.1-alpha-3.jar:.
SQL> DB.DBA.import_jar (NULL, 'MetaExtractor', 1);
Done. -- 466 msec.
SQL> select "MetaExtractor"().getMetaFromFile ('some_pdf_in_server_working_dir.pdf', 5);
... some RDF data should be returned ...
Important: The installation guidelines presented above have been verified on Linux with aperture-2006.1-alpha-3. Some adjustment may be needed for different operating systems or versions of Aperture.
Throughout this document and in the latest Virtuoso releases, the term "cartridge" is used to identify the pluggable Sponger components through which non-RDF data is transformed to RDF, by way of metadata-extraction and ontology-mapping. Earlier releases of Virtuoso used the term "mapper" in place of "cartridge". The table below lists the components affected by this change in nomenclature, indicating the new and old component names.
| Component New Name | Component Old Name | Component Type |
| RDF Cartridges VAD | RDF Mappers VAD | VAD label |
| /DAV/VAD/rdf_cartridges | /DAV/VAD/rdf_mappers | WebDAV path |
| RDF Cartridges | RDF Mappers | Conductor configuration panel |
| rdf_cartridges_dav.vad | rdf_mappers_dav.vad | VAD package |
| _rdf_cartridges_path_ | _rdf_mappers_path_ | Registry entry |
| SYS_RDF_CARTRIDGES | SYS_RDF_MAPPERS | DBMS table |
| RC_xxx | RM_xxx | Column prefix in SYS_RDF_CARTRIDGES / SYS_RDF_MAPPERS table |
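Scripts written against an earlier release may still use the old identifiers. As a convenience, the Python sketch below rewrites the table, path, package, registry and column-prefix names from the table above. It is a hypothetical helper, not a tool shipped with Virtuoso, and a plain textual substitution like this should be reviewed by hand, since it cannot tell whether a match is really a Sponger identifier. The column name RM_HOOK in the usage line stands in for the RM_xxx pattern.

```python
import re

# Ordered (pattern, replacement) pairs derived from the renaming table.
_RENAMES = [
    (re.compile(r"SYS_RDF_MAPPERS"), "SYS_RDF_CARTRIDGES"),
    # Covers /DAV/VAD/rdf_mappers, rdf_mappers_dav.vad and the registry entry.
    (re.compile(r"rdf_mappers"), "rdf_cartridges"),
    # Column prefix: RM_xxx becomes RC_xxx, matched only at the start of a word.
    (re.compile(r"\bRM_(?=[A-Z])"), "RC_"),
]


def modernize(script: str) -> str:
    """Replace legacy 'mapper' identifiers with their 'cartridge' equivalents."""
    for pattern, replacement in _RENAMES:
        script = pattern.sub(replacement, script)
    return script


updated = modernize("select RM_HOOK from DB.DBA.SYS_RDF_MAPPERS")
```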
Aperture - a Java framework for extracting and querying full-text content and metadata from various information systems (file systems, web sites, mail boxes etc) and file formats (documents, images etc).
CNRI Handle System - provides unique persistent identifiers (handles) for Internet resources. It is a general purpose distributed information system providing identifier and resolution services through a namespace and an open set of protocols which allow handles to be resolved into the information necessary to locate, access and use the resources they identify.
Data Spaces - points of presence on the web for accessing structured data gleaned from a variety of heterogeneous data sources.
DOI - a digital object identifier. A location-independent, permanent document or digital resource identifier, based on the CNRI Handle System, which does not change, even if the resource is relocated. DOIs are resolved through the DOI resolver.
eRDF - HTML Embeddable RDF. A technique for embedding a subset of RDF into (X)HTML.
hCard - a microformat for publishing the contact details of people, companies, organizations, and places, in (X)HTML, Atom, RSS, or arbitrary XML.
geoURL - a location-to-URL reverse directory allowing you to find URLs by their proximity to a given location.
GRDDL - Gleaning Resource Descriptions from Dialects of Languages - a mechanism for extracting RDF data from XML and XHTML documents using transformation algorithms typically represented in XSLT. The transformation algorithms may be explicitly associated using a link element in the head of the document, or held in an associated metadata profile document or namespace document.
LSID - a life science identifier. A URN-based identifier for a piece of Web-based biological information. LSIDs occupy one namespace (urn:lsid) in the URN naming scheme.
microformats - markup that allows the expression of semantics in an HTML (or XHTML) web page. Programs can extract meaning from a standard web page that is marked up with microformats.
Ning - an online platform for creating social networks and websites.
ODS - OpenLink Data Spaces. A new generation distributed collaborative application platform for creating Semantic Web presence via Data Spaces derived from: weblogs, wikis, feed aggregators, photo galleries, shared bookmarks, discussion forums, and more.
PingTheSemanticWeb - a repository for RDF documents. You can notify this service that you have created or updated an RDF document on your web site, or you can import a list of recently created or updated RDF documents.
RDF browser - a piece of technology that enables you to browse RDF data sources by traversing data links. The key difference between this approach and traditional browsing is that RDF data links are typed (they possess inherent meaning and context) whereas traditional HTML links are untyped. There are a number of RDF Browsers currently available, including Tabulator, DISCO and OpenLink's RDF Browser, which is a component of OAT (OpenLink Ajax Toolkit).
Spotlight - a file system metadata extraction and search facility in Mac OS X.
structured data - data organized into semantic chunks or entities, with similar entities grouped together in relations or classes. (Michael Bergman provides an in-depth discussion of current Semantic Web terminology, and proposals for bringing more clarity to this area, in his post More Structure, More Terminology and (hopefully) More Clarity.)
URIQA - The URI Query Agent Model. A model for interacting with Semantic Web enabled web servers. It introduces new HTTP methods to indicate to a web server that, for a given resource URI, it should return a concise bounded description of that resource rather than a representation of it.
URN - Uniform Resource Name. A form of Uniform Resource Identifier (URI) which uniquely identifies a resource but which, unlike a Uniform Resource Locator (URL), does not specify its location.
VAD - Virtuoso Application Distribution. A packaging and distribution system for extending Virtuoso. A VAD encapsulates the components of a self-contained Virtuoso application, including table creation, default data, stored procedures, web services, and content. VADs are easily installed through Virtuoso's Conductor browser interface.