The recent Wikipedia imbroglio centered around
DBpedia is the fundamental driver for this
particular post. At the time of writing,
the DBpedia project definition in Wikipedia
remains unsatisfactory due to the following shortcomings:
- inaccurate and incomplete definition of the Project's What,
Why, Who, Where, When, and How
- inaccurate reflection of the project's essence, skewing focus
towards data extraction and data set dump
production, which is at best a quarter of the project.
Here are some insights on DBpedia, from the perspective of
someone intimately involved with the other three-quarters of the
project.
What is DBpedia?
A live, Web-accessible RDF model database (Quad
Store) derived from periodic snapshots of Wikipedia
content. The RDF database underlies a Linked Data Space comprising HTML (and, most recently,
HTML+RDFa) based data-browser pages and a SPARQL endpoint.
Note: DBpedia 3.4 now exists in snapshot
(warehouse) and Live Editions (currently being hot-staged).
This post is about the snapshot (warehouse) edition; I'll publish a
separate post about the DBpedia Live Edition, where a new
Delta Engine covers both extraction and database record
replacement in real time.
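To make the SPARQL endpoint concrete, here is a minimal sketch, in Python's standard library only, of how a query reaches the public endpoint at http://dbpedia.org/sparql. The property URI used in the query is illustrative; actual DBpedia property names vary by release.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint described above.
DBPEDIA_SPARQL = "http://dbpedia.org/sparql"

def build_query_url(query, fmt="application/sparql-results+json"):
    """Build a GET URL that asks the endpoint for serialized results."""
    return DBPEDIA_SPARQL + "?" + urlencode({"query": query, "format": fmt})

# Ask for the English abstract of the Berlin resource (illustrative
# property URI; check the release's ontology for the real one).
query = """SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/abstract> ?abstract .
  FILTER (lang(?abstract) = "en")
}"""

url = build_query_url(query)
# urllib.request.urlopen(url) would then fetch the results over HTTP.
```

Any HTTP-capable client can issue the same request; the endpoint is just a Web service.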
When was it Created?
As an idea under the moniker "DBpedia", it was conceptualized in
late 2006 by researchers at the University of Leipzig (led by Sören
Auer) and Freie Universität Berlin (led by Chris Bizer). The first public instance of
DBpedia (as described above) was released in February 2007. The
official DBpedia coming-out party occurred at WWW2007 in Banff,
during the inaugural Linked Data gathering, where it
showcased the virtues and immense potential of TimBL's Linked Data meme.
Who's Behind It?
OpenLink Software (developers of OpenLink
Virtuoso and providers of Web hosting
infrastructure), the University of Leipzig, and Freie Universität
Berlin. In addition, there is a burgeoning community of
collaborators and contributors responsible for DBpedia-based
applications, cross-linked data sets, ontologies (OpenCyc,
SUMO, UMBEL, and YAGO), and other utilities. Finally, DBpedia
wouldn't be possible without the global content contribution and
curation efforts of Wikipedians, a point typically overlooked
(albeit inadvertently).
How is it Constructed?
The steps are as follows:
- RDF data set dump preparation via Wikipedia content extraction
and transformation to RDF model data, using the N3 data
representation format - Java and PHP
extraction code produced and maintained by the teams at Leipzig and
Berlin
- Deployment of Linked Data that enables data browsing and
exploration using any HTTP-aware user agent (e.g., basic Web
browsers) - handled by OpenLink Virtuoso (Berlin handled this via
the Pubby Linked Data Server during the early months of the DBpedia
project)
- SPARQL-compliant Quad Store, enabling direct access to database
records via SPARQL (query language, REST or SOAP Web service, plus
a variety of query-results serialization formats) - OpenLink
Virtuoso since the first public release of DBpedia
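The first step above can be sketched in miniature. This is a toy illustration of turning infobox-style key/value pairs from a Wikipedia page into N-Triples; the real extractors are the Java and PHP code maintained by the Leipzig and Berlin teams and are far richer, and the property namespace below is illustrative.

```python
def infobox_to_ntriples(page_title, pairs):
    """Render a dict of infobox fields as N-Triples lines (toy sketch)."""
    subject = f"<http://dbpedia.org/resource/{page_title}>"
    lines = []
    for key, value in pairs.items():
        # Illustrative predicate namespace; real extraction maps
        # infobox keys onto curated ontology properties.
        predicate = f"<http://dbpedia.org/property/{key}>"
        lines.append(f'{subject} {predicate} "{value}" .')
    return "\n".join(lines)

nt = infobox_to_ntriples("Berlin",
                         {"country": "Germany", "population": "3431700"})
```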
In a nutshell, there are four distinct and vital components to
DBpedia. Thus, DBpedia doesn't exist if all the project offered was
a collection of RDF data dumps. Likewise, it doesn't exist if you
have a SPARQL-compliant Quad Store without loaded data sets, and of
course it doesn't exist if you have a fully loaded SPARQL-compliant
Quad Store that isn't up to the cocktail of challenges presented by
live Web accessibility.
Why is it Important?
It remains a live exemplar for any individual or organization
seeking to publish or exploit HTTP-based Linked Data on the
World Wide Web. Its existence continues to
stimulate growth in both the density and quality of the burgeoning
Web of Linked Data.
How Do I Use it?
In the most basic sense, simply browse the HTML pages en route
to discovering the relationships that exist across named entities and subject
matter concepts / headings. Beyond that, look at DBpedia
as a master lookup table in a Web-hosted distributed database setup, enabling you to
mesh your local domain-specific details with DBpedia records via
structured relations (triples, or 3-tuple records) comprised of
HTTP URIs from both realms, e.g., owl:sameAs relations.
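The meshing idea can be sketched as a single owl:sameAs triple linking a local record to its DBpedia counterpart; the example.com URI below is hypothetical.

```python
# owl:sameAs, the standard OWL property for asserting that two URIs
# name the same thing.
OWL_SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"

def same_as_triple(local_uri, dbpedia_uri):
    """Render an owl:sameAs link between two HTTP URIs as an N-Triple."""
    return f"<{local_uri}> {OWL_SAME_AS} <{dbpedia_uri}> ."

# Hypothetical local URI meshed with the DBpedia record for Berlin.
link = same_as_triple("http://example.com/data#Berlin",
                      "http://dbpedia.org/resource/Berlin")
```

Loaded alongside your local data, triples like this let a SPARQL query traverse from your records into DBpedia's, which is the master-details pattern described above.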
What Can I Use it For?
Expanding on the Master-Details point above, you can use its
rich URI corpus to alleviate tedium associated
with activities such as:
- List maintenance - e.g., Countries, States, Companies, Units of
Measurement, Subject Headings etc.
- Tagging - as a complement to existing practices
- Analytical Research - you're only a LINK (URI) away from
erstwhile difficult to attain research data spread across a broad
range of topics
- Closed Vocabulary Construction - rather than commence the
futile quest of building your own closed vocabulary, simply
leverage Wikipedia's human-curated vocabulary as your common
base.
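The list-maintenance use case, for example, boils down to one query. Here is a sketch of a SPARQL query (carried in a Python string) that pulls country labels from DBpedia; the class URI is illustrative and may differ across DBpedia releases.

```python
# Illustrative query for the "list maintenance" use case: fetch a
# list of countries with English labels from DBpedia.
COUNTRY_LIST_QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?country ?label WHERE {
  ?country a <http://dbpedia.org/ontology/Country> ;
           rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 100"""
```

Submitted to the SPARQL endpoint, a query like this replaces a hand-maintained list of countries with one that tracks Wikipedia's curation.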