Data Spaces and Web of Databases
Note: this is an updated version of a previously unpublished blog post.
Continuing from our recent podcast conversation, Jon Udell offers further insight into the essence of our conversation via a "Strategic Developer" column article titled: Accessing the web of databases. Below, I present an initial dump of a DataSpace FAQ that hopefully sheds light on the DataSpace vision espoused during my podcast conversation with Jon.
What is a DataSpace?
A moniker for Web-accessible atomic containers that manage and
expose Data, Information, Services, Processes, and Knowledge.
What would you typically find in a Data Space? Examples
include:
- Raw Data - SQL, HTML, XML (raw), XHTML, RDF, etc.
- Information (Data In Context) - XHTML (various microformats), Blog Posts (in RSS, Atom, RSS-RDF formats), Subscription Lists (OPML, OCS, etc.), Social Networks (FOAF, XFN, etc.), and many other forms of applied XML.
- Web Services (Application/Service Logic) - REST- or SOAP-based invocation of application logic for context-sensitive and controlled data access and manipulation.
- Persisted Knowledge - Information in actionable context that is also available in transient or persistent forms expressed using a Graph Data Model. A modern knowledgebase would more than likely have RDF as its Data Language, RDFS as its Schema Language, and OWL as its Domain Definition (Ontology) Language. Actual Domain, Schema, and Instance Data would be serialized using formats such as RDF/XML, N3, Turtle, etc. (a minimal sketch follows this list).
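To make that last bullet concrete, here is a minimal sketch, assuming Python with the rdflib library; the vocabulary and resource names are invented purely for illustration. It builds a tiny instance-data graph and emits it in the serialization formats just mentioned:

    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    EX = Namespace("http://example.org/")        # hypothetical vocabulary
    g = Graph()
    g.bind("ex", EX)

    post = URIRef("http://example.org/posts/1")  # hypothetical instance
    g.add((post, RDF.type, EX.BlogPost))
    g.add((post, EX.title, Literal("Data Spaces and Web of Databases")))

    print(g.serialize(format="turtle"))          # Turtle
    print(g.serialize(format="n3"))              # N3
    print(g.serialize(format="xml"))             # RDF/XML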
How do Data Spaces and Databases differ?
Data Spaces are fundamentally problem-domain-specific database applications. They offer functionality that you would instinctively expect of a database (e.g. ACID-compliant data management) with the additional benefit of being data model and query language agnostic. Data Spaces are for the most part DBMS Engine and Data Access Middleware hybrids, in the sense that ownership and control of data is inherently loosely coupled.
How do Data Spaces and Content Management Systems differ?
Data Spaces are inherently more flexible: they support multiple data models and data representation formats. Content management systems do not possess the same degree of data model and data representation dexterity.
How do Data Spaces and Knowledgebases differ?
A Data Space cannot dictate the perception of its content. For instance, what I may consider knowledge relative to my Data Space may not be the case for a remote client that interacts with it from a distance. Thus, defining my Data Space purely as a Knowledgebase introduces constraints that reduce its broader effectiveness for third-party clients (applications, services, users, etc.). A Knowledgebase is based on a Graph Data Model, resulting in significant impedance for clients that are built around alternative models. To reiterate, Data Spaces support multiple data models.
What Architectural Components make up a Data Space?
- ORDBMS Engine - for Data Modeling agility (via complex, purpose-specific data types and data access methods), Data Atomicity, Data Consistency, Transaction Isolation, and Durability (aka ACID).
- Virtual Database Engine - for creating a single view of, and access point to, heterogeneous SQL, XML, Free Text, and other data. This is all about Virtualization at the Data Access Level.
- Web Services Platform - enabling controlled access and manipulation (via application, service, or protocol logic) of Virtualized or Disparate Data. This layer handles the decoupling of functionality from monolithic wholes for function-specific invocation via Web Services using either the SOAP or REST approach.
Where do Data Spaces fit into the Web's rapid evolution?
They are an essential part of the burgeoning Data Web / Semantic
Web. In short, they will take us from data “Mash-ups” (combining
web accessible data that exists without integration and repurposing
in mind) to “Mesh-ups” (combining web accessible data that exists
with integration and repurposing in mind).
Where can I see a DataSpace along the lines described in action?
Just look at my blog, and take the journey as follows:
What about other Data Spaces?
There are several, and I will attempt to categorize them by the query methods they offer:
Type 1 (Free Text Search over HTTP):
Google, MSN, Yahoo!, Amazon, eBay, and most Web 2.0 plays.
Type 2 (Free Text Search and XQuery/XPath over HTTP):
A few blogs and Wikis (Jon Udell's and a few others)
Type 3 (RDF Data Sets and SPARQL Queryable):
Type 4 (Generic Free Text Search, OpenSearch, GData, XQuery/XPath, and SPARQL):
Points of Semantic Web presence such as the Data Spaces at:
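For the SPARQL-queryable types (3 and 4), interaction is plain HTTP. Here is a minimal sketch in Python; the endpoint URL and the vocabulary are hypothetical stand-ins for whatever a given Data Space actually exposes:

    import urllib.parse
    import urllib.request

    ENDPOINT = "http://example.org/sparql"   # hypothetical SPARQL endpoint
    query = """
    SELECT ?post ?title
    WHERE { ?post a <http://example.org/BlogPost> ;
                  <http://purl.org/dc/elements/1.1/title> ?title }
    LIMIT 10
    """
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))   # SPARQL results as JSON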
What About Data Space aware tools?
- OpenLink Ajax Toolkit - provides JavaScript control-level binding to Query Services such as XMLA for SQL, GData for Free Text, OpenSearch for Free Text, and SPARQL for RDF, in addition to service-specific Web Services (Web 2.0 hosted solutions that expose service-specific APIs)
- Semantic Radar - a Firefox Extension
- PingTheSemanticWeb - the Semantic Web's equivalent of Web 2.0's weblogs.com
- PiggyBank - a Firefox Extension
08/28/2006 19:38 GMT-0500 | Modified: 09/04/2006 18:58 GMT-0500
The WWW Proposal and RDF: Then and Now (circa 1999)
I've just re-read an article penned by Dan Brickley in 1999 titled: The WWW Proposal and RDF: Then and Now, that retains its prescience to this very day. Ironically, I stumbled across this timeless piece while revisiting the RSS name imbroglio that gave us a simple syndication format (RSS 2.0) that will ultimately implode (IMHO), since "Simple" is ultimately short-lived when dealing with attention-challenged end-users who are always assumed to be dumb when in fact they are simply ambivalent.
I was compelled to go back to the RSS 2.0 imbroglio when I came across Dave Winer's comments re. "the SEC attempting to reinvent RSS 2.0..." in response to Jon Udell's recent XBRL article.
Although I don't believe in complex entry points into complex
technology realms, I do subscribe to the approach where developers
deal with the complexity associated with a problem domain while
hiding said complexity from ambivalent end-users via coherent
interfaces -- which does not always imply User Interface.
XBRL is a great piece of work that addresses the complex problem domain of Financial Reporting. The only thing it's missing right now is an Ontology that facilitates an RDF Data Model based rendition of XBRL Schema and Instance Data, which would ultimately make XBRL data available to RDF query languages such as SPARQL. This line of thought implies, for instance, an XML Schema to OWL Ontology Mapping for Schema Data (as explained in a white paper by the VSIS Group at the University of Hamburg), leaving the Instance Data to be generated in a myriad of ways that includes XML to RDF and/or XML->SQL->RDF.
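To make the instance-data leg concrete, here is a minimal sketch of the XML-to-RDF direction, assuming Python with rdflib; the XML fragment and the vocabulary are invented for illustration and bear no relation to the real XBRL taxonomies:

    import xml.etree.ElementTree as ET
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    # Invented XBRL-like instance fragment, purely for illustration.
    xml_doc = """<report>
      <fact concept="Revenue" unit="USD">1000000</fact>
      <fact concept="NetIncome" unit="USD">150000</fact>
    </report>"""

    EX = Namespace("http://example.org/xbrl/")   # hypothetical vocabulary
    g = Graph()
    for i, fact in enumerate(ET.fromstring(xml_doc)):
        node = URIRef(f"http://example.org/report/fact{i}")
        g.add((node, RDF.type, EX[fact.attrib["concept"]]))
        g.add((node, EX.unit, Literal(fact.attrib["unit"])))
        g.add((node, EX.value, Literal(fact.text)))
    print(g.serialize(format="turtle"))          # now queryable via SPARQL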
As I stated in an earlier post: we should not mistake ambivalence for lack of intelligence. Assuming "Simple" is always right at all times is another way of subscribing to this profound misconception. You know, assuming the world was flat (as opposed to geoid) was quite palatable at some point in the history of mankind; I wonder what would have happened if we had held on to this point of view to this day because of its "Simplicity"?
08/28/2006 06:20 GMT-0500 | Modified: 09/30/2006 16:27 GMT-0500
OpenLink Ajax Toolkit (OAT) 1.0 Released
We have finally released the 1.0 edition of OAT.
OAT offers a broad JavaScript-based, browser-independent widget set for building data-source-independent rich internet applications that are usable across a wide range of Ajax-capable web browsers.
OAT supports binding to the following data sources via its Ajax Database Connectivity Layer:
- SQL Data via XML for Analysis (XMLA)
- Web Data via SPARQL, GData, and OpenSearch Query Services
- Web Services specific Data via service-specific binding to SOAP and REST style web services
The toolkit includes a collection of powerful rich internet application prototypes, including: SQL Query By Example, Visual Database Modeling, and a Data-bound Web Form Designer.
Project homepage on sourceforge.net:
http://sourceforge.net/projects/oat
Source Code:
http://sourceforge.net/projects/oat/files
Live demonstration:
http://www.openlinksw.com/oat/
08/08/2006 22:11 GMT-0500 | Modified: 08/09/2006 05:12 GMT-0500
Intermediate RDF Bulk Loading (Wikipedia & Wordnet) Experiment Results
Orri shares his findings from internal experimentation re. Virtuoso and bulk loading RDF content such as the Wikipedia3 and Wordnet Data Sets. Here is a dump of the post titled: Intermediate RDF Loading Results:
Following from the post about a new
Multithreaded RDF Loader, here are some intermediate results
and action plans based on my findings.
The experiments were made on a dual 1.6GHz Sun SPARC with 4G RAM
and 2 SCSI disks. The data sets were the 48M triple Wikipedia data
set and the 1.9M triple Wordnet data set. 100% CPU means one CPU
constantly active. 100% disk means one thread blocked on the read
system call at all times.
Starting with an empty database, loading the Wikipedia set took
315 minutes, amounting to about 2500 triples per second. After
this, loading the Wordnet data set with cold cache and 48M triples
already in the table took 4 minutes 12 seconds, amounting to 6838
triples per second. Loading the Wikipedia data had CPU usage up to
180% but over the whole run CPU usage was around 50% with disk I/O
around 170%. Loading the larger data set was significantly I/O
bound while loading the smaller set was more CPU bound, yet was not
at full 200% CPU.
The RDF quad table was indexed on GSPO and PGOS. As one would expect, the bulk of I/O was on the PGOS index. We note that the pages of this index were on average only 60% full. Thus the most relevant optimization seems to be to fill the pages closer to 90%. This will directly cut about a third of all I/O, plus it will have an additional windfall benefit in the form of better disk cache hit rates resulting from a smaller database.
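The "third of all I/O" figure follows directly from the fill factors: page count scales inversely with average page fill, so going from 60% to 90% full leaves 0.60/0.90 = 2/3 of the pages. A quick sanity check in Python (the rows-per-page figure is an arbitrary assumption; the ratio does not depend on it):

    # Pages needed scale inversely with the average fill factor.
    triples = 48_000_000        # Wikipedia data set size from the post
    rows_per_full_page = 100    # assumed for illustration only
    pages_at_60 = triples / (rows_per_full_page * 0.60)
    pages_at_90 = triples / (rows_per_full_page * 0.90)
    print(1 - pages_at_90 / pages_at_60)   # ~0.333, about a third fewer pages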
The most practical way of having full index pages in the case of
unpredictable random insert order will be to take sets of adjacent
index leaf pages and compact the rows so that the last page of the
set goes empty. Since this is basically an I/O optimization, this
should be done when preparing to write the pages to disk, hence
concerning mostly old dirty pages. Insert and update times will not
be affected since these operations will not concern themselves with
compaction. Thus the CPU cost of background compaction will be
negligible in comparison with writing the pages to disk. Naturally
this will benefit any relational application as well as free text
indexing. RDF and free text will be the largest beneficiaries due
to the large numbers of short rows inserted in random order.
Looking at the CPU usage of the tests, locating the place in the index where to insert, which by rights should be the bulk of the time cost, was not very significant, only about 15%. Thus there are many unused possibilities for optimization, for example rewriting some parts of the loader, currently done as stored procedures, in C. Also, the thread usage of the loader, with one thread parsing and mapping IRI strings to IRI IDs and 6 threads sharing the inserting, could be refined for better balance, as we have noted that the parser thread sometimes forms a bottleneck. Doing the updating of the IRI name to IRI ID mapping on the insert thread pool would produce some benefit. (A sketch of this thread layout follows below.)
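For readers who want the shape of that loader in code, here is a minimal sketch of the one-parser / N-inserter layout in Python; the parsing step and the insert target are stand-ins, not Virtuoso internals:

    import queue
    import threading

    NUM_INSERTERS = 6           # matches the six insert threads described above
    work: queue.Queue = queue.Queue(maxsize=10_000)
    store = []                  # stand-in for the RDF quad table
    store_lock = threading.Lock()

    def parser(lines):
        """Single parser thread: tokenize and enqueue (the reported bottleneck)."""
        for line in lines:
            work.put(tuple(line.split()))  # stand-in for real N-Triples parsing
        for _ in range(NUM_INSERTERS):
            work.put(None)                 # one stop signal per insert thread

    def inserter():
        while (item := work.get()) is not None:
            with store_lock:
                store.append(item)         # stand-in for the real index insert

    threads = [threading.Thread(target=inserter) for _ in range(NUM_INSERTERS)]
    for t in threads:
        t.start()
    parser(["<s1> <p> <o1> .", "<s2> <p> <o2> ."])
    for t in threads:
        t.join()
    print(len(store), "triples loaded")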
Anyway, since the most important test was I/O bound, we will
first implement some background index compaction and then revisit
the experiment. We expect to be able to double the throughput of
the Wikipedia data set loading.
07/18/2006 15:21 GMT-0500 | Modified: 07/18/2006 14:28 GMT-0500
More Thoughts on ORDBMS Clients, ADO.NET vNext, and RDF
Additional commentary from Orri Erling re. ORDBMS, ADO.NET vNext, and RDF (in relation to Semantic Web Objects):
More Thoughts on ORDBMS Clients, .NET and RDF:
Continuing on from the previous post... If Microsoft
opens the right interfaces for independent developers, we see many
exciting possibilities for using ADO .NET 3 with Virtuoso.
Microsoft quite explicitly states that their thrust is to
decouple the client side representation of data as .NET objects
from the relational schema on the database. This is a worthy
goal.
But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: towards object-oriented databases and towards making applications for the Semantic Web.
In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going, as it were, in the other direction from Microsoft's intended decoupling. For example, we could do typical OODBMS tricks such as prefetch of objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together, and if all objects are modeled as subclasses (subtables) of a common superclass, then, regardless of instance type, what is created together (has consecutive IDs) will indeed tend to cluster on the same page. These tricks can deliver good results in very navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback.
But what is more interesting and more topical at present is making clients for the RDF world. There, the OWL Ontology Language could be used to make the .NET classes, and the DBMS could, when returning URIs serving as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as 'proxies' of these RDF objects. Of course, only predicates for which the client has a representation are relevant, thus some client-server handshake is needed at the start. What data could be prefetched is, in effect, the intersection of a concise bounded description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type was not known or inferable could be left out or represented as a special class with name-value pairs for its attributes, same with blank nodes.
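The proxy idea is language-neutral, so here is a minimal sketch of it in Python with rdflib rather than .NET (the class and the data are invented for illustration): a subject's predicates become attributes, and multi-valued predicates become lists, as described above.

    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/")   # hypothetical vocabulary

    class ResourceProxy:
        """Client-side proxy for one RDF subject."""
        def __init__(self, graph, iri):
            self.iri = iri
            self.properties = {}            # predicate -> list of values
            for p, o in graph.predicate_objects(iri):
                self.properties.setdefault(p, []).append(o)

    g = Graph()
    g.add((EX.alice, RDF.type, EX.Person))
    g.add((EX.alice, EX.knows, EX.bob))
    g.add((EX.alice, EX.knows, EX.carol))   # a multi-valued predicate

    alice = ResourceProxy(g, EX.alice)
    print(alice.properties[EX.knows])       # both values, as a list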
In this way, .NET's considerable UI capabilities could be directly exploited for visualizing RDF data, provided that the data complied reasonably well with a known ontology.
If a SPARQL query returned a result set, IRI-typed columns would be returned as .NET instances and the server would prefetch enough data for filling them in. For a SPARQL CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an Entity SQL string, these could possibly be specialized to allow for a SPARQL string instead. LINQ might have to be extended to allow for SPARQL-type queries, though.
Many of these questions will be better answerable as we get more details on Microsoft's forthcoming ADO .NET release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.
07/18/2006 13:29 GMT-0500 | Modified: 07/18/2006 14:28 GMT-0500
Object Relational Rediscovered?
Microsoft's recent unveiling of the next generation of ADO.NET has pretty much crystallized a long-running hunch that the era of standardized client/user-level interfaces for "Object-Relational" technology is nigh. Finally, this application / problem domain is attracting the attention of industry behemoths such as Microsoft.
In an initial response to these developments, Orri Erling, Virtuoso's Program Manager, shares valuable insights from past Object-Relational technology developments and their deliverables challenges. As Orri notes, the Virtuoso team suspended ORM and ORDBMS work at the onset of the Kubl-Virtuoso transition due to the lack of standardized client-side functionality exposure points.
My hope is that Microsoft's efforts trigger community-wide activity resulting in a collection of interfaces that enable scenarios such as generating .NET based Semantic Web Objects (where the S in an S->P->O RDF Triple becomes a bona fide .NET class instance generated from OWL).
To be continued since the interface specifics re. ADO.NET 3.0
remain in flux...
07/13/2006 21:59 GMT-0500 | Modified: 07/13/2006 21:59 GMT-0500
Hiding Ontology from the Semantic Web Users
A great piece from Harry Chen via his Geospatial Semantic Web
Blog. I have nothing to add to this bar: Amen! Enjoy the rest
of his post below:
Hiding Ontology from the Semantic Web Users: "
Ontology is a key foundation of the Semantic Web. Without
ontology, it will be difficult for applications to share knowledge
and reason over information that is published on the Web. However,
it is a serious mistake to think that the Semantic Web is simply a
collection of ontologies.
Last week I was invited to be on a panel discussion at
the Humans and
the Semantic Web Workshop. I
talked a bit about the Geospatial Semantic Web and its
associated research issues. Overall the workshop went very well.
You can read about the notes from the workshop here.
New Thoughts
Some of my new thoughts after the workshop are as follows.
- People, especially those who are new to the Semantic Web, have put too much emphasis on developing ontologies and not enough emphasis on developing application functions.
- While ontology languages such as RDF and OWL are an important part of the current Semantic Web development, it’s a mistake to build Semantic Web applications that assume that average users are fluent in those languages.
- Many people seem to have forgotten that building Semantic Web applications doesn’t have to start with ontology development. It’s a good idea to start with ontology reuse — i.e. reuse ontologies that have already been developed, even if they don’t meet every single requirement of the application.
- There is no excuse to build ‘crappy’ UI just because developing Semantic Web applications is challenging.
Hide Low-Level Details from the Semantic Web Users
I was asked the question, ‘What’re user-related issues that Semantic Web developers must pay attention to?’ I think building Semantic Web applications is similar to building database applications. There are a few things we can learn from our past experience in building database applications.
When building database-driven applications, we store information
in SQL databases, and we use SQL to access, manipulate, and manage
this information. When building Semantic Web applications, we
express ontologies and information in RDF, and use RDF query
languages (e.g. SPARQL) to access and
manipulate this information.
When building database-driven applications, we hide complexity
from the end-users. For example, we almost never expose raw SQL
statements to the end users, or ask users to process the raw result
sets returned from an SQL engine. We always provide intuitive
interfaces for accessing and representing information.
When building Semantic Web applications, we should also hide complexity from the end-users. Users shouldn’t need to see or edit RDF statements. Users shouldn’t need to be fluent in SPARQL queries or able to parse graphs that are returned by a SPARQL engine.
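That advice translates directly into code: wrap the query language behind a plain function so callers see neither SPARQL nor raw result sets. A minimal sketch, assuming Python with rdflib and the FOAF vocabulary (the function name and data are invented for illustration):

    from rdflib import Graph, Literal, Namespace, URIRef

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    def friends_of(graph, person):
        """Answers 'whom does this person know?'; the caller never
        sees SPARQL or a raw result set."""
        rows = graph.query(
            "SELECT ?name WHERE { ?p foaf:knows ?f . ?f foaf:name ?name }",
            initNs={"foaf": FOAF},
            initBindings={"p": person},
        )
        return [str(row[0]) for row in rows]

    g = Graph()
    alice = URIRef("http://example.org/alice")
    bob = URIRef("http://example.org/bob")
    g.add((alice, FOAF.knows, bob))
    g.add((bob, FOAF.name, Literal("Bob")))
    print(friends_of(g, alice))   # -> ['Bob']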
Concluding Remarks
Semantic Web developers should spend more time on building
functional capabilities that solve real world problems and improve
people’s productivity. It’s important to remember that ‘the Semantic Web != ontologies’.
"
06/30/2006 12:33 GMT-0500 | Modified: 06/30/2006 09:32 GMT-0500
DBMS Hosted Filesystems & WinFS
The return of WinFS back into SQL Server has re-ignited interest in the somewhat forgotten “DBMS Engine hosted Unified Storage System” vision. The WinFS project's struggles have more to do with the futility of “Windows Platform Monoculture” than the actual vision itself. In today's reality you simply cannot seek to deliver a “Unified Storage” solution that's inherently operating system specific and, even worse, ignores existing complementary industry standards and the loosely coupled nature of the emerging Web Operating System.
A quick FYI:
Virtuoso has offered a DBMS hosted Filesystem via WebDAV for a number of years, but the implications of this functionality have remained unclear for just as long. Thus, we developed (a few years ago) and released (recently) an application layer above Virtuoso's WebDAV storage realm called “The OpenLink Briefcase” (née oDrive). This application allows you to view items uploaded by content type and/or kind (People, Business Cards, Calendars, Business Reports, Office Documents, Photos, Blog Posts, Feed Channels/Subscriptions, Bookmarks, etc.). It also includes automatic metadata extraction (where feasible) and indexing. Naturally, as an integral part of our “OpenLink Data Spaces” (ODS) product offering, it supports GData, URIQA, SPARQL (note: WebDAV metadata is sync'ed with Virtuoso's RDF Triplestore), SQL, and WebDAV itself.
You can explore the power of this product via the following
routes:
- Download the Virtuoso Open Source Edition and the ODS add-ons, or
- Visit our live demo server (note: this is strictly a demo server with full functionality available), simply register, and then create a “Briefcase” application instance, or
- Digest this Briefcase Home Page Screenshot
06/26/2006 21:41 GMT-0500 | Modified: 06/26/2006 21:28 GMT-0500
Structured Data vs. Unstructured Data
There is an interesting article at regdeveloper.com titled: Structured data is boring and useless. This article provides insight into a serious point of confusion about what exactly is structured vs. unstructured data. Here is a key excerpt:
"We all know that structured data is boring and
useless; while unstructured data is sexy and chock full of value.
Well, only up to a point, Lord Copper. Genuinely unstructured data
can be a real nuisance - imagine extracting the return address from
an unstructured letter, without letterhead and any of the
formatting usually applied to letters. A letter may be thought of
as unstructured data, but most business letters are, in fact,
highly-structured." ....
Duncan Pauly, founder and chief technology officer of CopperEye, adds eloquent insight to the conversation:
"The labels "structured data" and "unstructured
data" are often used ambiguously by different interest groups; and
often used lazily to cover multiple distinct aspects of the issue.
In reality, there are at least three orthogonal aspects to
structure:
* The structure of the data
itself.
* The structure of the container that
hosts the data.
* The structure of the access method
used to access the data.
These three dimensions are largely independent and one does not
need to imply another. For example, it is absolutely feasible and
reasonable to store unstructured data in a structured database
container and access it by unstructured search
mechanisms."
Data understanding and appreciation is dwindling at a time when the reverse should be happening. We are supposed to be in the throes of the "Information Age", but for some reason this appears to have no correlation with data and "data access" in the minds of many -- as reflected in the broadly contradictory positions taken re. unstructured vs. structured data: structured is boring and useless, while unstructured is useful and sexy....
The difference between "Structured Containers" and "Structured Data" is clearly misunderstood by most (an unfortunate fact).
For instance, all DBMS products are "Structured Containers" aligned to one or more data models (typically one). These products have been limited by proprietary data access APIs and underlying data model specificity when used in the "Open-world" model that is at the core of the World Wide Web. This confusion also carries over to the misconception that Web 2.0 and the Semantic/Data Web are mutually exclusive.
But things are changing fast, and the concept of multi-model DBMS products is beginning to crystallize. On our part, we have finally released the long-promised "OpenLink Data Spaces" application layer that has been developed using our Virtuoso Universal Server. We have structured unified storage containment exposed to the data web cloud via endpoints for querying or accessing data using a variety of mechanisms that include GData, OpenSearch, SPARQL, XQuery/XPath, SQL, etc.
To be continued....
06/23/2006 18:35 GMT-0500 | Modified: 06/27/2006 01:39 GMT-0500