Kingsley Uyi Idehen
Lexington, United States
Data Spaces and Web of Databases
Note: An updated version of a previously unpublished blog post:
Continuing from our recent podcast conversation, Jon Udell sheds further light on the essence of that conversation in a “Strategic Developer” column titled: Accessing the web of databases. Below, I present an initial dump of a DataSpace FAQ that hopefully sheds light on the DataSpace vision espoused during my podcast conversation with Jon.
What is a DataSpace?
A moniker for Web-accessible atomic containers that manage and expose Data, Information, Services, Processes, and Knowledge.
What would you typically find in a Data Space? Examples include:
- Raw Data - SQL, HTML, XML (raw), XHTML, RDF etc.
- Information (Data In Context) - XHTML (various microformats), Blog Posts (in RSS, Atom, RSS-RDF formats), Subscription Lists (OPML, OCS, etc), Social Networks (FOAF, XFN etc.), and many other forms of applied XML.
- Web Services (Application/Service Logic) - REST or SOAP based invocation of application logic for context sensitive and controlled data access and manipulation.
- Persisted Knowledge - Information in actionable context that is also available in transient or persistent forms expressed using a Graph Data Model. A modern knowledgebase would more than likely have RDF as its Data Language, RDFS as its Schema Language, and OWL as its Domain Definition (Ontology) Language. Actual Domain, Schema, and Instance Data would be serialized using formats such as RDF/XML, N3, Turtle etc.
How do Data Spaces and Databases differ?
Data Spaces are fundamentally problem-domain-specific database applications. They offer functionality that you would instinctively expect of a database (e.g. ACID data management) with the additional benefit of being data model and query language agnostic. Data Spaces are for the most part DBMS Engine and Data Access Middleware hybrids in the sense that ownership and control of data is inherently loosely-coupled.
How do Data Spaces and Content Management Systems differ?
Data Spaces are inherently more flexible: they support multiple data models and data representation formats. Content management systems do not possess the same degree of data model and data representation dexterity.
How do Data Spaces and Knowledgebases differ?
A Data Space cannot dictate the perception of its content. For instance, what I may consider knowledge relative to my Data Space may not be so to a remote client that interacts with it from a distance. Thus, defining my Data Space purely as a Knowledgebase introduces constraints that reduce its broader effectiveness to third party clients (applications, services, users etc.). A Knowledgebase is based on a Graph Data Model, resulting in significant impedance for clients that are built around alternative models. To reiterate, Data Spaces support multiple data models.
What Architectural Components make up a Data Space?
- ORDBMS Engine - for Data Modeling agility (via complex purpose specific data types and data access methods), Data Atomicity, Data Consistency, Transaction Isolation, and Durability (aka ACID).
- Virtual Database Engine - for creating a single view of, and access point to, heterogeneous SQL, XML, Free Text, and other data. This is all about Virtualization at the Data Access Level.
- Web Services Platform - enabling controlled access and manipulation (via application, service, or protocol logic) of Virtualized or Disparate Data. This layer handles the decoupling of functionality from monolithic wholes for function specific invocation via Web Services using either the SOAP or REST approach.
Where do Data Spaces fit into the Web's rapid evolution?
They are an essential part of the burgeoning Data Web / Semantic Web. In short, they will take us from data “Mash-ups” (combining web accessible data that exists without integration and repurposing in mind) to “Mesh-ups” (combining web accessible data that exists with integration and repurposing in mind).
Where can I see a DataSpace along the lines described, in action?
Just look at my blog, and take the journey as follows:
What about other Data Spaces?
There are several, and I will attempt to categorize them along the lines of the query methods available:
Type 1 (Free Text Search over HTTP): Google, MSN, Yahoo!, Amazon, eBay, and most Web 2.0 plays.
Type 2 (Free Text Search and XQuery/XPath over HTTP): A few blogs and Wikis (Jon Udell's and a few others)
Type 3 (RDF Data Sets and SPARQL Queryable):
Type 4 (Generic Free Text Search, OpenSearch, GData, XQuery/XPath, and SPARQL): Points of Semantic Web presence such as the Data Spaces at:
What about Data Space aware tools?
- OpenLink Ajax Toolkit - provides JavaScript Control level binding to Query Services such as XMLA for SQL, GData for Free Text, OpenSearch for Free Text, SPARQL for RDF, in addition to service specific Web Services (Web 2.0 hosted solutions that expose service specific APIs)
- Semantic Radar - a Firefox Extension
- PingTheSemantic - the Semantic Web's equivalent of Web 2.0's weblogs.com
- PiggyBank - a Firefox Extension
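For the SPARQL-queryable Data Spaces mentioned above, querying over HTTP comes down to sending the query text to an endpoint. The following is a minimal sketch, assuming the standard SPARQL Protocol convention of passing the query in a `query` URL parameter; the endpoint address is a placeholder and not one of the Data Spaces referred to in this post.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the SPARQL endpoint of a real Data Space.
ENDPOINT = "http://example.org/sparql"


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a GET URL that carries the SPARQL query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# A generic query that asks for any ten triples held by the Data Space.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
url = sparql_get_url(ENDPOINT, query)
print(url)
```

Fetching that URL with any HTTP client would return the result set; the sketch stops at URL construction so that it runs without network access.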
08/28/2006 19:38 GMT-0500
|
Modified:
09/04/2006 18:58 GMT-0500
|
The WWW Proposal and RDF: Then and Now (circa 1999)
I've just re-read an article penned by Dan Brickley in 1999 titled: The WWW Proposal and RDF: Then and Now, that retains its prescience to this very day. Ironically, I stumbled across this timeless piece while revisiting the RSS name imbroglio that gave us a simple syndication format (RSS 2.0) that will ultimately implode (IMHO), since "Simple" is ultimately short lived when dealing with attention challenged end-users who are always assumed to be dumb when in fact they are simply ambivalent.
I was compelled to go back to the RSS 2.0 imbroglio when I came across Dave Winer's comments re. "the SEC attempting to reinvent RSS 2.0..." in response to Jon Udell's recent XBRL article. Although I don't believe in complex entry points into complex technology realms, I do subscribe to the approach where developers deal with the complexity associated with a problem domain while hiding said complexity from ambivalent end-users via coherent interfaces -- which does not always imply User Interface.
XBRL is a great piece of work that addresses the complex problem domain of Financial Reporting. The only thing it's missing right now is an Ontology that facilitates RDF Data Model based XBRL Schema and Instance Data, which would ultimately make XBRL data available to RDF query languages such as SPARQL. This line of thought implies, for instance, an XML Schema to OWL Ontology Mapping for Schema Data (as explained in a white paper by the VSIS Group at the University of Hamburg), leaving the Instance Data to be generated in a myriad of ways that include XML to RDF and/or XML->SQL->RDF.
As I stated in an earlier post: we should not mistake ambivalence for lack of intelligence. Assuming "Simple" is always right at all times is another way of subscribing to this profound misconception.
You know, assuming the world was flat (as opposed to geoid) was quite palatable at some point in the history of mankind, I wonder what would have happened if we held on to this point of view to this day because of its "Simplicity"?
08/28/2006 06:20 GMT-0500
|
Modified:
09/30/2006 16:27 GMT-0500
|
OpenLink Ajax Toolkit (OAT) 1.0 Released
We have finally released the 1.0 edition of OAT.
OAT offers a broad JavaScript-based, browser-independent widget set for building data source independent rich internet applications that are usable across a broad range of Ajax-capable web browsers.
OAT supports binding to the following data sources via its Ajax Database Connectivity Layer:
- SQL Data via XML for Analysis (XMLA)
- Web Data via SPARQL, GData, and OpenSearch Query Services
- Web Services specific Data via service specific binding to SOAP and REST style web services
The toolkit includes a collection of powerful rich internet application prototypes, including: SQL Query By Example, Visual Database Modeling, and Data bound Web Form Designer.
Project homepage on sourceforge.net:
http://sourceforge.net/projects/oat
Source Code:
http://sourceforge.net/projects/oat/files
Live demonstration:
http://www.openlinksw.com/oat/
08/08/2006 22:11 GMT-0500
|
Modified:
08/09/2006 05:12 GMT-0500
|
Intermediate RDF Bulk Loading (Wikipedia & Wordnet) Experiment Results
Orri shares his findings from internal experimentation re. Virtuoso and bulk loading RDF content such as the Wikipedia3 and Wordnet Data Sets:
Here is a dump of the post titled:
Intermediate RDF Loading Results:
Following from the post about a new Multithreaded RDF Loader, here are some intermediate results and action plans based on my findings.
The experiments were made on a dual 1.6GHz Sun SPARC with 4G RAM and 2 SCSI disks. The data sets were the 48M triple Wikipedia data set and the 1.9M triple Wordnet data set. 100% CPU means one CPU constantly active. 100% disk means one thread blocked on the read system call at all times.
Starting with an empty database, loading the Wikipedia set took 315 minutes, amounting to about 2500 triples per second. After this, loading the Wordnet data set with cold cache and 48M triples already in the table took 4 minutes 12 seconds, amounting to 6838 triples per second. Loading the Wikipedia data had CPU usage up to 180% but over the whole run CPU usage was around 50% with disk I/O around 170%. Loading the larger data set was significantly I/O bound while loading the smaller set was more CPU bound, yet was not at full 200% CPU.
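As a quick check of the Wikipedia figure above, the rate can be recomputed from the two numbers quoted in the post (48M triples, 315 minutes). This is only a sketch of the arithmetic; the function name is made up for illustration.

```python
def triples_per_second(triples: int, minutes: float) -> float:
    """Average load rate over a run, given its size and wall-clock duration."""
    return triples / (minutes * 60)


# Figures quoted in the post for the Wikipedia data set.
wikipedia_rate = triples_per_second(48_000_000, 315)
print(round(wikipedia_rate))  # 2540, in line with "about 2500 triples per second"
```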
The RDF quad table was indexed on GSPO and PGOS. As one would expect, the bulk of I/O was on the PGOS index. We note that the pages of this index were on the average only 60% full. Thus the most relevant optimization seems to be to fill the pages closer to 90%. This will directly cut about a third of all I/O plus will have an additional windfall benefit in the form of better disk cache hit rates resulting from a smaller database.
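The "about a third" estimate follows from the fact that, for a fixed number of rows, the number of pages needed is inversely proportional to how full each page is. A small sketch of that arithmetic, assuming I/O scales linearly with page count (the function name is illustrative):

```python
def io_saving(old_fill: float, new_fill: float) -> float:
    """Fraction of index pages (and, by assumption, page I/O) saved by
    raising the average page fill from old_fill to new_fill."""
    return 1 - old_fill / new_fill


saving = io_saving(0.60, 0.90)
print(f"{saving:.1%}")  # 33.3%
```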
The most practical way of having full index pages in the case of unpredictable random insert order will be to take sets of adjacent index leaf pages and compact the rows so that the last page of the set goes empty. Since this is basically an I/O optimization, this should be done when preparing to write the pages to disk, hence concerning mostly old dirty pages. Insert and update times will not be affected since these operations will not concern themselves with compaction. Thus the CPU cost of background compaction will be negligible in comparison with writing the pages to disk. Naturally this will benefit any relational application as well as free text indexing. RDF and free text will be the largest beneficiaries due to the large numbers of short rows inserted in random order.
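The compaction step described above can be sketched as follows. This is a toy model, not Virtuoso's implementation: pages are plain Python lists, the page capacity is an assumed constant, and rows are packed completely full rather than to the roughly 90% target mentioned earlier.

```python
from typing import List

PAGE_CAPACITY = 4  # assumed number of rows per page, for illustration only


def compact(pages: List[List[str]], capacity: int = PAGE_CAPACITY) -> List[List[str]]:
    """Repack the rows of a set of adjacent leaf pages, preserving key order.

    The same number of pages is returned; pages at the end of the set that
    are no longer needed come back empty and could be freed.
    """
    rows = [row for page in pages for row in page]
    packed = [rows[i:i + capacity] for i in range(0, len(rows), capacity)]
    packed += [[] for _ in range(len(pages) - len(packed))]
    return packed


# Three partially filled adjacent pages holding seven rows in key order.
pages = [["a", "b"], ["c", "d", "e"], ["f", "g"]]
print(compact(pages))  # [['a', 'b', 'c', 'd'], ['e', 'f', 'g'], []]
```

Because the repacking only touches pages that are about to be written anyway, inserts and updates never pay for it, which is the point made in the paragraph above.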
Looking at the CPU usage of the tests, locating the place in the index where to insert, which by rights should be the bulk of the time cost, was not very significant, only about 15%. Thus there are many unused possibilities for optimization, for example rewriting in C some parts of the loader currently implemented as stored procedures. Also the thread usage of the loader, with one thread parsing and mapping IRI strings to IRI IDs and 6 threads sharing the inserting, could be refined for better balance, as we have noted that the parser thread sometimes forms a bottleneck. Doing the updating of the IRI name to IRI ID mapping on the insert thread pool would produce some benefit.
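The thread layout described above (one parser thread mapping IRI strings to IRI IDs, six threads sharing the inserting) can be sketched with a bounded queue. This is an illustrative model only: the in-memory dictionary and set stand in for the IRI mapping and the quad table, and all names are hypothetical.

```python
import queue
import threading

NUM_INSERTERS = 6  # as in the post: six threads share the inserting
STOP = None        # sentinel telling an insert thread to finish


def parse(lines, iri_ids, work):
    """Single parser: map IRI strings to integer IDs and queue the ID triples."""
    for line in lines:
        s, p, o = line.split()
        ids = tuple(iri_ids.setdefault(iri, len(iri_ids) + 1) for iri in (s, p, o))
        work.put(ids)
    for _ in range(NUM_INSERTERS):
        work.put(STOP)


def insert(work, table, lock):
    """Insert worker: take ID triples off the queue and add them to the table."""
    while True:
        item = work.get()
        if item is STOP:
            break
        with lock:
            table.add(item)


def load(lines):
    iri_ids, table = {}, set()
    work, lock = queue.Queue(maxsize=1000), threading.Lock()
    workers = [threading.Thread(target=insert, args=(work, table, lock))
               for _ in range(NUM_INSERTERS)]
    for w in workers:
        w.start()
    parse(lines, iri_ids, work)  # the calling thread acts as the single parser
    for w in workers:
        w.join()
    return iri_ids, table


iri_ids, table = load(["ex:a ex:knows ex:b", "ex:b ex:knows ex:a"])
print(iri_ids, sorted(table))
```

In this layout the parser is the only producer, so if parsing and ID mapping are slow the six insert threads sit idle, which matches the bottleneck noted in the post.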
Anyway, since the most important test was I/O bound, we will first implement some background index compaction and then revisit the experiment. We expect to be able to double the throughput of the Wikipedia data set loading.
07/18/2006 15:21 GMT-0500
|
Modified:
07/18/2006 14:28 GMT-0500
|
More Thoughts on ORDBMS Clients, ADO.NET vNext, and RDF
Additional commentary from Orri Erling re. ORDBMS, ADO.NET vNext, and RDF (in relation to Semantic Web Objects):
More Thoughts on ORDBMS Clients, .NET and RDF:
Continuing on from the previous post... If Microsoft opens the right interfaces for independent developers, we see many exciting possibilities for using ADO .NET 3 with Virtuoso.
Microsoft quite explicitly states that their thrust is to decouple the client side representation of data as .NET objects from the relational schema on the database. This is a worthy goal.
But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: Towards object oriented database and towards making applications for the semantic web.
In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going as it were in the other direction from Microsoft's intended decoupling. For example, we could do typical OODBMS tricks such as prefetch of objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together, and if all objects are modeled as subclasses (subtables) of a common superclass, then, regardless of instance type, what is created together (has consecutive ids) will indeed tend to cluster on the same page. These tricks can deliver good results in very navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback.
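The clustering idea can be illustrated with a toy store in which consecutive ids share a page and a request for one object brings in the whole page. This is a sketch under assumed parameters (rows per page, an in-memory dictionary as the "disk"), not a description of Virtuoso's storage layer.

```python
ROWS_PER_PAGE = 8  # assumed page size in objects, for illustration only


def page_of(object_id: int) -> int:
    """Objects created together have consecutive ids and share a page."""
    return object_id // ROWS_PER_PAGE


class PagedStore:
    """Toy client cache that prefetches a whole page on every miss."""

    def __init__(self, objects):
        self.objects = objects  # id -> object, standing in for the database
        self.cache = {}
        self.page_reads = 0

    def get(self, object_id):
        if object_id not in self.cache:
            page = page_of(object_id)
            self.page_reads += 1
            # Bring in every object stored on the same page.
            for oid, obj in self.objects.items():
                if page_of(oid) == page:
                    self.cache[oid] = obj
        return self.cache[object_id]


store = PagedStore({i: f"object-{i}" for i in range(16)})
for i in range(8):
    store.get(i)         # ids 0..7 live on one page
print(store.page_reads)  # 1
store.get(8)             # first object of the next page
print(store.page_reads)  # 2
```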
But what is more interesting and more topical in the present times is making clients for the RDF world. There, the OWL Ontology Language could be used to make the .NET classes and the DBMS could, when returning URIs serving as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as 'proxies' of these RDF objects. Of course, only predicates for which the client has a representation are relevant, thus some client-server handshake is needed at the start. What data could be prefetched is like the intersection of a concise bounded description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type is not known or inferable could be left out or represented as a special class with name-value pairs for its attributes; the same goes for blank nodes.
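The mapping described above can be sketched in a few lines. Python stands in for .NET here, and every name (the `IRI` and `Proxy` classes, the set of known predicates, the sample triples) is hypothetical; the point is only to show IRIs becoming pointers, multi-valued predicates becoming lists, and predicates the client has no representation for being dropped.

```python
from collections import defaultdict


class IRI(str):
    """Marks a value as an IRI rather than a literal."""


class Proxy:
    """Client-side stand-in for an RDF resource."""

    def __init__(self, iri):
        self.iri = iri
        self.attrs = defaultdict(list)  # every predicate maps to a list of values


# Predicates the client "has classes for" (assumed for this example).
KNOWN_PREDICATES = {"foaf:name", "foaf:knows"}


def materialize(triples, known=KNOWN_PREDICATES):
    """Turn triples into proxy objects, keeping only predicates the client knows."""
    proxies = {}

    def proxy(iri):
        return proxies.setdefault(iri, Proxy(iri))

    for s, p, o in triples:
        if p not in known:
            continue  # no client-side representation: skip
        value = proxy(o) if isinstance(o, IRI) else o  # IRIs become pointers
        proxy(s).attrs[p].append(value)
    return proxies


triples = [
    (IRI("ex:alice"), "foaf:name", "Alice"),
    (IRI("ex:alice"), "foaf:knows", IRI("ex:bob")),
    (IRI("ex:alice"), "ex:shoeSize", "38"),  # unknown predicate, dropped
]
proxies = materialize(triples)
alice = proxies["ex:alice"]
print(alice.attrs["foaf:name"], alice.attrs["foaf:knows"][0].iri)
```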
In this way, .NET's considerable UI capabilities could directly be exploited for visualizing RDF data, given only that the data complied reasonably well with a known ontology.
If a SPARQL query returned a resultset, IRI type columns would be returned as .NET instances and the server would prefetch enough data for filling them in. For a SPARQL CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an Entity SQL string, these could possibly be specialized to allow for a SPARQL string instead. LINQ might have to be extended to allow for SPARQL type queries, though.
Many of these questions will be better answerable as we get more details on Microsoft's forthcoming ADO .NET release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.
07/18/2006 13:29 GMT-0500
|
Modified:
07/18/2006 14:28 GMT-0500
|
Object Relational Rediscovered?
Microsoft's recent unveiling of the next generation of ADO.NET has pretty much crystallized a long running hunch that the era of standardized client/user level interfaces for "Object-Relational" technology is nigh. Finally, this application / problem domain is attracting the attention of industry behemoths such as Microsoft.
In an initial response to these developments, Orri Erling, Virtuoso's Program Manager, shares valuable insights from past Object-Relational technology developments and deliverables challenges. As Orri notes, the Virtuoso team suspended ORM and ORDBMS work at the onset of the Kubl-Virtuoso transition due to the lack of standardized client-side functionality exposure points.
My hope is that Microsoft's efforts trigger community wide activity that results in a collection of interfaces that make possible scenarios such as generating .NET based Semantic Web Objects (where the S in an S-P->O RDF-Triple becomes a bona fide .NET class instance generated from OWL).
To be continued since the interface specifics re. ADO.NET 3.0 remain in flux...
07/13/2006 21:59 GMT-0500
|
Modified:
07/13/2006 21:59 GMT-0500
|
Hiding Ontology from the Semantic Web Users
A great piece from Harry Chen via his Geospatial Semantic Web Blog. I have nothing to add to this bar: Amen! Enjoy the rest of his post below:
Hiding Ontology from the Semantic Web Users: "
Ontology is a key foundation of the Semantic Web. Without ontology, it will be difficult for applications to share knowledge and reason over information that is published on the Web. However, it is a serious mistake to think that the Semantic Web is simply a collection of ontologies.
Last week I was invited to be on a panel discussion at the Humans and the Semantic Web Workshop. I talked a bit about the Geospatial Semantic Web and its associated research issues. Overall the workshop went very well. You can read about the notes from the workshop here.
New Thinkings
Some of my new thinkings after the workshop are as follows.
- People, especially those who are new to the Semantic Web, have put too much emphasis on developing ontologies and not enough emphasis on developing application functions.
- While ontology languages such as RDF and OWL are an important part of the current Semantic Web development, it’s a mistake to build Semantic Web applications that assume that average users are fluent in those languages.
- Many people seem to have forgotten that building Semantic Web applications doesn’t have to start with ontology development. It’s a good idea to start with ontology reuse, i.e. reuse ontologies that have already been developed even if they don’t meet every single requirement of the application.
- There is no excuse to build ‘crappy’ UI just because developing Semantic Web applications is challenging.
Hide Low-Level Details from the Semantic Web Users
I was asked the question, ‘What are the user-related issues that Semantic Web developers must pay attention to?’ I think building Semantic Web applications is similar to building database applications. There are a few things we can learn from our past experience in building database applications.
When building database-driven applications, we store information in SQL databases, and we use SQL to access, manipulate, and manage this information. When building Semantic Web applications, we express ontologies and information in RDF, and use RDF query languages (e.g. SPARQL) to access and manipulate this information.
When building database-driven applications, we hide complexity from the end-users. For example, we almost never expose raw SQL statements to the end users, or ask users to process the raw result sets returned from an SQL engine. We always provide intuitive interfaces for accessing and representing information.
When building Semantic Web applications, we should also hide complexity from the end-users. Users shouldn’t need to see or edit RDF statements. Users shouldn’t need to be fluent in SPARQL queries or able to parse graphs that are returned by a SPARQL engine.
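As a minimal sketch of that idea, the application below keeps both the SPARQL text and the raw result structure away from the user: the user supplies a name, and gets back a plain list of strings. The function names, the FOAF-based query, and the sample response are assumptions made for illustration; the response shape follows the standard SPARQL JSON results layout.

```python
from typing import List


def people_named_query(name: str) -> str:
    """Build the SPARQL query on the user's behalf; the user only types a name."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT ?person ?name WHERE {\n"
        "  ?person foaf:name ?name .\n"
        f'  FILTER (?name = "{escaped}")\n'
        "}"
    )


def names_from_results(results: dict) -> List[str]:
    """Reduce a SPARQL JSON result document to plain strings for display."""
    return [row["name"]["value"] for row in results["results"]["bindings"]]


# Illustrative response, shaped like a SPARQL JSON result document.
sample = {
    "head": {"vars": ["person", "name"]},
    "results": {"bindings": [
        {"person": {"type": "uri", "value": "http://example.org/alice"},
         "name": {"type": "literal", "value": "Alice"}},
    ]},
}

print(people_named_query("Alice"))
print(names_from_results(sample))  # ['Alice']
```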
Concluding Remarks
Semantic Web developers should spend more time on building functional capabilities that solve real world problems and improve people’s productivity. It’s important to remember that ‘the Semantic Web != ontologies‘.
"
06/30/2006 12:33 GMT-0500
|
Modified:
06/30/2006 09:32 GMT-0500
|
DBMS Hosted Filesystems & WinFS
The return of WinFS back into SQL Server has re-ignited interest in the somewhat forgotten “DBMS Engine hosted Unified Storage System” vision. The WinFS project's struggles have more to do with the futility of “Windows Platform Monoculture” than the actual vision itself. In today's reality you simply cannot seek to deliver a “Unified Storage” solution that's inherently operating system specific, and even worse, ignores existing complementary industry standards and the loosely coupled nature of the emerging Web Operating System.
A quick FYI:
Virtuoso has offered a DBMS hosted Filesystem via WebDAV for a number of years, but the implications of this functionality have remained unclear for just as long. Thus, we developed (a few years ago) and released (recently) an application layer above Virtuoso's WebDAV storage realm called: “The OpenLink Briefcase” (née oDrive). This application allows you to view items uploaded by content type and/or kind (People, Business Cards, Calendars, Business Reports, Office Documents, Photos, Blog Posts, Feed Channels/Subscriptions, Bookmarks etc.). It also includes automatic metadata extraction (where feasible) and indexing. Naturally, as an integral part of our “OpenLink Data Spaces” (ODS) product offering, it supports GData, URIQA, SPARQL (note: WebDAV metadata is sync'ed with Virtuoso's RDF Triplestore), SQL, and WebDAV itself.
You can explore the power of this product via the following routes:
- Download the Virtuoso Open Source Edition and the ODS add-ons or
- Visit our live demo server (note: this is strictly a demo server with full functionality available) and simply register and then create a “Briefcase” application instance
- Digest this Briefcase Home Page Screenshot
06/26/2006 21:41 GMT-0500
|
Modified:
06/26/2006 21:28 GMT-0500
|
Structured Data vs. Unstructured Data
There is an interesting article at regdeveloper.com titled: Structured data is boring and useless. This article provides insight into a serious point of confusion about what exactly is structured vs. unstructured data. Here is a key excerpt:
"We all know that structured data is boring and useless; while unstructured data is sexy and chock full of value. Well, only up to a point, Lord Copper. Genuinely unstructured data can be a real nuisance - imagine extracting the return address from an unstructured letter, without letterhead and any of the formatting usually applied to letters. A letter may be thought of as unstructured data, but most business letters are, in fact, highly-structured." ....
Duncan Pauly, founder and chief technology officer of Coppereye, adds eloquent insight to the conversation:
"The labels "structured data" and "unstructured data" are often used ambiguously by different interest groups; and often used lazily to cover multiple distinct aspects of the issue. In reality, there are at least three orthogonal aspects to structure:
* The structure of the data itself.
* The structure of the container that hosts the data.
* The structure of the access method used to access the data.
These three dimensions are largely independent and one does not need to imply another. For example, it is absolutely feasible and reasonable to store unstructured data in a structured database container and access it by unstructured search mechanisms."
Data understanding and appreciation is dwindling at a time when the reverse should be happening. We are supposed to be in the throes of the "Information Age", but for some reason this appears to have no correlation with data and "data access" in the minds of many -- as reflected in the broadly contradictory positions taken re. unstructured data vs structured data: structured is boring and useless while unstructured is useful and sexy....
The difference between "Structured Containers" and "Structured Data" is clearly misunderstood by most (an unfortunate fact). For instance, all DBMS products are "Structured Containers" aligned to one or more data models (typically one). These products have been limited by proprietary data access APIs and underlying data model specificity when used in the "Open-world" model that is at the core of the World Wide Web. This confusion also carries over to the misconception that Web 2.0 and the Semantic/Data Web are mutually exclusive.
But things are changing fast, and the concept of multi-model DBMS products is beginning to crystallize. On our part, we have finally released the long promised "OpenLink Data Spaces" application layer that has been developed using our Virtuoso Universal Server. We have structured unified storage containment exposed to the data web cloud via endpoints for querying or accessing data using a variety of mechanisms that include: GData, OpenSearch, SPARQL, XQuery/XPath, SQL etc.
To be continued....
06/23/2006 18:35 GMT-0500
|
Modified:
06/27/2006 01:39 GMT-0500
|
|
|