The other day I ran into Jitsu, a new toolkit for creating Ajax-y applications.
Jitsu takes an interesting approach in the Ajax space.
(Via Miguel de Icaza.)
Here is the post titled "Intermediate RDF Loading Results":
Following on from the post about the new Multithreaded RDF Loader, here are some intermediate results and action plans based on my findings.
The experiments were run on a dual 1.6 GHz Sun SPARC with 4 GB RAM and two SCSI disks. The data sets were the 48M-triple Wikipedia data set and the 1.9M-triple Wordnet data set. 100% CPU means one CPU constantly active; 100% disk means one thread blocked on the read system call at all times.
Starting with an empty database, loading the Wikipedia set took 315 minutes, amounting to about 2500 triples per second. After this, loading the Wordnet data set with a cold cache and 48M triples already in the table took 4 minutes 12 seconds, amounting to 6838 triples per second. Loading the Wikipedia data showed CPU usage of up to 180%, but over the whole run CPU usage averaged around 50% with disk I/O around 170%. Loading the larger data set was significantly I/O bound, while loading the smaller set was more CPU bound, yet did not reach the full 200% CPU.
The RDF quad table was indexed on GSPO and PGOS. As one would expect, the bulk of I/O was on the PGOS index. We note that the pages of this index were on average only 60% full. Thus the most relevant optimization seems to be to fill the pages closer to 90%. This will directly cut about a third of all I/O, plus it will have an additional windfall benefit in the form of better disk cache hit rates resulting from a smaller database.
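The "about a third" figure follows directly from the fill factors, under the simple assumption (mine, not spelled out in the post) that page I/O volume scales inversely with average page fill:

```python
# Rough estimate of I/O savings from denser index pages.
# Assumption (not from the post): page I/O volume is inversely
# proportional to the average fill factor of the index pages.

def relative_io(current_fill, target_fill):
    """Fraction of current page I/O still needed after compaction."""
    return current_fill / target_fill

saved = 1 - relative_io(0.60, 0.90)  # pages go from 60% to 90% full
# saved is 1/3: about a third of page I/O eliminated
```

The same ratio also shrinks the on-disk footprint of the index, which is where the cache-hit-rate windfall comes from.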
The most practical way of having full index pages in the case of unpredictable random insert order will be to take sets of adjacent index leaf pages and compact the rows so that the last page of the set goes empty. Since this is basically an I/O optimization, this should be done when preparing to write the pages to disk, hence concerning mostly old dirty pages. Insert and update times will not be affected since these operations will not concern themselves with compaction. Thus the CPU cost of background compaction will be negligible in comparison with writing the pages to disk. Naturally this will benefit any relational application as well as free text indexing. RDF and free text will be the largest beneficiaries due to the large numbers of short rows inserted in random order.
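The compaction idea above can be sketched in a few lines. This is a hypothetical illustration, not Virtuoso's actual code: it packs the rows of a run of adjacent leaf pages left-to-right toward a target fill, so the trailing pages empty out and can be released before the set is written to disk.

```python
def compact_leaf_pages(pages, capacity, target_fill=0.9):
    """Sketch of background leaf-page compaction (hypothetical).

    pages    -- list of adjacent index leaf pages, each a list of rows
    capacity -- max rows per page
    Returns the repacked pages and the number of pages freed.
    """
    rows = [r for page in pages for r in page]   # keep index order
    fill = int(capacity * target_fill)           # aim below 100% full
    packed = [rows[i:i + fill] for i in range(0, len(rows), fill)]
    freed = len(pages) - len(packed)
    return packed, freed

# Three half-empty pages become two denser ones; one page is freed.
packed, freed = compact_leaf_pages([[1, 2, 3], [4, 5], [6]], capacity=5)
```

Because this runs only when old dirty pages are about to be flushed, inserts and updates never pay for it, which matches the post's claim that the CPU cost is negligible next to the disk writes.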
Looking at the CPU usage of the tests, locating the place in the index where to insert, which by rights should be the bulk of the time cost, was not very significant, only about 15%. Thus there are many unused possibilities for optimization, for example rewriting in C some parts of the loader currently done as stored procedures. Also the thread usage of the loader, with one thread parsing and mapping IRI strings to IRI IDs and 6 threads sharing the inserting, could be refined for better balance, as we have noted that the parser thread sometimes forms a bottleneck. Doing the updating of the IRI name to IRI id mapping on the insert thread pool would produce some benefit.
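The rebalancing suggested above can be sketched with a queue between the parser and the insert pool. This is a toy illustration of the thread layout, not the loader's real code: the "parser" only enqueues raw string triples, and the IRI-string-to-ID mapping happens on the insert threads instead.

```python
import queue
import threading

def run_loader(triples, n_insert_threads=6):
    """Toy sketch: parser thread work reduced to enqueueing;
    IRI -> id interning moved onto the insert thread pool."""
    work = queue.Queue()
    iri_ids = {}
    lock = threading.Lock()
    inserted = []

    def intern(iri):
        # shared IRI-name -> IRI-id mapping, updated by insert threads
        with lock:
            return iri_ids.setdefault(iri, len(iri_ids))

    def insert_worker():
        while True:
            t = work.get()
            if t is None:          # shutdown sentinel
                break
            s, p, o = t
            inserted.append((intern(s), intern(p), intern(o)))

    threads = [threading.Thread(target=insert_worker)
               for _ in range(n_insert_threads)]
    for th in threads:
        th.start()
    for t in triples:              # the "parser": just hands off strings
        work.put(t)
    for _ in threads:
        work.put(None)
    for th in threads:
        th.join()
    return len(inserted), len(iri_ids)
```

With the interning off the parser's critical path, the single parser thread is less likely to starve the six inserters.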
Anyway, since the most important test was I/O bound, we will first implement some background index compaction and then revisit the experiment. We expect to be able to double the throughput of the Wikipedia data set loading.
I shopped for everything except food on eBay. When working with foreign-language documents, I used translations from Babel Fish. (This worked only so well. After a Babel Fish round-trip through Italian, the preceding sentence reads, 'That one has only worked therefore well.') Why use up space storing files on my own hard drive when, thanks to certain free utilities, I can store them on Gmail's servers? I saved, sorted, and browsed photos I uploaded to Flickr. I used Skype for my phone calls, decided on books using Amazon's recommendations rather than 'expert' reviews, killed time with videos at YouTube, and listened to music through customizable sites like Pandora and Musicmatch. I kept my schedule on Google Calendar, my to-do list on Voo2do, and my outlines on iOutliner. I voyeured my neighborhood's home values via Zillow. I even used an online service for each stage of the production of this article, culminating in my typing right now in Writely rather than Word. (Being only so confident that Writely wouldn't somehow lose my work -- or as Babel Fish might put it, 'only confident therefore' -- I backed it up into Gmail files.) Interesting article; Tim O'Reilly's response is here.
(Via Valentin Zacharias (Student).)
Tim O'Reilly's response provides the following hierarchy for Web 2.0 based on what he calls "Web 2.0-ness":
Level 3: The application could ONLY exist on the net, and draws its essential power from the network and the connections it makes possible between people or applications. These are applications that harness network effects to get better the more people use them. EBay, craigslist, Wikipedia, del.icio.us, Skype, (and yes, Dodgeball) meet this test. They are fundamentally driven by shared online activity. The web itself has this character, which Google and other search engines have then leveraged. (You can search on the desktop, but without link activity, many of the techniques that make web search work so well are not available to you.) Web crawling is one of the fundamental Web 2.0 activities, and search applications like Adsense for Content also clearly have Web 2.0 at their heart. I had a conversation with Eric Schmidt, the CEO of Google, the other day, and he summed up his philosophy and strategy as "Don't fight the internet." In the hierarchy of web 2.0 applications, the highest level is to embrace the network, to understand what creates network effects, and then to harness them in everything you do.
Level 2: The application could exist offline, but it is uniquely advantaged by being online. Flickr is a great example. You can have a local photo management application (like iPhoto) but the application gains remarkable power by leveraging an online community. In fact, the shared photo database, the online community, and the artifacts it creates (like the tag database) is central to what distinguishes Flickr from its offline counterparts. And its fuller embrace of the internet (for example, that the default state of uploaded photos is "public") is what distinguishes it from its online predecessors.
Level 1: The application can and does exist successfully offline, but it gains additional features by being online. Writely is a great example. If you want to do collaborative editing, its online component is terrific, but if you want to write alone, as Fallows did, it gives you little benefit (other than availability from computers other than your own.)
Level 0: The application has primarily taken hold online, but it would work just as well offline if you had all the data in a local cache. MapQuest, Yahoo! Local, and Google Maps are all in this category (but mashups like housingmaps.com are at Level 3.) To the extent that online mapping applications harness user contributions, they jump to Level 2.
So, in a sense, we have near-conclusive confirmation that Web 2.0 is simply about APIs (typically service-specific Data Silos or Walled Gardens) with little concern for, understanding of, or interest in truly open data access across the burgeoning "Web of Databases" -- or the Web of "Databases and Programs" that I prefer to describe as "Data Spaces".
Thus, we can truly begin to conclude that Web 3.0 (Data Web) is the addition of Flexible and Open Data Access to Web 2.0; where the Open Data Access is achieved by leveraging Semantic Web deliverables such as the RDF Data Model and the SPARQL Query Language :-)
GeoRSS & Geonames for Philanthropy:
I heard about Kiva.ORG in a BusinessWeek podcast. After visiting its website, I think there are a few places where GeoRSS (in the RDF/A syntax) and Geonames can be used to enhance the site's functionality.
It's a microfinance website for people in developing countries. Its business model sits at the intersection of peer-to-peer financing and philanthropy. The goal is to help businesses in developing countries borrow small loans from a large group of Web users, so that they can avoid paying high interest to banks.
For example, a person in Uganda can request a $500 loan and use it for buying and selling more poultry. One or more lenders (anyone on the Web) may decide to grant loans to that person in increments as tiny as $25. After a few years, that person will pay back the loans to the lenders.
I went to the website and discovered that the site has a relatively weak search and browsing interface. In particular, there is no way to group loan requests by geographical location (e.g., countries, cities, and regions).
I took a look at individual loan pages. Each page actually describes location information in a standard way, e.g., "Location: Mbale, Uganda".
It should be relatively easy to add GeoRSS points (in the RDF/A syntax) to describe this location information (an alternative may be using Microformat Geo or W3C Geo). Once the location information is annotated, one can imagine building a map mashup to display loan requests in a geospatial perspective. One can also build search engines to support spatial queries such as "find me all loans from Mbale".
Since Kiva.ORG webmasters may not be GIS experts, it would be nice if we could find ways to automatically geocode location information and describe it using GeoRSS. This automatic geocoding procedure can be developed using Geonames's web services. Take a string like "Mbale" or "Uganda" and send it to Geonames's search service. The procedure will get back a JSON or XML description of the location, which includes latitude and longitude. This can then be used to annotate the location information in a Kiva loan page.
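As a rough sketch of that procedure: build a query against Geonames's JSON search service, then turn the first hit into a GeoRSS "lat lng" point string. The URL and response shape here follow Geonames's public search API; the "demo" username is a placeholder you would replace with a registered account, and the sample response below is illustrative, not real Kiva data.

```python
import json
from urllib.parse import urlencode

def geonames_search_url(place, username="demo"):
    """Build a Geonames JSON search query for a place-name string.
    'demo' is a placeholder username; use your own registered one."""
    qs = urlencode({"q": place, "maxRows": 1, "username": username})
    return "http://api.geonames.org/searchJSON?" + qs

def georss_point(geonames_json):
    """Turn the first Geonames hit into a GeoRSS 'lat lng' point string."""
    hit = json.loads(geonames_json)["geonames"][0]
    return "%s %s" % (hit["lat"], hit["lng"])

# In practice you would fetch geonames_search_url("Mbale") over HTTP
# and pass the response body to georss_point. Illustrative sample:
sample = '{"geonames": [{"name": "Mbale", "lat": "1.08209", "lng": "34.17503"}]}'
point = georss_point(sample)   # a GeoRSS-style point string
```

The resulting point string is exactly what a GeoRSS annotation (in RDF/A or otherwise) on a loan page would carry.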
Can you think of other ways to help Kiva.ORG become more "geospatially intelligent"?
You can learn more about Kiva.ORG at its website and listen to this podcast.
In an initial response to these developments, Orri Erling, Virtuoso's Program Manager, shares valuable insights from past Object-Relational technology developments and their delivery challenges. As Orri notes, the Virtuoso team suspended ORM and ORDBMS work at the onset of the Kubl-Virtuoso transition due to the lack of standardized client-side functionality exposure points.
My hope is that Microsoft's efforts trigger community-wide activity resulting in a collection of interfaces that enable scenarios such as generating .NET-based Semantic Web Objects (where the S in an S-P->O RDF triple becomes a bona fide .NET class instance generated from OWL).
To be continued since the interface specifics re. ADO.NET 3.0 remain in flux...
Note to Tim:
Is the RDF.net domain deal still on? I know it's past 1st Jan 2006, but do bear in mind that the critical issue of a broadly supported RDF Query Language only took significant shape approximately 13 months ago (in the form of SPARQL), and this is all so critical to the challenge you posed in 2003.
RDF.net could become a point of semantic-web-presence through which the benefits of SPARQL compliant Triple|Quad Stores, Shared Ontologies, and SPARQL Protocol are unveiled in their well intended glory :-).