As I cannot post directly to Glenn's blog post titled "This is Not the Near Future (Either)", I have to respond to him here, in blog post form :-(
What is our "Search" and "Find" demonstration about? It is about how you use the "Description" of "Things" to unambiguously locate them in a database at Web scale.
To our perpetual chagrin, we are trying to demonstrate an engine -- not UI prowess -- yet the immediate response is to jump straight to UI aesthetics.
Google, Yahoo, etc. offer a simple input form for full-text search patterns, and they work within a processing window for completing full-text searches across Web content indexed on their servers. Once the search patterns are processed, you get a page-ranked result set: basically a collection of Web pages plus a claim of the form "we found N pages out of a document corpus of about M indexed pages."
Note: the "estimate" aspect of traditional search results is the equivalent of advertising small print. The user lives with the illusion that all possible documents on the Web (or even the Internet) have been searched, whereas in reality even 25% of the possible total would be a major stretch, since the Web and Internet are fractal, scale-free networks, inherently growing at exponential rates, ad infinitum, across boundless dimensions of human comprehension.
The power of Linked Data ultimately comes down to the fact that the user constructs the path to what they seek via the properties of the "Things" in question. The routes are not hardwired, since URI de-referencing (the "follow your nose" pattern) is available to Linked Data aware query engines and crawlers.
We are simply trying to demonstrate how you can combine the best of full-text search with the best of structured querying while reusing familiar interaction patterns from Google/Yahoo. Thus, you start with a full-text search, get all the entities associated with the pattern, and then use the entity types or entity properties to find what you seek.
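To make that flow concrete, here is a minimal sketch of the kind of query it implies, assuming Virtuoso's bif:contains full-text extension and illustrative rdfs:label / foaf:Person terms (the actual properties and classes depend on the data set):

    # Sketch only: full-text pattern narrowed by a structured (type) filter
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT DISTINCT ?s ?label
    WHERE
      {
        ?s rdfs:label ?label .
        ?label bif:contains '"glenn mcdonald"' .  # full-text stage (Virtuoso extension)
        ?s a foaf:Person .                        # structured stage: filter by entity type
      }
    LIMIT 25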
You state in your post:
"To state the obvious caveat, the claim OpenLink
is making about this demo is not that it delivers better
search-term relevance, therefore the ranking of searching results
is not the main criteria on which it is intended to be
assessed."
Correct.
"On the other hand, one of the things they are
bragging about is that their server will automatically cut off
long-running queries. So how do you like your first page of
results?".
Not exactly correct. We are performing aggregates within a configurable interactive time factor. Example: tell me how many entities of type Person, with interest "Semantic Web", exist in this database within 2 seconds. Also understand that you could retry the same query and get a different number within the same interactive time factor. It isn't your basic "query cut-off".
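A rough sketch of that example query, using illustrative FOAF terms and a SPARQL-BI style aggregate (the property names are assumptions, and the 2-second budget itself is configured on the engine/endpoint side rather than in the query text):

    # Sketch only: "how many Persons with interest 'Semantic Web'?"
    # The interactive time factor (e.g. 2 seconds) is enforced by the engine;
    # the count returned reflects whatever was aggregated within that budget.
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT (COUNT(DISTINCT ?s) AS ?persons)   # aggregates courtesy of SPARQL-BI
    WHERE
      {
        ?s a foaf:Person ;
           foaf:topic_interest ?interest .
        ?interest rdfs:label ?ilabel .
        ?ilabel bif:contains '"Semantic Web"' .
      }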
"And on the other other hand, the big claim
OpenLink is making about this demo is that the aggregate experience
of using it is better than the aggregate experience of using
"traditional" search. So go ahead, use it. If you
can."
Yes, "Microsoft" was a poor example for sure, the example could
have been pattern: "glenn mcdonald", which should demonstrate the
fundamental utility of what we are trying to demonstrate i.e.,
entity disambiguation courtesy of entity properties and/or entity
type filtering.
Compare Google's results for "Glenn McDonald" with those from our demo (which disambiguates "Glenn McDonald" via associated properties and/or types), assuming we both agree that your Web site or blog home page isn't the center of your entity graph or personal data space (i.e., data about you); getting your home page to the top of Google's page rank therefore offers limited value, in reality.
What are we bragging about? A little more than what you attempt
to explain. Yes, we are showing that we can find stuff within a
processing window, but understand the following:
- Processing Time Window (or interactive time) is configurable
- Data Corpus is a Billion+ Triples (from the Billion Triples Challenge Data Set)
- SPARQL doesn't have aggregation capabilities by default (we have implemented SPARQL-BI to deliver aggregates for analytics against large data sets; we even handle the TPC-H industry standard benchmark with SPARQL-BI)
- Paging isn't possible without aggregates, and doing aggregates over a Billion+ triples as part of a query processing cycle isn't trivial stuff -- otherwise it would be everywhere, given the inherent and obvious necessity (see the sketch after this list).
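As a rough illustration of that last point, paging implies an aggregate pass plus a windowed fetch. The sketch below uses an illustrative foaf:Person filter, with the aggregate syntax coming from SPARQL-BI:

    # Pass 1 -- aggregate: how many matches exist (drives "page X of Y")
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT (COUNT(DISTINCT ?s) AS ?total)
    WHERE { ?s a foaf:Person . }

    # Pass 2 -- fetch one page: results 51-75 of the ordered match set
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT DISTINCT ?s
    WHERE { ?s a foaf:Person . }
    ORDER BY ?s
    LIMIT 25 OFFSET 50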
I hope I've clarified what's going on with our demo. If not, pose your challenges via examples and I will respond with solutions, or simply cry out loud: "no mas!".
As for your "Mac OS X Leopard" comments, I can only say this: I emphasized that this is a demo, the data is pretty old, and the input data has issues (i.e., some of the input data is bad, as your example shows). The purpose of this demo is not the text per se; it's the size of the data corpus and faceted querying. We are going to have the entire LOD Cloud loaded into the real thing, and in addition our Sponger middleware will be enabled; then you can take issue with data quality, as per your reference to "Cyndi Lauper" (btw, it takes one property filter to find information about her quickly, using "dbpprop:name" after filtering for properties with text values).
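For instance, a single property filter along these lines (a sketch only, assuming the DBpedia dbpprop: namespace and Virtuoso's bif:contains) pins her down immediately:

    # Sketch only: one filter on a text-valued property
    PREFIX dbpprop: <http://dbpedia.org/property/>
    SELECT DISTINCT ?s
    WHERE
      {
        ?s dbpprop:name ?name .
        ?name bif:contains '"Cyndi Lauper"' .
      }
    LIMIT 25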
Of all things, this demo had nothing to do with UI and information-presentation aesthetics. It was all about combining full-text search and structured queries (SPARQL behind the scenes) against a huge data corpus, en route to solving the challenges associated with faceted browsing over large data sets.
We have built a service that resides inside Virtuoso.
The Service is naturally of the "Web Service" variety and can be
used from any consumer / client environment that speaks HTTP
(directly or indirectly).
To be continued ...