Details

Orri Erling

Subscribe

Post Categories

Recent Articles

Display Settings

articles per page.
order.
Showing posts in all categories RefreshRefresh
European Commission and the Data Overflow

The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.

Since the questionnaire is public, I am publishing my answers below.

  1. Data and data types

    1. What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?

      Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.

      This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.

      Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.

      The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable, This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.

      Relevant sections of this mass of data are a potential addition to any present or future analytics application.

      Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.

      Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.

      By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.

    2. What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?

      All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.

      Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.

      Interleaving of all database functions and types becomes increasingly important.

  2. Industries, communities

    1. Who is producing these data and why? Could they do it better? How?

      Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).

      Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.

    2. Who is consuming these data and why? Could they do it better? How?

      Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird's eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.

      Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.

    3. What industrial sectors in Europe could become more competitive if they became much better at managing data?

      Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.

    4. Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?

      The regulation landscape drives database demand through data retention requirements and the like.

      With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.

      For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.

      More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.

    5. What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.

      We have come across the following:

      • Knowing that the data exists in the first place.
      • If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.
      • Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data.

      Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.

  3. Services, software stacks, protocols, standards, benchmarks

    1. What combinations of components are needed to deal with these problems?

      Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction.

      Interoperability and at least de facto standards in ways of doing this will emerge.

    2. What data exchange and processing mechanisms will be needed to work across platforms and programming languages?

      HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.

      There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.

      For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.

    3. What data environments are today so wastefully messy that they would benefit from the development of standards?

      RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.

      Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.

      Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example.

    4. What kind of performance is expected or required of these systems? Who will measure it reliably? How?

      Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.

      The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.

      These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.

      We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.

      The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.

      Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. But the TPC-H rules do not allow official reporting of such.

      Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.

      If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.

      Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.

      Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.

  4. Usability and training

    1. How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?

      In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.

      Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned.

      There is no single solution. For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.

      Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.

      For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.

      For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.

      This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.

      One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.

      This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.

    2. How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?

      For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.

      RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.

      A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.

      For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.

      The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.

      As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.

      If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.

      The problem is creating the pull.

  5. Challenges

    1. What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?

      The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.

      Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, specially if the race is tight.

      The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.

      The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.

      If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.

    2. What should one do to set up such a challenge, administer, and monitor it?

      The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.

      The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.

      There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.

      Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.

# PermaLink Comments [0]
10/27/2009 13:29 GMT Modified: 10/27/2009 14:57 GMT
VLDB 2009 TPC Workshop (3 of 5)

Michael Stonebraker gave the keynote at the TPC workshop. His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself. From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention.

Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding optimization, as has pretty much everybody else.

It is true that the rules encourage unrealistic configurations. The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of data, just so there are enough disk arms in parallel. Stonebraker also pointed out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long. Benchmarks should therefore include replication.

Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites. Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases. They avoid them when can. They want arrays for physics, and graphs for biology and chemistry. MapReduce is eating database's lunch; what will you do about this?

I later suggested incorporating an RDF metadata benchmark into the TPC suite. We'll see about this; we'll first have to come up with a suitable one. There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover.

TPC's own talk was about the life cycle of benchmarks. A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon. When the solution to this problem becomes commonplace, the benchmark's relevance gradually drops.

There was a talk on robustness of query plans which was well to the point. Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins. Quite so. The analysis I really would have liked to see would have been one of what happens when passing from single server to a cluster, and from local joins to cross-partition ones. Also contrasting of cache fusion and partitioning. We have our own data and experience but we find we don't have time to measure all the other systems.

Anyway it is good to raise the question of smooth and predictable performance.

# PermaLink Comments [0]
09/01/2009 11:51 GMT Modified: 09/01/2009 17:32 GMT
Updated hardware improves LUBM 8000 load rate in Virtuoso 6

We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.

The load rate is now 160,739 triples-per-second.

   Virtuoso 6
(previous run)
   Virtuoso 6
(new run)
   Virtuoso 6
(newest run)
blades    1    1    2
processors    2 x Xeon 5410    2 x Xeon 5520    2 x Xeon 5520
+
2 x Xeon 5410
with 1x1GigE
interconnect
memory    16G 667 MHz    72G 1333 MHz    72G 1333 MHz
+
16G 667 MHz
respectively
reported load rate
triples-per-second
   110,532    160,739    214,188

Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.

# PermaLink Comments [0]
08/14/2009 19:01 GMT Modified: 08/15/2009 15:27 GMT
Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000

LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest Virtuoso.

The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.

The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.

The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.

We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.

# PermaLink Comments [0]
06/29/2009 12:12 GMT Modified: 08/15/2009 16:06 GMT
Virtuoso RDF: A Getting Started Guide for the Developer

It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.

The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.

You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best. But the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files.

To get started, cd into binsrc/tests/lubm.

To verify that this works, you can do

./test_server.sh virtuoso-t

This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.

Loading the Data

The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.

The data files themselves are in lubm_8000, 15 files in RDFXML.

There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.

So now let's go through this procedure.

Start the server:

$ virtuoso-t -f &

This starts the server in foreground mode, and puts it in the background of the shell.

Now we connect to it with the isql utility.

$ isql 1111 dba dba 

This gives a SQL> prompt. The default username and password are both dba.

When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.

The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.

Next it calls this function for the lubm_8000 directory under the server's working directory.

sparql 
   CLEAR GRAPH <lubm>;

sparql 
   CLEAR GRAPH <inf>;

load_lubm ( server_root() || '/lubm_8000/' );

Then it verifies that the right number of triples is found in the <lubm> graph.

sparql 
   SELECT COUNT(*) 
     FROM <lubm> 
    WHERE { ?x ?y ?z } ;

The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.

Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.

sparql 
   PREFIX  ub:  <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   INSERT 
      INTO GRAPH <lubm> 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM <lubm> 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };

Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.

DB.DBA.TTLP ( file_to_string ( 'inf.nt' ), 
              'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl', 
              'inf' 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM <inf> 
    WHERE { ?x ?y ?z } ;

Then we declare that the triples in the <inf> graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the 'inft' rule set. Otherwise this has no effect.

rdfs_rule_set ('inft', 'inf');

This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.

checkpoint;

Now we are ready for querying.

Querying the Data

The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.

If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.

For example:

SQL> sparql 
   PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   SELECT * 
     FROM <lubm>
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  <http://www.Department0.University0.edu/AssistantProfessor0> 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.

To stop the server, simply type shutdown; at the SQL> prompt.

If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like —

[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2

— to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).

Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.

We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.

To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF

To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.

# PermaLink Comments [0]
12/17/2008 12:31 GMT Modified: 12/17/2008 12:41 GMT
Virtuoso Vs. MySQL: Setting the Berlin Record Straight (update 2)

In the context of the Berlin SPARQL Benchmark, I have repeatedly written about measurement procedures and steady state. The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup. Thus it came to pass that one cut of the BSBM paper had 3 seconds for MySQL and 100 for Virtuoso, basically through ignoring cache effects.

So we decided to do it ourselves.

The score is (updated with revised innodb_buffer_pool_size setting, based on advice noted down below):

n-clients Virtuoso MySQL
(with increased buffer pool size)
MySQL
(with default buffer poll size)
1 41,161.33 27,023.11 12,171.41
4 127,918.30 (pending) 37,566.82
8 218,162.29 105,524.23 51,104.39
16 214,763.58 98,852.42 47,589.18

The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here.

The benchmark is pure SQL, nothing to do with SPARQL or RDF.

The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux.

The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache. The test effectively runs from memory. We ran an analyze table on each MySQL table but noticed that this had no effect. Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases. This was not apparent in the tests.

The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine --

MySQL ought to do better. We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this. The public is invited to tell us about such.

Update:

Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size setting in the MySQL config. We did and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran these things a few times so as to have warm cache.

On the first run, we noticed that the Innodb warm up time was somewhere well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run larger scale system for half an hour first and then measure and then measure again. If the second measurement is the same as the first then it is good.

As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.

# PermaLink Comments [0]
11/20/2008 11:06 GMT Modified: 11/24/2008 10:15 GMT
ISWC 2008: Some Questions

Inference: Is it always forward chaining?

We got a number of questions about Virtuoso's inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.

Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.

For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.

We had a word with Ian Horrocks on the question. He noted that it is often naive on behalf of the community to tend to equate description of semantics with description of algorithm. The data need not always be blown up.

The advantage of not always materializing is that the working set stays better. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if entailment subclasses and sub-properties are materialized rather than done at run time.

Genetic Algorithms for SPARQL?

Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al, for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of our "conventional wisdom" as we are, this might never have occurred to us.

Schema Last?

It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge.

There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware.

It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading

# PermaLink Comments [0]
11/04/2008 15:54 GMT Modified: 11/04/2008 14:36 GMT
ISWC 2008: The Scalable Knowledge Systems Workshop

Mike Dean of BBN Technologies opened the Scalable Knowledge Systems Workshop with an invited talk. He reminded us of the facts of nature as concern the cost of distributed computing and running out of space for the working set. Developers in the semantic web field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can't join across partitions.

I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible.

Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files.

Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that this will have, especially with a memory-mapped file, probably a better best-case as a B-tree but that this also will be less predictable with fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With a disk-based memory-mapped linked-list storage, we would see some improvements in this while getting hit probably worse than now in the case of fragmentation. Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages. So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation.

Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.

Regardless of steady state, BSBM has two generally valid conclusions:

  1. mapping relational to RDF, where possible, is faster than triple storage; and
  2. the equivalent relational solution can be some 10x faster than the pure triples representation.

Mike Dean asked whether BSBM was a case of a setup to have triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged — take the Billion Triples Challenge data set for one.

Also, with BSBM, once should note that the query optimization time plays a fairly large role since most queries touch relatively little data. Also, even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL since the difference there is compiling the query, especially since parameters are not used. So, Chris, since we both like to map, let's make a benchmark that shows mapping closer to native SQL.

Bridging the 10x Gap?

When we run Virtuoso relational against Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column. Also, since the number of total joins is lower, the overhead coming from moving from join to next join is likewise lower. This is just a meter of count of executed instructions.

A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row. With RDF, this is not a given. Also, the column store's row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.

There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. For the ideal row store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables.

Scale, Where Next?

Concerning the future direction of the workshop, there were a few directions suggested. One of the more interesting ones was Mike Dean's suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.

# PermaLink Comments [0]
11/03/2008 13:16 GMT Modified: 11/03/2008 12:33 GMT
Virtuoso - Are We Too Clever for Our Own Good? (updated)

"Physician, heal thyself," it is said. We profess to say what the messaging of the semantic web ought to be, but is our own perfect?

I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times.

I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer).

"It is not always clear what the main thrust is, we get the impression that you are spread too thin," said Sören Auer.

Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.

This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store.

We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there.

The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.

There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e, the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).

Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering.

The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.

This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story — for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?

The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?

"You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid."

We should answer in multiple parts.

For general collateral, like web sites and documentation:

The web site gives a confused product image. For the Virtuoso product, we should divide at the top into

  • Data web and RDF - Host linked data, expose relational assets as linked data;
  • Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;
  • Web Services - access all the above over standard protocols, dynamic web pages, web hosting.

For each point, one simple statement. We all know what the above things mean?

Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this much more requests per second. This too is clear.

Far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.

Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.

Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.

I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.

"Virtuoso is very big and very difficult"

One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.

This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.

Now for the difficult side:

Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to — PHP or Java or anything else. For web services, simple is best. Stick to basics. "The engineer is one who can invent a simple thing." Use SQL statements rather than admin UI.

Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers is cluttered.

We should put this into a couple of use case oriented how-tos.

Also, we should create a network of "friendly local virtuoso geeks" for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there — please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.

"OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?"

Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.

If one really wants to do one's server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.

We are talking about such things with different parties at present.

"How webby are you? What is webby?"

"Webby means distributed, heterogeneous, open; not monolithic consolidation of everything."

We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice — the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.

Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.

"How does OpenLink see the next five years unfolding?"

Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.

The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.

Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.

Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.

With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertizing, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.

For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.

This makes some things possible that were hard thus far.

On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.

Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.

Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.

Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.

# PermaLink Comments [0]
10/26/2008 12:15 GMT Modified: 10/27/2008 12:07 GMT
Virtuoso Update, Billion Triples and Outlook

I will say a few things about what we have been doing and where we can go.

Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post.

There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.

Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, is all there and proven to work.

We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. We did along the way also run BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL.

RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol.

In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM's primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web.

I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.

Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.

Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.

The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and "same as" enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.

We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.

We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.

We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.

Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.

The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.

Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable.

Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.

The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.

Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.

# PermaLink Comments [0]
10/02/2008 09:31 GMT Modified: 10/02/2008 12:47 GMT
 <<     | 1 | 2 | 3 | 4 |     >>
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform