Orri Erling's Weblog

Updated hardware improves LUBM 8000 load rate in Virtuoso 6

Fri, 14 Aug 2009 19:01:30 GMT

We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.

The load rate is now 160,739 triples-per-second.

	Virtuoso 6 (previous run)	Virtuoso 6 (new run)	Virtuoso 6 (newest run)
blades	1	1	2
processors	2 x Xeon 5410	2 x Xeon 5520	2 x Xeon 5520 + 2 x Xeon 5410 with 1x1GigE interconnect
memory	16G 667 MHz	72G 1333 MHz	72G 1333 MHz + 16G 667 MHz respectively
reported load rate triples-per-second	110,532	160,739	214,188

Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.

Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000

Mon, 29 Jun 2009 16:12:34 GMT

LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest Virtuoso.

The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.

The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk wait; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.
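The quoted rate follows directly from the triple count and the elapsed time. A small sketch of the arithmetic, using the rounded count of 1,068 million triples from the text (so the result lands slightly under the reported 110,532); the variable names are only for illustration.

```python
# Figures from the post: about 1,068 million triples loaded in 161m 3s.
triples = 1_068_000_000          # rounded count quoted in the text
elapsed_seconds = 161 * 60 + 3   # real time of the load, in seconds

rate = triples / elapsed_seconds
per_stream = rate / 8            # 8 load streams, one per server process

print(f"{rate:,.0f} triples/s overall, {per_stream:,.0f} per stream")
```

The small gap to the reported figure comes from rounding the triple count, not from a different method.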

The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.

We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.

Virtuoso RDF: A Getting Started Guide for the Developer

Wed, 17 Dec 2008 12:31:34 GMT

It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.

The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.

You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under Emacs (M-x shell) works best, since it makes cutting and pasting between the shell and files convenient, but the open source isql utility should also have command-line editing.

To get started, cd into binsrc/tests/lubm.

To verify that this works, you can do

./test_server.sh virtuoso-t

This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.

Loading the Data

The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.

The data files themselves are in lubm_8000, 15 files in RDF/XML.

There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.

So now let's go through this procedure.

Start the server:

$ virtuoso-t -f &

This starts the server in foreground mode, and puts it in the background of the shell.

Now we connect to it with the isql utility.

$ isql 1111 dba dba

This gives a SQL> prompt. The default username and password are both dba.

When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
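As an illustration of the keyword rule, here is a minimal sketch of how a client program might turn SPARQL text into a statement string before handing it to its SQL driver. The helper name as_virtuoso_sql is hypothetical, and the actual ODBC or JDBC call is left out on purpose, since it depends on the driver in use.

```python
def as_virtuoso_sql(sparql_text: str) -> str:
    """Wrap a SPARQL query so a SQL client can send it as a statement string."""
    # Drop surrounding whitespace and a trailing isql-style semicolon, if any.
    body = sparql_text.strip().rstrip(";").strip()
    return "sparql " + body


count_query = "SELECT COUNT(*) FROM <lubm> WHERE { ?x ?y ?z }"
statement = as_virtuoso_sql(count_query)
print(statement)
```

The resulting string is what would be passed to the driver's execute call in place of ordinary SQL text.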

The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.

Next it calls this function for the lubm_8000 directory under the server's working directory.

sparql 
   CLEAR GRAPH <lubm>;

sparql 
   CLEAR GRAPH <inf>;

load_lubm ( server_root() || '/lubm_8000/' );

Then it verifies that the right number of triples is found in the <lubm> graph.

sparql 
   SELECT COUNT(*) 
     FROM <lubm> 
    WHERE { ?x ?y ?z } ;

The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.

Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.

sparql 
   PREFIX  ub:  <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   INSERT 
      INTO GRAPH <lubm> 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM <lubm> 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
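The INSERT above joins the subOrganizationOf relation with itself once, so it adds the links that are two hops apart. A minimal sketch of the same step over plain Python sets, with made-up sample names; a deeper hierarchy would need the step repeated until nothing new appears.

```python
# Hypothetical sample data: (child, parent) pairs for ub:subOrganizationOf.
sub_org = {
    ("ResearchGroup0", "Department0"),
    ("Department0", "University0"),
}

# One self-join, as in the INSERT above: ?x -> ?y and ?y -> ?z gives ?x -> ?z.
implied = {
    (x, z)
    for (x, y1) in sub_org
    for (y2, z) in sub_org
    if y1 == y2
}

after_insert = sub_org | implied
print(sorted(implied))
```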

Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.

DB.DBA.TTLP ( file_to_string ( 'inf.nt' ), 
              'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl', 
              'inf' 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM <inf> 
    WHERE { ?x ?y ?z } ;

Then we declare that the triples in the <inf> graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the 'inft' rule set. Otherwise this has no effect.

rdfs_rule_set ('inft', 'inf');

The final command is just a checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.

checkpoint;

Now we are ready for querying.

Querying the Data

The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.

If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.

For example:

SQL> sparql 
   PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   SELECT * 
     FROM <lubm>
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  <http://www.Department0.University0.edu/AssistantProfessor0> 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.

To stop the server, simply type shutdown; at the SQL> prompt.

If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like

[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2

to the end of the virtuoso.ini file in the lubm directory. Then shut down and restart (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).

Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
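A minimal sketch of building such a parameterized URL, assuming the standard SPARQL protocol parameter name query and the port from the stanza above. Nothing is fetched here; sending the request (for example with urllib.request.urlopen) needs the server to be running.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://localhost:8421/sparql"   # port from the [HTTPServer] stanza

query = "SELECT COUNT(*) FROM <lubm> WHERE { ?x ?y ?z }"
url = ENDPOINT + "?" + urlencode({"query": query})
print(url)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```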

We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.

To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF.

To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.

ISWC 2008: Some Questions

Tue, 04 Nov 2008 15:54:42 GMT

Inference: Is it always forward chaining?

We got a number of questions about Virtuoso's inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.

Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.

For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.

We had a word with Ian Horrocks on the question. He noted that it is often naive on the part of the community to equate description of semantics with description of algorithm. The data need not always be blown up.

The advantage of not always materializing is that the working set stays smaller. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if the entailed subclass and sub-property triples are materialized rather than inferred at run time.
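The trade-off can be made concrete with a toy example. This is only a sketch of the two strategies over a made-up subclass hierarchy, not a description of how the Virtuoso engine implements either one; all names in it are hypothetical.

```python
# Hypothetical mini ontology and data; the names are for illustration only.
subclass_of = {"AssistantProfessor": "Professor", "Professor": "Faculty"}
types = {("prof0", "AssistantProfessor"), ("lect0", "Faculty")}


def superclasses(cls):
    """Yield cls and every class above it in the hierarchy."""
    while cls is not None:
        yield cls
        cls = subclass_of.get(cls)


def instances_at_runtime(target):
    """Backward chaining: follow the hierarchy at query time, store nothing."""
    return {s for (s, c) in types if target in superclasses(c)}


def materialize():
    """Forward chaining: write out every entailed type triple in advance."""
    return {(s, sup) for (s, c) in types for sup in superclasses(c)}


materialized = materialize()
runtime_answer = instances_at_runtime("Faculty")
stored_answer = {s for (s, c) in materialized if c == "Faculty"}
print(len(types), "stored triples vs", len(materialized), "after materialization")
```

Both routes give the same answer; the difference is whether the extra triples occupy the working set or the extra steps are taken at query time.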

Genetic Algorithms for SPARQL?

Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al., for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of our "conventional wisdom" as we are, this might never have occurred to us.

Schema Last?

It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge.

There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware.

It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading

ISWC 2008: The Scalable Knowledge Systems Workshop

Mon, 03 Nov 2008 13:16:47 GMT

Mike Dean of BBN Technologies opened the Scalable Knowledge Systems Workshop with an invited talk. He reminded us of the facts of nature as concern the cost of distributed computing and running out of space for the working set. Developers in the semantic web field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can't join across partitions.

I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible.

Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files.

Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that this will probably have, especially with a memory-mapped file, a better best case than a B-tree, but that it will also be less predictable under fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With a disk-based memory-mapped linked-list storage, we would see some improvements in this while getting hit probably worse than now in the case of fragmentation. Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages. So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation.

Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.
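One way to picture a steady-state rule is as a check that throughput has stopped changing between measurement windows. The sketch below is only an assumption about how such a rule could be phrased; the function name, the window count, the 5% tolerance and the sample figures are all made up and are not part of BSBM.

```python
def reached_steady_state(window_rates, last_n=3, tolerance=0.05):
    """True if the last_n window rates differ by at most tolerance (relative)."""
    if len(window_rates) < last_n:
        return False
    recent = window_rates[-last_n:]
    low, high = min(recent), max(recent)
    return high > 0 and (high - low) / high <= tolerance


# Made-up queries-per-second figures for successive measurement windows.
warming_up = [5.0, 12.0, 21.0, 27.0]
settled = warming_up + [30.0, 30.5, 30.2]

print(reached_steady_state(warming_up), reached_steady_state(settled))
```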

Regardless of steady state, BSBM has two generally valid conclusions:

mapping relational to RDF, where possible, is faster than triple storage; and
the equivalent relational solution can be some 10x faster than the pure triples representation.

Mike Dean asked whether BSBM was a case of a setup to have triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged; take the Billion Triples Challenge data set for one.

Also, with BSBM, one should note that the query optimization time plays a fairly large role since most queries touch relatively little data. Also, even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL since the difference there is compiling the query, especially since parameters are not used. So, Chris, since we both like to map, let's make a benchmark that shows mapping closer to native SQL.

Bridging the 10x Gap?

When we run Virtuoso relational against Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column. Also, since the number of total joins is lower, the overhead coming from moving from join to next join is likewise lower. This is just a matter of the count of executed instructions.

A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row. With RDF, this is not a given. Also, the column store's row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.

There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. For the ideal row store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables.

Scale, Where Next?

Concerning the future direction of the workshop, there were a few directions suggested. One of the more interesting ones was Mike Dean's suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.

Virtuoso 5.0.6 Updates

Tue, 25 Mar 2008 16:59:08 GMT

I will here summarize the developments since the last Virtuoso 5 Open Source release.

On the RDF side, the bitmap intersection join has been improved quite a bit so that it is now almost always more than 2x more efficient than the equivalent nested loop join.

XML trees in the object position in RDF quads were in some cases incorrectly indexed, leading to failure to retrieve quads. This is fixed and should problems occur in existing databases, they can be corrected by simply dropping and re-creating an index.

Also the cost model has been further tuned. We have run the TPC-H queries with larger databases and have profiled it extensively. There are improvements to locking, especially for concurrency of transactions with large shared lock sets, as is the case in the TPC-H queries. The rules stipulate that these have to be run with repeatable read. There are also optimizations for decimal floating point.

A sampling of TPC-H queries translated into SPARQL comes with the new demo database. These show a live sample of the TPC-H schema translated into linked data, complete with SPARQL translations of the original queries. Some work is still ongoing there but the relational to RDF mapping is mature enough for real business intelligence applications now.

On the closed source side, we have some adjustments to the virtual database. When using Virtuoso as a front end to Oracle, using the TPC-H queries as a metric, the virtual database overhead is minimal. Previously, we had some overhead because some queries were rewritten in a way that Oracle would not optimize as well as the original TPC-H text. Specifically, turning an IN sub-query predicate into an equivalent EXISTS did not sit well with Oracle.

What's Wrong With LUBM?

Tue, 05 Feb 2008 11:47:11 GMT

In the interest of participating in a community benchmark development process, I will here outline some desiderata and explain how we could improve on LUBM. I will also touch on the message such an effort ought to convey.

A blow-by-blow analysis of the performance of a complex system such as a DBMS is more than fits within the scope of human attention at one go. This is why this all must be abbreviated into a single metric. Only when thus abbreviated, can this information be used in context. The metric's practical value is relative to how well it predicts the performance of the system in some real task. This means a task not likely to be addressed by an alternative technology, unless the challenger clearly beats the incumbent.

A benchmark is promotional material, both for the system being benchmarked and for the technology as a whole. This is why the benchmark, whatever it does, should do something that the technology does well, surely better than any alternative technology. A case in point is that one ought not to take a pure relational workload and RDF-ize it, for then the relational variant is likely to come out on top.

In this regard LUBM is not so bad because its reliance on class and property hierarchies and the occasional transitivity or inference rule makes the workload typically RDF, a little ways apart from a purely relational implementation of the task.

RDF's claim to fame is linked data. This means giving things globally unique names and thereby making anything joinable with anything else, insofar as there is agreement on the names. RDF is a key to a new class of problems, call it web scale database. Web scale here refers first to heterogeneity and multiplicity of independent sources and secondly to volume of data.

Now there are plenty of relational applications with very large volumes of data. On the non-relational side, there are even larger applications, such as web search engines. All these have a set schema and a specific workload they are meant to address. RDF versions of such are conceivable but hold no intrinsic advantage if considered in the specific niche alone.

The claim to fame of RDF is not to outperform these on their home turf but to open another turf altogether, allowing agile joining and composing of all these resources.

This is why a benchmark, i.e., an advertisement for the RDF value proposition, should not just take a relational workload and RDF-ize it. The benchmark should carry some of the web in it.

If we just intend to measure how well an RDF store joins triples to other triples, LUBM is almost good enough. If it defined a query mix with different frequencies for short and long queries and a concurrent query metric, it would be pretty much there. Our adaptation of it is adequate for counting joins per second. But joins per second is not a value proposition.

So we have two questions:

If we just take the RDF model and SPARQL, how do we make a benchmark that fills in what LUBM does not cover?
How do we make a benchmark that displays RDF's strengths against a comparable relational solution? A priori, by going somewhere where SQL has trouble reaching.

The answers to the first are not very complex:

Add some optionals. Have different frequencies of occurrence for some properties.
Add different graphs. Make queries joining between graphs and drawing on different graphs. Querying against all graphs of the store is not a part of the language. Still this would be useful but leave it out for now.
Add some filters and arithmetic. Not much can be done there, though because expressions cannot be returned and there is no aggregation or grouping.
Split the workload into short and long queries. The short should be typical for online use and the long ones for analysis. Different execution frequencies for different queries is a must. Analysis is limited by lack of grouping, expressions or aggregation. Still, something can be contrived by looking for a pattern that does not exist or occurs extremely rarely. Producing result sets of millions of rows is not realistic.
Many of the LUBM queries return thousands of rows, even when scoped to a single university. This is not very realistic. No user interface displays that sort of quantity. Of course, the intermediate results can be large as you please but the output must be somehow ranked. SPARQL has order by and limit, so these will have to be used. TPC H for example has almost always a group by/order by combination and sometimes a result rows limit.
The degree of inference in LUBM is about right, mostly sub-classes and sub-properties, nothing complex. We certainly regard this as a database benchmark more than a knowledge representation or rule system one.
LUBM does an OK job of defining a scale factor. I think that a concurrent query metric can just be so many queries per time at a given scale. The number of clients, I would say, can be decided by the test sponsor, taking whatever works best. A load balancer or web server can always be tuned to enforce some limit on concurrency. I don't think that a scale rule like in TPC C, where it says that only so many transactions per minute are allowed per warehouse is needed here. The effect of this is that when reporting a higher throughput, one has to automatically have a bigger database.
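The idea of giving different queries different execution frequencies can be sketched as a weighted draw. The query names and weights below are made up for illustration and are not taken from LUBM or any published mix; the seed is there only so that a run can be repeated.

```python
import random
from collections import Counter

# Hypothetical mix: query names and relative frequencies are made up.
query_mix = {"short_lookup": 80, "medium_join": 15, "long_analysis": 5}


def draw_queries(n, seed=0):
    """Return n query names sampled according to the mix weights."""
    rng = random.Random(seed)   # seeded so a run can be repeated
    names = list(query_mix)
    weights = [query_mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


workload = draw_queries(10_000)
counts = Counter(workload)
print(counts.most_common())
```

Each concurrent client would draw from the same mix, and the metric would then be queries completed per unit of time at a given scale.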

There is nothing to prevent these improvements from being put into a subsequent version of LUBM.

Building something that shows RDF at its best is a slightly different proposition. For this, we cannot be limited to the SPARQL recommendation and must allow custom application code and language extensions. Examples would be scripting similar to SQL stored procedures and extensions such as we have made for sub-queries and aggregation, explained a couple of posts back.

Maybe the Billion Triples challenge produces some material that we can use for this. We need to go for spaces that are not easily reached with SQL, have distributed computing, federation, discovery, demand driven import of data and such like.

I'll write more about ways of making RDF shine in some future post.

There are two kinds of workloads: online and offline. Online is what must be performed in an interactive situation, without significant human perceptible delay, i.e. within 500 ms. Anything else is offline.

Because this is how any online system is designed, this should be reflected in the benchmark. Ideally we would make two benchmarks.

LUBM results with Virtuoso 6.0

Mon, 04 Feb 2008 09:58:03 GMT

We have now run the LUBM benchmark on Virtuoso v6, with the same configuration as discussed last Friday.

We had a database of 8000 universities, and we ran 8 clients on slices of 100, 1000 and 8000 universities — same data but different sizes of working set.

 100 universities: 35.3 qps
1000 universities: 26.3 qps
8000 universities: 13.1 qps

The 100 universities slice is about the same as with v5.0.5 (35.3 vs 33.1 qps).
The 8000 universities set is almost 3x better (13.1 vs. 4.8 qps).
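The two comparisons can be checked with a line of arithmetic each. A small sketch using only the figures quoted above; the dictionary layout is just for illustration.

```python
# Queries per second reported for v6 and for v5.0.5.
v6 = {"100 universities": 35.3, "8000 universities": 13.1}
v5 = {"100 universities": 33.1, "8000 universities": 4.8}

ratios = {name: v6[name] / v5[name] for name in v6}
for name, ratio in ratios.items():
    print(f"{name}: v6 is {ratio:.2f}x v5.0.5")
```

The in-memory slice barely moves, while the disk-bound slice gains the most, which is consistent with the smaller database being the main source of the improvement.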

This comes from the fact that the v6 database takes half of the space of the v5.0.5 one. Further, this is with 64-bit IDs for everything. If the 5.5 database were with 64-bit IDs, we'd have a difference of over 3x. This is worth something if it lets you get by with only 1 terabyte of RAM for the 100 billion triple application, instead of 3 TB.

In a few more days, we'll give the results for Virtuoso v6 Cluster.

Latest LUBM Benchmark results for Virtuoso

Fri, 01 Feb 2008 14:39:04 GMT

We have now taken a close look at the query side of the LUBM benchmark, as promised a couple of blog posts ago.

We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database.

When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130.

We adapted some of the queries so that they do not run over the whole database. In terms of retrieving triples per second, this would be about 330000 for the rate of 33 qps, with 4 cores at 2GHz. This is a combination of random access and linear scans and bitmap merge intersections; lookups for non-found triples are not counted. The rate of random lookups alone based on known G, S, P, O, without any query logic, is about 250000 random lookups per core per second.
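These figures can be related to each other with simple arithmetic. A small sketch using only the numbers quoted above; the variable names are for illustration.

```python
triples_per_second = 330_000      # retrieval rate quoted for the in-memory case
queries_per_second = 33
cores = 4
random_lookups_per_core = 250_000

triples_per_query = triples_per_second / queries_per_second
per_core_in_mix = triples_per_second / cores
raw_lookup_capacity = cores * random_lookups_per_core

print(f"{triples_per_query:,.0f} triples per query on average")
print(f"{per_core_in_mix:,.0f} triples/s per core in the mix, "
      f"against {raw_lookup_capacity:,} raw lookups/s across all cores")
```

The gap between the per-core rate inside the query mix and the raw lookup rate is the cost of the query logic around the lookups.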

The article LUBM and Virtuoso gives the details.

In the process of going through the workload we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things.

Now, what has been said about RDF benchmarking previously still holds. What does it mean to do so many LUBM queries per second? What does this say about the capacity to run an online site off RDF data? Or about information integration? Not very much. But then this was not the aim of the authors either.

So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there.

In the immediate future, we have the general availability of Virtuoso Open Source 5.0.5 early next week. This comes with a LUBM test driver and a test suite running against the LUBM qualification database.

After this we will give some numbers for the cluster edition with LUBM and TPC-H.

LUBM and Virtuoso 5.5

Fri, 01 Feb 2008 12:37:53 GMT

We have now taken a close look at the query side of the LUBM benchmark, as promised a couple of blog posts ago.

We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database.

When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130.

The article LUBM and Virtuoso gives the details.

In the process of going through the workload, we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things.

So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there.

After this we will give some numbers for the cluster edition with LUBM and TPC H.

SPARQL Extensions for Subqueries

Wed, 16 Jan 2008 15:11:00 GMT

Last time I said we had extended SPARQL for sub-queries. As a preview of the new functionality, let us look at a query from TPC H.

Below is the Virtuoso SPARQL version of Q2.

sparql
define sql:signal-void-variables 1
prefix tpcd: <http://www.openlinksw.com/schemas/tpcd#>
prefix oplsioc: <http://www.openlinksw.com/schemas/oplsioc#>
prefix sioc: <http://rdfs.org/sioc/ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
select
  ?supp+>tpcd:acctbal,
  ?supp+>tpcd:name,
  ?supp+>tpcd:has_nation+>tpcd:name as ?nation_name,
  ?part+>tpcd:partkey,
  ?part+>tpcd:mfgr,
  ?supp+>tpcd:address,
  ?supp+>tpcd:phone,
  ?supp+>tpcd:comment
from <http://example.com/tpcd>
where {
  ?ps a tpcd:partsupp ; tpcd:has_supplier ?supp ; tpcd:has_part ?part .
  ?supp+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' .
  ?part tpcd:size 15 .
  ?ps tpcd:supplycost ?minsc .
  { select ?p min(?ps+>tpcd:supplycost) as ?minsc
    where {
        ?ps a tpcd:partsupp ; tpcd:has_part ?p ; tpcd:has_supplier ?ms .
        ?ms+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' .
      }
  }
    filter (?part+>tpcd:type like '%BRASS') }
order by
  desc (?supp+>tpcd:acctbal)
  ?supp+>tpcd:has_nation+>tpcd:name
  ?supp+>tpcd:name
  ?part+>tpcd:partkey ;

Note the pattern { ?ms+>tpcd:has_nation+>tpcd:has_region tpcd:name 'EUROPE' } which is a shorthand for { ?ms tpcd:has_nation ?t1 . ?t1 tpcd:has_region ?t2 . ?t2 tpcd:name "EUROPE" }
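To make the shorthand concrete, here is a small Python sketch of how such a chained pattern could be unrolled into plain triple patterns. This is only an illustration of the rewriting, not Virtuoso's actual compiler code; the function name `expand_path` and the `?t1`, `?t2` naming of intermediate variables are assumptions made for the example.

```python
def expand_path(subject, path, predicate, obj):
    """Unroll subject+>p1+>p2 predicate obj into plain triple patterns.

    Each +> step introduces one fresh intermediate variable.
    """
    patterns = []
    current = subject
    for i, step in enumerate(path, start=1):
        tmp = f"?t{i}"  # illustrative naming for the intermediate variable
        patterns.append((current, step, tmp))
        current = tmp
    # The final predicate/object pair applies to the end of the chain.
    patterns.append((current, predicate, obj))
    return patterns


europe = expand_path(
    "?ms", ["tpcd:has_nation", "tpcd:has_region"], "tpcd:name", '"EUROPE"'
)
for s, p, o in europe:
    print(f"{s} {p} {o} .")
```

Two +> steps give two intermediate variables and three triple patterns in all.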

Also note a sub-query is used for determining the lowest supply cost for a part.

The SQL text of the query can be found in the TPC H benchmark specification, reproduced below:

select s_acctbal, s_name, n_name,
        p_partkey, p_mfgr, s_address,
        s_phone, s_comment
from part, supplier, partsupp, nation, region
where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 15
        and p_type like '%BRASS'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and ps_supplycost = (
                        select min(ps_supplycost)
                        from partsupp, supplier, nation, region
                        where
                                p_partkey = ps_partkey
                                and s_suppkey = ps_suppkey
                                and s_nationkey = n_nationkey
                                and n_regionkey = r_regionkey
                                and r_name = 'EUROPE')
order by
        s_acctbal desc, n_name, s_name, p_partkey;

For brevity we have omitted the declarations for mapping the TPC H schema to its RDF equivalent. The mapping is straightforward, with each column mapping to a predicate and each table to a class.

This is now part of the next Virtuoso Open Source cut, due around next week.

As of this writing we are going through the TPC H query by query and testing with mapping going to Virtuoso and Oracle databases.

Also we have been busy measuring Virtuoso 6. Even after switching from 32-bit to 64-bit IDs for IRIs and objects, the new databases are about half the size of the same Virtuoso 5.0.2 databases. This does not include any stream compression like gzip for disk pages. The load and query speeds are higher because of the better working set. For all in memory, they are about even with 5.0.2. So now on an 8G box, we load 1067 million LUBM triples at 39.7 Kt/s instead of 29 Kt/s with 5.0.2. Right now we are experimenting with clusters at Amazon EC2. We'll write about that in a bit.

Retrospective and Outlook for 2008

Tue, 18 Dec 2007 10:53:40 GMT

At this close of the year, I'll give a little recap of the past year in terms of Virtuoso development, and take a look at where we are headed for 2008.

A year ago, I was in the middle of redoing the Virtuoso database engine for better SMP performance. We redid the way traversal of index structures and cache buffers was serialized for SMP, and generally compared Virtuoso and Oracle engines function by function. We had just returned from the ISWC 2006 in Athens, Georgia, and the Virtuoso database was becoming a usable triple store.

Soon thereafter, we confirmed that all this worked when we put out the first cut of DBpedia with Chris Bizer, et al, and were working with Alan Ruttenberg on what would become the Banff health care and life sciences demo.

The WWW 2007 conference in Banff, Canada, was a sort of kick-off for the Linking Open Data movement, which started as a community project under SWEO, the W3C interest group for Semantic Web Education and Outreach, and has gained a life of its own since.

Right after WWW 2007, the Virtuoso development effort split into two tracks: one for enhancing the then new 5.0 release; and one for building a new generation of Virtuoso, notably featuring clustering and double storage density for RDF.

The first track produced constant improvements to the relational-to-RDF mapping functionality, SPARQL enhancements, and Redland-, Jena- and Sesame-compatible client libraries with Virtuoso as a triple store. These things have been out with testers for a while and are all generally available as of this writing.

The second track started with adding key compression to the storage engine, specifically with regard to RDF, even though there are some gains in relational applications as well. With RDF, the space consumption drops to about half, all without recourse to any compression that is incompatible with random access, like gzip. Since the start of August, we have been working on clustering and are now code complete, with pretty much all the tricks one would expect: full-function SQL, taking advantage of co-located joins, and doing aggregation and generally all possible processing where the data is. I have covered details of this along the way in previous posts. The key point is that now the thing is written and works with test cases.

In late October, we were at the W3C workshop for mapping relational data to RDF. For us, this confirmed the importance of mapping and scalability in general. Ivan Herman proposed forming a W3C incubator group on benchmarking. Also a W3C incubator group on relational to RDF mapping is being formed.

Now, scalability has two sides. One is dealing with volume, and the other is dealing with complexity. Volume alone will not help if interesting queries cannot be formulated. Hence, we recently extended SPARQL with sub-queries so that we can now express at least any SQL workloads, which was previously not the case. It is sort of a contradiction in terms to say that SPARQL is the universal language for information integration while not being able to express, for example, the TPC-H queries. Well, we fixed this. A separate post will highlight how. The W3C process will eventually follow, as the necessity of these things is undeniable, on the unimpeachable authority of the whole SQL world. Anyway, for now, SPARQL as it is ought to become a recommendation and extensions can be addressed later.

For now, the only RDF benchmark that seems to be out there is the loading part of the LUBM. We did a couple of enhancements of our own for that just recently, but much bigger things are on the way. Also, the billion triples challenge is an interesting initiative in the area. We all recognize that loading any number of triples is a finite problem with known solutions. The challenge is running interesting queries on large volumes.

Our present emphasis is demonstrating both RDF data warehousing and RDF mapping with complex queries and large data. We start with the TPC-H benchmark, doing the queries both through mapping to SQL against any RDBMS (Oracle, DB2, Virtuoso or other) and by querying the physical RDF rendition of the data in Virtuoso. From there, we move to querying a collection of RDBMS hosting similar data.

Doing this with performance at the level of direct SQL in the case of mapping and not very much slower with physical triples is an important milestone on the way to a real world enterprise data web. Real life has harder and more unexpected issues than a benchmark, but at any rate doing the benchmark without breaking a sweat is a step on the way. We sent a paper to ESWC 2008 about that but it was rather incomplete. By the time of the VLDB submissions deadline in March we'll have more meat.

Another tack soon to start is a re-architecting of Zitgist around clustered Virtuoso. Aside from matters of scale, we will make a number of qualitatively new things possible. Again, more will be released in the first quarter of 2008.

Beyond these short and mid-term goals we have the introduction of entirely dynamic and demand driven partitioning, a la Google Bigtable or Amazon Dynamo. Now, regular partitioning will do for a while yet but this is the future when we move to the vision of linked data everywhere.

In conclusion, this year we have built the basis and the next year is about deployment. The bulk of really new development is behind us and now we start applying. Also, the community will find adoption easier due to our recent support of the common RDF APIs.

Virtuoso LUBM Load Update

Thu, 06 Dec 2007 13:35:21 GMT

As part of the recent conversation on benchmarking RDF stores, we re-ran the LUBM 8000 load test (1067 million triples) with the current Virtuoso.

We did it on two different machines, one with 2 Xeon 5130 2 GHz and 8G RAM and one with 2 Xeon 5330 2 GHz and 16G RAM. Both had 6 x 7200 rpm SATA-2 drives. The load rate on the 16G configuration was 36.8 Ktriples per second. The load rate on the 8G configuration was 29.7 Ktriples per second. Both loads were made using 6 concurrent load streams. Some small changes to the numbers may be released later as a result of changing tuning.
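For readers who prefer wall-clock time to rates, the two load rates can be turned into elapsed hours with a few lines of Python. This is just arithmetic on the figures quoted above; the helper name `load_hours` is made up for the example.

```python
TRIPLES = 1_067_000_000  # LUBM 8000 size as quoted in this post


def load_hours(triples_per_second):
    """Elapsed load time in hours at a steady rate."""
    return TRIPLES / triples_per_second / 3600


hours_16g = load_hours(36_800)  # 16G configuration
hours_8g = load_hours(29_700)   # 8G configuration
print(f"16G box: {hours_16g:.1f} h, 8G box: {hours_8g:.1f} h")
```

So the 16G machine finishes in roughly eight hours and the 8G machine in roughly ten, assuming the rate holds steady over the whole run.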

The Virtuoso version was 5.0, in the update to be released on the week of Dec 10, 2007. This is an incremental release of Virtuoso 5.0 and has the same engine as the prior 5.0s, with some optimizations for RDF loading and diverse bug fixes, notably in RDF mapping of relational data. This release will be further described in a separate post.

The load does not include forward chaining but then Virtuoso supports sub-class and sub-property without materializing the entailed triples.

Most of the LUBM entailed triples represent sub-classes and sub-properties. The LUBM query and forward chaining side deserves a separate treatment but this is for another time.

Most recent posts on this blog refer to Virtuoso 6, which is presently under development. We will publish results with the 6.0 engine later. Also, further enhancements to triple store performance will take place on the Virtuoso 6 platform.

Storage News

Thu, 12 Jul 2007 14:16:40 GMT

I have been away from the world for a few weeks, concentrating on technology.

We have now implemented an entirely new storage layout. With RDF data, we have now successfully doubled the working set.

This means that the number of triples that will fit in memory is doubled for any configuration. For any database in the hundreds of millions of triples, this is very significant. For LUBM data, we go from 75 bytes to 35 bytes per triple with the default indices.
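As a rough illustration of what this means for the working set, the sketch below computes how many triples fit in a fixed amount of cache at the two densities. The 8 GB cache size is an assumed figure chosen for the example, and the calculation ignores any overhead beyond the per-triple byte counts given above.

```python
BYTES_OLD = 75            # bytes per triple, previous layout (from the post)
BYTES_NEW = 35            # bytes per triple, new layout (from the post)
CACHE_BYTES = 8 * 10**9   # assumed 8 GB of buffer cache; hypothetical figure


def triples_in_cache(bytes_per_triple, cache_bytes=CACHE_BYTES):
    """Whole triples that fit in the cache at a given density."""
    return cache_bytes // bytes_per_triple


old_fit = triples_in_cache(BYTES_OLD)
new_fit = triples_in_cache(BYTES_NEW)
print(f"{old_fit / 1e6:.0f}M -> {new_fit / 1e6:.0f}M triples, {new_fit / old_fit:.2f}x")
```

The ratio is simply 75/35, a little better than two to one.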

This is obtained without using gzip or some other stream compression. Thus no decompression is needed at read time. Random access speeds are within 5% of those of Virtuoso v5.0.1, but the space requirement is halved and you can still locate a random triple in cache in a few microseconds.

What is better still, when using 8-byte IDs for IRIs instead of 4-byte ones, the space consumption stays almost the same since unique values are stored only once per page.

When applying gzip to the new storage layout, we usually get 3x compression. This means that 99% of 8K pages fit in 3K after compression. This is no real surprise since an index is repetitive pretty much by definition, even if the repeated sections are now shorter than in v5.0.1.

Gzip applied to pages does nothing for the working set since a page must remain random accessible for fast search but will cut disk usage to between half and a third. We will make this an option later. There are other tricks to be done with compression, like using a separate dictionary for non key text columns in relational applications. This would improve the working set in TPC-C and TPC-D quite a bit so we may do this also while on the subject.

Right now we are writing the clustering support, revising all internal APIs to run with batches of rows instead of single rows. We will most likely release clustering and the new storage layout together, towards the end of summer, at least in internal deployments.

I will blog about results as and when they are obtained, over the next few weeks.

Comparison of Open Source Databases with TPC D Queries

Mon, 05 Feb 2007 11:45:32 GMT

Last time we talked about database engine and transactions. Now we have come to the realm of query processing in our revisiting of the DBMS side of Virtuoso.

Now the well established, respectable standard benchmark for the basics of query processing is TPC D with its derivatives H and R. So we have, for testing how different SQL optimizers manage the 22 queries, run a mini version of the D queries with a 1% scale database, some 30M of data, all in memory. This basically catches whether SQL implementations miss some of the expected tricks and how efficient in memory loop and hash joins and aggregation are.

When we get to our next stop, high volume I/O, we will run the same with D databases in the 10G ballpark.

The databases were tested on the same machine, with warm cache, taking the best run of 3. All had full statistics and were running with read committed isolation, where applicable. The data was generated using the procedures from the Virtuoso test suite. The Virtuoso version tested was 5.0, to be released shortly. The MySQL was 5.0.27, the PostgreSQL was 8.1.6.

Query	Query Times in Milliseconds
Query	Virtuoso	PostgreSQL	MySQL	MySQL with InnoDB
Q1	206	763	312	198
Q2	4	6	3	3
Q3	13	51	254	64
Q4	4	16	24	60
Q5	15	22	64	68
Q6	9	70	189	65
Q7	52	143	211	84
Q8	29	31	13	11
Q9	36	114	97	61
Q10	32	51	117	57
Q11	16	9	12	10
Q12	8	21	18	130
Q13	18	74	-	-
Q14	7	21	418	1425
Q15	14	43	389	122
Q16	16	22	18	25
Q17	1	54	26	10
Q18	82	120	-	-
Q19	19	8	2	17
Q20	7	15	66	52
Q21	34	86	524	278
Q22	4	323	3311	805
Total (msec)	626	2063	6068	3545

We lead by a fair margin but MySQL is hampered by obviously getting some execution plans wrong and not doing Q13 and Q18 at all, at least not under several tens of seconds; so we left these out of the table in the interest of having comparable totals.
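The totals printed in the table are plain column sums, so the Virtuoso and PostgreSQL totals include Q13 and Q18 while the two MySQL totals do not. As a cross-check, the following Python sketch recomputes the sums from the table and also gives totals restricted to the 20 queries that all four configurations completed. The data structure and variable names are only for illustration.

```python
SYSTEMS = ("Virtuoso", "PostgreSQL", "MySQL", "MySQL with InnoDB")

# Query times in milliseconds, copied from the table; None marks a missing result.
TIMES = {
    "Q1": (206, 763, 312, 198),   "Q2": (4, 6, 3, 3),
    "Q3": (13, 51, 254, 64),      "Q4": (4, 16, 24, 60),
    "Q5": (15, 22, 64, 68),       "Q6": (9, 70, 189, 65),
    "Q7": (52, 143, 211, 84),     "Q8": (29, 31, 13, 11),
    "Q9": (36, 114, 97, 61),      "Q10": (32, 51, 117, 57),
    "Q11": (16, 9, 12, 10),       "Q12": (8, 21, 18, 130),
    "Q13": (18, 74, None, None),  "Q14": (7, 21, 418, 1425),
    "Q15": (14, 43, 389, 122),    "Q16": (16, 22, 18, 25),
    "Q17": (1, 54, 26, 10),       "Q18": (82, 120, None, None),
    "Q19": (19, 8, 2, 17),        "Q20": (7, 15, 66, 52),
    "Q21": (34, 86, 524, 278),    "Q22": (4, 323, 3311, 805),
}

# Column sums over whatever each system ran (matches the printed totals).
printed = {
    name: sum(row[i] for row in TIMES.values() if row[i] is not None)
    for i, name in enumerate(SYSTEMS)
}

# Sums over only the queries that every system completed.
common_rows = [row for row in TIMES.values() if None not in row]
comparable = {
    name: sum(row[i] for row in common_rows) for i, name in enumerate(SYSTEMS)
}

print(printed)
print(comparable)
```

On the 20 common queries the sums come to 526 ms for Virtuoso and 1869 ms for PostgreSQL, with the MySQL figures unchanged.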

As usual, we also ran the workload on Oracle 10g R2. Since Oracle does not like their numbers being published without explicit approval, we will just say that we are even with them within the parameters described above. Oracle has a more efficient decimal type so it wins where that is central, as on Q1. Also it seems to notice that the GROUP BYs of Q18 are produced in order of grouping columns, so it needs no intermediate table for storing the aggregates. If we addressed these matters, we'd lead by some 15% whereas now we are even. A faster decimal arithmetic implementation may be in the release after next.

In the next posts, we will look at IO and disk allocation, and also return to RDF and LUBM.

Virtuoso 5.0 Preview

Wed, 10 Jan 2007 15:08:43 GMT

As previously said, we have a Virtuoso with brand new engine multithreading. It is now complete and passes its regular test suite. This is the basis for Virtuoso 5.0, to be available as the open source and commercial cuts as before.

As one benchmark, we used the TPC-C test driver that has always been bundled with Virtuoso. We ran 100000 new orders worth of the TPC-C transaction mix first with one client and then with 4 clients, each client going to its own warehouse, so there was not much lock contention. We did this on a 4 core Intel, the working set in RAM. With the old one, 1 client took 1m43 and 4 clients took 3m47. With the new one, one client took 1m30 and 4 clients took 2m37. So, 400000 new orders in 2m37, for 152820 new orders per minute as opposed to 105720 per minute previously. Do not confuse this with the official tpmC metric; that one involves a whole bunch of further rules.
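The per-minute figures follow directly from the elapsed times. A short Python check is below; exact division gives values that agree with the numbers quoted above to within rounding. The function name is made up for the example.

```python
def orders_per_minute(orders, minutes, seconds):
    """Throughput in new orders per minute for a run of the given length."""
    return orders * 60 / (minutes * 60 + seconds)


new_rate = orders_per_minute(400_000, 2, 37)  # new engine, 4 clients
old_rate = orders_per_minute(400_000, 3, 47)  # old engine, 4 clients
print(round(new_rate), round(old_rate), f"{new_rate / old_rate:.2f}x")
```

The ratio of the two rates is 227/157, so the new engine pushes a bit under one and a half times the old throughput with 4 clients.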

TPC-C has activity spread over a few different tables. With tests dealing with fewer tables, improvements in parallelism are far greater.

Aside from better parallelism, we have other features. One of them is a change in the read committed isolation, so that we now return the previous committed state for uncommitted changed rows instead of waiting for the updating transaction to terminate. This is similar to what Oracle does for read committed. Also we now do log checkpoints without having to abort pending write transactions.

When we have faster inserts, we actually see the RDF bulk loader run slower. This is really backwards. The reason is that while one thread parses, other threads insert and if the inserting threads are done they go to wait on a semaphore and this whole business of context switching absolutely kills performance. With slower inserts, the parser keeps ahead so there is less context switching, hence better overall throughput. I still do not get it how the OS can spend between 1.5 and 6 microseconds, several thousand instructions, deciding what to do next when there are only 3-4 eligible threads and all the rest is background which goes with a few dozen slices per second. Solaris is a little better than Linux at this but not dramatically so. Mac OS X is way worse.

As said, we use Oracle 10G2 on the same platform (Linux FC5 64 bit) for sparring. It is really a very good piece of software. We have written the TPC C transactions in PL/SQL. What is surprising is that these procedures run amazingly slowly, even with a single client. Otherwise the Oracle engine is very fast. Well, as I recall, the official TPC C runs with Oracle use an OCI client and no stored procedures. Strange. While Virtuoso for example fills the initial TPC C state a little faster than Oracle, the procedures run 5-10 times slower with Oracle than with Virtuoso, all data in warm cache and a single client. While some parts of Oracle are really well optimized, all basic joins and aggregates etc, we are surprised at how they could have neglected such a central piece as the PL.

Also, we have looked at transaction semantics. Serializable is mostly serializable with Oracle but does not always keep a steady count. Also it does not prevent inserts into a space that has been found empty by a serializable transaction. True, it will not show these inserts to the serializable transaction, so in this it follows the rules. Also, to make a read really repeatable, it seems that the read has to be FOR UPDATE. Otherwise one can not implement a reliable resource transaction, like changing the balance of an account.

Anyway, the Virtuoso engine overhaul is now mostly complete. This is of course an open ended topic but the present batch is nearing completion. We have gone through as many as 3 implementations of hash joins, some things have yet to be finished there. Oracle has very good hash joins. The only way we could match that was to do it all in memory, dropping any persistent storage of the hash. This is of course OK if the hash is not very large and anyway hash joins go sour if the hash does not fit in working set.

As next topics, we have more RDF and the LUBM benchmark to finish. Also we should revisit TPC-D.

Databases are really quite complicated and extensive pieces of software. Much more so than the casual observer might think.

Ideas on RDF Store Benchmarking

Tue, 21 Nov 2006 14:09:21 GMT

This post presents some ideas and use cases for RDF store benchmarking.

Use Cases

Basic triple storage and retrieval. The LUBM benchmark captures many aspects of this.
Recursive rule application. The simpler cases of this are things like transitive closure.
Mapping of relational data to RDF. Since relational benchmarks are well established, as in the TPC benchmarks, the schemas and test data generation can come from there. The problem is that the D/H/R benchmarks consist of aggregates and grouping exclusively but SPARQL does not have these.

Benchmarking Triple Stores

An RDF benchmark suite should meet the following criteria:

Have a single scale factor.
Produce a single metric, queries per unit of time, for example. The metric should be concisely expressible, for example 10 qpsR at 100M, options 1, 2, 3. Due to the heterogeneous nature of the systems under test, the result's short form likely needs to specify the metric, scale and options included in the test.
Have optional parts, such as different degrees of inferencing and maybe language extensions such as full text, as this is a likely component of any social software.
Have a specification for a full disclosure report, TPC style, even though we can skip the auditing part in the interest of making it easy for vendors to publish results and be listed.
Have a subject domain where real data are readily available and which is broadly understood by the community. For example, SIOC data about on-line communities seems appropriate. Typical degree of connectedness, number of triples per person etc can be measured from real files.
Have a diverse enough workload. This should include initial bulk load of data, some adding of triples during the run and continuous query load.

The query load should illustrate the following types of operations:

Basic lookups, such as would be made for filling in a person's home page in a social networks app. List data of user plus names and emails of friends. Relatively short joins, unions, and optionals.
Graph operations like shortest path from individual to individual in a social network.
Selecting data with drill down, as in faceted browsing. For example, start with articles having tag t, see distinct tags of articles with tag t, select another tag t2 to see the distinct tags of articles with both t and t2 and so forth.
Retrieving all closely related nodes, as in composing a SIOC snapshot over a person's post in different communities, the recent activity report for a forum etc. These will be construct or describe queries. The coverage of describe is unclear, hence construct may be better.
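The drill-down operation in the list above can be sketched in a few lines of Python over an in-memory toy data set. The article names and tags are invented for the example, and a real benchmark would of course express this as SPARQL against the store rather than as set operations.

```python
# Toy data: article -> set of tags (all names are made up for illustration).
ARTICLES = {
    "a1": {"rdf", "sparql", "benchmark"},
    "a2": {"rdf", "sql"},
    "a3": {"rdf", "sparql"},
    "a4": {"sql", "benchmark"},
}


def drill_down(selected):
    """Return the articles carrying every selected tag, plus the other
    distinct tags found on those articles (the next facet choices)."""
    matching = {a for a, tags in ARTICLES.items() if selected <= tags}
    other_tags = set().union(*(ARTICLES[a] for a in matching)) - selected
    return matching, other_tags


step1 = drill_down({"rdf"})            # start with tag t
step2 = drill_down({"rdf", "sparql"})  # then add tag t2
print(step1)
print(step2)
```

Each step narrows the set of articles and offers the remaining tags as the next choices, which is the access pattern the benchmark query would need to exercise.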

If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to TPC C. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.
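A minimal sketch of how such a metric could be computed is below. The secondary query names and their frequencies per 10 main queries are hypothetical, since the actual mix would be fixed by the benchmark specification.

```python
def main_query_qps(main_executions, running_seconds):
    """The headline metric: main-query executions divided by running time."""
    return main_executions / running_seconds


def secondary_counts(main_executions, per_ten_main):
    """Executions of each secondary query implied by the mix, given how
    many times each one runs per 10 executions of the main query."""
    return {name: main_executions * n // 10 for name, n in per_ten_main.items()}


# Hypothetical mix: per 10 main lookups, 1 path query and 3 drill-downs.
MIX = {"shortest_path": 1, "facet_drilldown": 3}

qps = main_query_qps(36_000, 3600)       # one hour measured interval
extra = secondary_counts(36_000, MIX)
print(qps, extra)
```

Only the main query counts toward the reported rate; the secondary queries run alongside it in fixed proportion, as with the transaction mix in TPC C.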

Full Disclosure Report

The report contains basic TPC-like items such as:

Metric qps/scale/options
Software used, DBMS, RDF toolkit if separate
Hardware. Number, clock and type of CPUs per machine, number of machines in cluster, RAM per machine, disks per machine, manufacturer, price of hardware/software

These can go into a summary spreadsheet that is just like the TPC ones.

Additionally, the full report should include:

Configuration files for DBMS, web server, other components.
Parameters for test driver, i.e., number of clicks, how many concurrent clicks. The tester determines the degree of parallelism that gets the best throughput and should indicate this in the report. Making a graph of throughput as function of concurrent clients is a lot of work and maybe not necessary here.
Duration in real time. Since for any large database with a few G of working set the warm up time is easily 30 minutes, the warm up time should be mentioned but not included in the metric. The measured interval should not be less than 1h in duration and should reflect a "steady state," as defined in the TPC rules.
Source code of server side application logic. This can be inference rules, stored procedures, dynamic web pages or any other server side software-like thing that exists or is modified for the purpose of the test.
Specification of test driver. If there is a commonly used test driver, its type, parameters and version. If the test driver is custom, reference to its source code.
Database sizes. For a preallocated database of n G, how much was free after the initial load, how much after the test run? How many bytes per triple.
CPU/IO. This may not always be readily measurable but is interesting still. Maybe a realistic spec is listing the sum of CPU minutes across all server machines and server processes. For IO, maybe the system totals from iostat before and after the full run, including load and warm-up. If the DBMS and RDF toolkits are separate, it is interesting to know the division of CPU time between them.

Test Drivers

OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads. For example, 10000 users with 100 threads, each user with its own state, so that they carry out their respective usage patterns independently, getting served as soon as the server is available, still having no more than m requests going at any time. The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart etc. This can be modified to browse a social network database and produce the desired query mix. This generates HTTP requests, hence would work against a SPARQL end point or any set of dynamic web pages.
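The multiplexing idea can be illustrated with a short Python sketch. This is not the OpenLink C program, only a toy with the same shape: n users, each with its own state, served by m worker threads so that at most m requests are in flight. The usage pattern, the function name `run_driver` and the sleep standing in for an HTTP request are all assumptions for the example.

```python
import queue
import threading
import time


def run_driver(n_users, m_threads, pattern):
    """Simulate n users multiplexed over m threads; return simple statistics."""
    ready = queue.Queue()
    for uid in range(n_users):
        ready.put({"id": uid, "step": 0})  # per-user state

    stats = {"in_flight": 0, "max_in_flight": 0, "clicks": 0}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                user = ready.get_nowait()
            except queue.Empty:
                return  # nothing waiting; users still in service are re-queued by their worker
            with lock:
                stats["in_flight"] += 1
                stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
            time.sleep(0.001)  # stand-in for the HTTP round trip of pattern[user["step"]]
            with lock:
                stats["in_flight"] -= 1
                stats["clicks"] += 1
            user["step"] += 1
            if user["step"] < len(pattern):
                ready.put(user)  # user waits for the next free thread

    threads = [threading.Thread(target=worker) for _ in range(m_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats


PATTERN = ["check_mail", "browse_catalogue", "add_to_cart"]  # hypothetical usage pattern
result = run_driver(n_users=20, m_threads=4, pattern=PATTERN)
print(result)
```

Keeping the user state in the queue rather than in the thread is what lets a small number of threads serve a much larger number of simulated users.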

The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.

This can be packaged as a separate open source download once the test spec is agreed upon.

For generating test data, a modification of the LUBM generator is probably the most convenient choice.

Benchmarking Relational to RDF Mapping

This area is somewhat more complex than triple storage.

At least the following factors enter into the evaluation:

Degree of SPARQL compliance. For example, can one have a variable as predicate? Are there limits on optionals and unions?
Are the data being queried split over multiple RDBMS and joined between them?
Type of use case. Is this about navigational lookups or about statistics? OLTP or OLAP? It would be the former, as SPARQL does not really have aggregation. Still, many of the interesting queries are about comparing large data sets.

The rationale for mapping relational data to RDF is often data integration. Even in simple cases like the OpenLink Data Spaces applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.

A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.

A real world case is OpenLink's ongoing work for mapping WordPress, Mediawiki, phpBB, Drupal, and possibly other popular web applications into SIOC.

Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.

Another "enterprise style" scenario might be to take the TPC C and TPC D databases (after all, both have products, customers and orders) and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.

Considering the times and the audience, the WordPress/Mediawiki scenario might be culturally more interesting and more fun to demo.

The test has two aspects: Throughput and coverage. I think these should be measured separately.

The throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."

Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known at run time only, if the graph is not given, we get a union of everything joined with another union of everything and many of the joins between the terms of the different unions are identically empty but the software may not know this.

In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".

It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.

More RDF scalability tests

Wed, 01 Nov 2006 19:26:40 GMT

We have lately been busy with RDF scalability. We work with the 8000 university LUBM data set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h.

The resulting database is 75G, 74 bytes per triple which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly filled pages. See Advances in Virtuoso RDF Triple Storage for an in-depth treatment of the subject.

The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range of the whole key. But before we go to so much trouble, we'll look at dropping a couple of critical sections from index random access. Also some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still we should not waste too much time on platform specifics. We'll see.

We just updated the Virtuoso Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.

We are also now applying the relational to RDF mapping discussed in Declarative SQL Schema to RDF Ontology Mapping to the ODS applications.

There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.

We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the cost model that there is now. A more systematic review of the cost model for SPARQL will come when we get to the queries.

We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the SQL engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.

RDF Bulk Loading Revisited

Thu, 28 Sep 2006 11:39:52 GMT

We have made new benchmarks with loading the 47 million triples of the Wikipedia links data set. So far, our best result is 40 minutes with a dual core Xeon with 8G memory. This comes to about 18000 triples per second with between 1.2 and 2 CPU cores busy, slightly depending on configuration parameters. Our previous best result was with a dual 1.6GHz SPARC with 7700 triples per second on loading the 2M triple Wordnet data set.

These are memory based speeds. We have implemented an automatic background compaction for database tables and have tried the Wikipedia load with and without. The CPU cost of the compaction was about 10% with a slight gain in real time due to less IO.

But the real deal remains IO. With the compaction on, we got 91 bytes per triple, all included, i.e., two indices on the triples table, dictionaries from IRI IDs to URIs, etc. The compaction is rather simple — it just detects adjacent dirty pages about to be written to disk and sees if the set of contiguous dirty pages would fit on fewer pages than they now take. If so, it rewrites the pages and frees the ones left over. It does not touch clean pages. With some more logic it could also compact clean pages, provided the result did not have more dirty pages than the initial situation. With more aggressive compaction we will get about 75 bytes per triple. We will try this.
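The compaction rule described above can be sketched as follows. This is a simplified model, not the Virtuoso implementation: pages are represented only by a row count and a dirty flag, the page capacity is an invented constant, and rows are assumed to be freely movable between adjacent pages.

```python
import math

PAGE_CAPACITY = 100  # rows per page; hypothetical figure for the sketch


def compact(pages):
    """pages: list of (rows, dirty) tuples in key order.

    For each run of contiguous dirty pages, rewrite the run onto fewer
    pages if its rows would fit. Clean pages are never touched.
    Returns (new_pages, number_of_pages_freed).
    """
    out, freed, i = [], 0, 0
    while i < len(pages):
        if not pages[i][1]:
            out.append(pages[i])  # clean page: leave as is
            i += 1
            continue
        j = i
        while j < len(pages) and pages[j][1]:
            j += 1  # extend the run of adjacent dirty pages
        run = pages[i:j]
        total = sum(rows for rows, _ in run)
        needed = max(1, math.ceil(total / PAGE_CAPACITY))
        if needed < len(run):
            # Fill all but the last page completely; the last takes the remainder.
            last = total - PAGE_CAPACITY * (needed - 1)
            out.extend([(PAGE_CAPACITY, True)] * (needed - 1) + [(last, True)])
            freed += len(run) - needed
        else:
            out.extend(run)  # no saving possible; write the pages unchanged
        i = j
    return out, freed


before = [(60, True), (30, True), (40, True), (90, False), (50, True), (70, True)]
after, freed_pages = compact(before)
print(after, freed_pages)
```

In the example, the first three dirty pages hold 130 rows and are rewritten onto two pages, freeing one; the last two hold 120 rows and already need two pages, so they are written as they are.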

But the real gains will come from index compression with bitmaps. For the Wikipedia data set, this will cut one of the indices to about a third of its current size. This is also the index with the more random access, so the benefit is compounded in terms of working set. At that point we will be looking at about 50 bytes per triple. We will see next week how this works with the LUBM RDF benchmark.