Updated hardware improves LUBM 8000 load rate in Virtuoso 6

We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G of 1333 MHz memory, and then once more with the old and new machines together as a networked cluster. Otherwise the settings were the same.

The load rate on the single new machine is now 160,739 triples-per-second; the two-machine cluster reaches 214,188.

                        Virtuoso 6        Virtuoso 6       Virtuoso 6
                        (previous run)    (new run)        (newest run)
blades                  1                 1                2
processors              2 x Xeon 5410     2 x Xeon 5520    2 x Xeon 5520 +
                                                           2 x Xeon 5410, with
                                                           1 x 1GigE interconnect
memory                  16G 667 MHz       72G 1333 MHz     72G 1333 MHz +
                                                           16G 667 MHz, respectively
reported load rate      110,532           160,739          214,188
(triples-per-second)

Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.

08/14/2009 19:01 GMT Modified: 08/15/2009 15:27 GMT
Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000

LUBM load speed still seems to be a metric quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000 (1,068 million triples) on the newest Virtuoso.

The real time for the load was 161m 3s; the rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G of 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes), one partition per CPU core. Each partition had its database striped over the system's 6 disks; the disks were shared between the 8 database processes.
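As an illustration, striping is configured in the [Striping] section of virtuoso.ini. A minimal sketch with hypothetical sizes and paths (the actual layout of the benchmark machine is not given here):

[Striping]
Segment1 = 10G, /disk1/virt/seg1-1.db, /disk2/virt/seg1-2.db
Segment2 = 10G, /disk3/virt/seg2-1.db, /disk4/virt/seg2-2.db

Each segment is divided round-robin over the stripe files listed for it, so reads and writes spread over the disks.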

The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk wait; at the end, it was around 700% with 25% disk wait. Here, 100% corresponds to one CPU core or one disk being constantly busy.

The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.

We think that LUBM loading is not a realistic benchmark of real-world use, but since other people publish such numbers, so do we.

06/29/2009 12:12 GMT Modified: 08/15/2009 16:06 GMT
Virtuoso RDF: A Getting Started Guide for the Developer

It is a long-standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.

The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.

You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell; running the shell under Emacs (M-x shell) is best, although the open source isql utility has command line editing of its own. The Emacs shell is convenient for cutting and pasting between the shell and files.

To get started, cd into binsrc/tests/lubm.

To verify that this works, you can do

./test_server.sh virtuoso-t

This tests the server with the LUBM queries and should report 45 tests passed. After this, we will do the same steps one by one.

Loading the Data

The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.

The data files themselves are in the lubm_8000 directory: 15 files in RDF/XML.

There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.

So now let's go through this procedure.

Start the server:

$ virtuoso-t -f &

The -f flag runs the server in foreground mode (it does not detach itself), and the & puts it in the background of the shell.

Now we connect to it with the isql utility.

$ isql 1111 dba dba 

This gives a SQL> prompt. The default username and password are both dba.

When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all SQL clients work: any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
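For example, both of the following can be typed at the same SQL> prompt (a sketch; DB.DBA.RDF_QUAD is the SQL table in which Virtuoso stores the quads):

SQL> SELECT COUNT(*) FROM DB.DBA.RDF_QUAD;                     -- plain SQL
SQL> sparql SELECT COUNT(*) FROM <lubm> WHERE { ?s ?p ?o };    -- SPARQL, marked by the leading keyword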

The lubm-load.sql file is quite self-explanatory. It begins by defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.
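The procedure is roughly like the following. This is a sketch, not the verbatim contents of lubm-load.sql; sys_dirlist and aref are standard Virtuoso/PL built-ins, and the .owl file suffix is an assumption about the generated data files.

CREATE PROCEDURE load_lubm (IN dir VARCHAR)
{
  DECLARE files ANY;
  DECLARE f VARCHAR;
  DECLARE inx INT;
  files := sys_dirlist (dir, 1);        -- names of the plain files in dir
  inx := 0;
  WHILE (inx < length (files))
    {
      f := aref (files, inx);
      IF (f LIKE '%.owl')               -- assumption: the data files end in .owl
        DB.DBA.RDF_LOAD_RDFXML (file_to_string (dir || f), dir, 'lubm');
      inx := inx + 1;
    }
}
;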

Next it calls this function for the lubm_8000 directory under the server's working directory.

sparql 
   CLEAR GRAPH <lubm>;

sparql 
   CLEAR GRAPH <inf>;

load_lubm ( server_root() || '/lubm_8000/' );

Then it verifies that the right number of triples is found in the <lubm> graph.

sparql 
   SELECT COUNT(*) 
     FROM <lubm> 
    WHERE { ?x ?y ?z } ;

The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.

Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.

sparql 
   PREFIX  ub:  <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   INSERT 
      INTO GRAPH <lubm> 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM <lubm> 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };

Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.

DB.DBA.TTLP ( file_to_string ( 'inf.nt' ), 
              'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl', 
              'inf' 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM <inf> 
    WHERE { ?x ?y ?z } ;

Then we declare that the triples in the <inf> graph can be used for inference at run time, under the rule set name 'inft'. A SPARQL query opts in by declaring that it uses the 'inft' rule set; queries that do not declare it are unaffected.

rdfs_rule_set ('inft', 'inf');
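For example, with the rule set in effect, counting professors will also count the instances of the subclasses of ub:Professor. This is a sketch; input:inference is the Virtuoso pragma naming the rule set to apply:

sparql 
   DEFINE input:inference 'inft'
   PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   SELECT COUNT(*) 
     FROM <lubm> 
    WHERE { ?x  a  ub:Professor } ;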

Last comes a log checkpoint to finalize the work and truncate the transaction log; the server would eventually do this on its own in any case.

checkpoint;

Now we are ready for querying.

Querying the Data

The queries are given in three different versions. The first file, lubm.sql, has the queries with most inference open-coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph; these entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.

If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.

For example:

SQL> sparql 
   PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   SELECT * 
     FROM <lubm>
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  <http://www.Department0.University0.edu/AssistantProfessor0> 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.

To stop the server, simply type shutdown; at the SQL> prompt.

If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like —

[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2

— to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).

Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
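For example, a URL-encoded query can be passed in the standard query parameter of the SPARQL protocol:

http://localhost:8421/sparql?query=SELECT+%3Fs+FROM+%3Clubm%3E+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D+LIMIT+10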

We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.

To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF.

To work with large volumes of data, you should give the server more memory in the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See the documentation for more.
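For instance, a large load might be wrapped as follows at the SQL> prompt, a minimal sketch reusing the names from the walk-through above:

log_enable (2);                                 -- row-autocommit mode, as described above
load_lubm ( server_root() || '/lubm_8000/' );
checkpoint;                                     -- make the loaded state durable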

12/17/2008 12:31 GMT Modified: 12/17/2008 12:41 GMT
ISWC 2008: Some Questions

Inference: Is it always forward chaining?

We got a number of questions about Virtuoso's inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.

Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.
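As a sketch of the latter, assuming the triple-pattern form of Virtuoso's OPTION (TRANSITIVE) extension, the transitive closure of LUBM's subOrganizationOf could be materialized in one statement:

sparql 
   PREFIX  ub:  <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   INSERT 
      INTO GRAPH <lubm> 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM <lubm> 
   WHERE { ?x  ub:subOrganizationOf  ?z  OPTION (TRANSITIVE) };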

For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.

We had a word with Ian Horrocks on the question. He noted that the community is often naive in tending to equate the description of semantics with the description of an algorithm. The data need not always be blown up.

The advantage of not always materializing is that the working set stays smaller. Once the working set no longer fits in memory, response times jump disproportionately. Also, if the data changes, is retracted, or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement: it can lead to a lot of consequences that are hard to retract. On the other hand, when running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if entailment over subclasses and sub-properties is materialized rather than done at run time.

Genetic Algorithms for SPARQL?

Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of "conventional wisdom" that we are, this might never have occurred to us.

Schema Last?

It is interesting to see how the industry comes to the semantic web conferences talking about schema last, while at the same time the traditional semantic web people stress enforcing schema constraints and making logics that perform more predictably and are friendlier to databases. Thus do the extremes converge.

There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data: one can just load it and look at what is there. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high-traffic online application would always benefit from having some application-specific data structures, which could also save considerably on hardware.

There is no sharp divide between RDF and an application-oriented relational representation. We have the capabilities in our RDB-to-RDF mapping; we just need to show this, together with SPARUL and data loading.

11/04/2008 15:54 GMT Modified: 11/04/2008 14:36 GMT
ISWC 2008: The Scalable Knowledge Systems Workshop

Mike Dean of BBN Technologies opened the Scalable Knowledge Systems Workshop with an invited talk. He reminded us of the facts of nature concerning the cost of distributed computing and of running out of space for the working set. Developers in the semantic web field deplorably often ignore these facts, or else recognize them and concede them unbeatable, saying one just can't join across partitions.

I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we profit from distributed-memory multiprocessing in spite of them. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone in a cluster connected by a network, totally depends on having many relatively long-running tasks going at a time and blocking as seldom as possible.

Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files.

Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that such a structure, especially with a memory-mapped file, probably has a better best case than a B-tree, but is also less predictable under fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With disk-based, memory-mapped linked-list storage we would see some improvement there, while probably getting hit worse than now by fragmentation. Moreover, compaction on the fly would not be nearly as easy, and surely far less local, if there were pointers between pages. So my intuition is that trees are a safer bet under varying workloads, while linked lists can be faster in a query-dominated, in-memory situation.

Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.

Regardless of steady state, BSBM has two generally valid conclusions:

  1. mapping relational to RDF, where possible, is faster than triple storage; and
  2. the equivalent relational solution can be some 10x faster than the pure triples representation.

Mike Dean asked whether BSBM was a case of a setup to have triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged — take the Billion Triples Challenge data set for one.

Also, with BSBM, one should note that query optimization time plays a fairly large role, since most queries touch relatively little data. And even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL, since the difference there lies in compiling the query, especially since parameters are not used. So, Chris, since we both like to map, let's make a benchmark that shows mapping closer to native SQL.

Bridging the 10x Gap?

When we run Virtuoso relational against the Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values, with almost no penalty for each extra column. Also, since the total number of joins is lower, the overhead of moving from one join to the next is likewise lower. This is simply a matter of the count of executed instructions.

A column store joins, in principle, just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, and the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality; we do this in some SQL situations and can try it with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of the control path. A column store has a simpler control flow if it knows that a column will have exactly one value per row; with RDF, this is not a given. Also, the column store's row is identified by a single number and not a multipart key. These two factors give a column store running with a fixed schema some edge over the more generic RDF quad store.

There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. For the ideal row store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables.

Scale, Where Next?

Concerning the future direction of the workshop, there were a few directions suggested. One of the more interesting ones was Mike Dean's suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.

11/03/2008 13:16 GMT Modified: 11/03/2008 12:33 GMT
Virtuoso 5.0.6 Updates

I will here summarize the developments since the last Virtuoso 5 Open Source release.

On the RDF side, the bitmap intersection join has been improved quite a bit, so that it is now almost always more than 2x as efficient as the equivalent nested-loop join.

XML trees in the object position of RDF quads were in some cases incorrectly indexed, leading to failures to retrieve quads. This is fixed; should problems occur in existing databases, they can be corrected by simply dropping and re-creating the index.

Also, the cost model has been further tuned. We have run the TPC-H queries with larger databases and have profiled them extensively. There are improvements to locking, especially for concurrency of transactions with large shared lock sets, as is the case in the TPC-H queries; the rules stipulate that these have to be run with repeatable read. There are also optimizations for decimal floating point.

A sampling of TPC-H queries translated into SPARQL comes with the new demo database.  These show a live sample of the TPC-H schema translated into linked data, complete with SPARQL translations of the original queries.  Some work is still ongoing there but the relational to RDF mapping is mature enough for real business intelligence applications now.

On the closed source side, we have some adjustments to the virtual database.  When using Virtuoso as a front end to Oracle, using the TPC-H queries as a metric, the virtual database overhead is minimal.  Previously, we had some overhead because some queries were rewritten in a way that Oracle would not optimize as well as the original TPC-H text.  Specifically, turning an IN sub-query predicate into an equivalent EXISTS did not sit well with Oracle.

03/25/2008 16:59 GMT Modified: 03/26/2008 11:59 GMT
What's Wrong With LUBM?

In the interest of participating in a community benchmark development process, I will here outline some desiderata and explain how we could improve on LUBM. I will also touch on the message such an effort ought to convey.

A blow-by-blow analysis of the performance of a complex system such as a DBMS is more than fits within the scope of human attention at one go. This is why it must all be abbreviated into a single metric; only thus abbreviated can the information be used in context. The metric's practical value is relative to how well it predicts the performance of the system in some real task, meaning a task not likely to be addressed by an alternative technology unless the challenger clearly beats the incumbent.

A benchmark is promotional material, both for the system being benchmarked and for the technology as a whole. This is why the benchmark, whatever it does, should do something that the technology does well, surely better than any alternative technology. A case in point: one ought not to take a pure relational workload and RDF-ize it, for then the relational variant is likely to come out on top.

In this regard LUBM is not so bad, because its reliance on class and property hierarchies and the occasional transitivity or inference rule makes the workload typical of RDF, a little apart from a purely relational implementation of the task.

RDF's claim to fame is linked data. This means giving things globally unique names and thereby making anything joinable with anything else, insofar as there is agreement on the names. RDF is the key to a new class of problems; call it the web-scale database. Web scale here refers first to the heterogeneity and multiplicity of independent sources, and secondly to the volume of data.

Now there are plenty of relational applications with very large volumes of data. On the non-relational side, there are even larger applications, such as web search engines. All these have a set schema and a specific workload they are meant to address. RDF versions of such are conceivable but hold no intrinsic advantage if considered in the specific niche alone.

The claim to fame of RDF is not to outperform these on their home turf but to open another turf altogether, allowing agile joining and composing of all these resources.

This is why a benchmark, i.e., an advertisement for the RDF value proposition, should not just take a relational workload and RDF-ize it. The benchmark should carry some of the web in it.

If we just intend to measure how well an RDF store joins triples to other triples, LUBM is almost good enough. If it defined a query mix with different frequencies for short and long queries and a concurrent query metric, it would be pretty much there. Our adaptation of it is adequate for counting joins per second. But joins per second is not a value proposition.

So we have two questions:

  1. If we just take the RDF model and SPARQL, how do we make a benchmark that fills in what LUBM does not cover?

  2. How do we make a benchmark that displays RDF's strengths against a comparable relational solution? A priori, by going somewhere where SQL has trouble reaching.

The answers to the first are not very complex:

  • Add some optionals. Have different frequencies of occurrence for some properties.

  • Add different graphs. Make queries joining between graphs and drawing on different graphs. Querying against all graphs of the store is not part of the language; it would still be useful, but leave it out for now.

  • Add some filters and arithmetic. Not much can be done there, though, because expressions cannot be returned and there is no aggregation or grouping.

  • Split the workload into short and long queries. The short should be typical of online use and the long ones of analysis. Different execution frequencies for different queries are a must. Analysis is limited by the lack of grouping, expressions, and aggregation; still, something can be contrived by looking for a pattern that does not exist or occurs extremely rarely. Producing result sets of millions of rows is not realistic.

  • Many of the LUBM queries return thousands of rows, even when scoped to a single university. This is not very realistic; no user interface displays that sort of quantity. Of course, the intermediate results can be as large as you please, but the output must be somehow ranked. SPARQL has ORDER BY and LIMIT, so these will have to be used (see the sketch after this list). TPC-H, for example, almost always has a GROUP BY/ORDER BY combination and sometimes a limit on result rows.

  • The degree of inference in LUBM is about right, mostly sub-classes and sub-properties, nothing complex. We certainly regard this as a database benchmark more than a knowledge representation or rule system one.

  • LUBM does an OK job of defining a scale factor. I think that a concurrent query metric can just be so many queries per unit of time at a given scale. The number of clients, I would say, can be decided by the test sponsor, taking whatever works best; a load balancer or web server can always be tuned to enforce some limit on concurrency. I don't think a scale rule like TPC-C's, where only so many transactions per minute are allowed per warehouse, is needed here. The effect of such a rule is that reporting a higher throughput automatically requires a bigger database.
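For illustration, here is LUBM's first query recast in the online style argued for above; the ORDER BY and LIMIT are the additions (a sketch):

sparql 
   PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
   SELECT ?x 
     FROM <lubm> 
    WHERE { ?x  a               ub:GraduateStudent  . 
            ?x  ub:takesCourse  <http://www.Department0.University0.edu/GraduateCourse0> 
          }
    ORDER BY ?x 
    LIMIT 20;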

There is nothing to prevent these improvements from being put into a subsequent version of LUBM.

Building something that shows RDF at its best is a slightly different proposition. For this, we cannot be limited to the SPARQL recommendation and must allow custom application code and language extensions. Examples would be scripting similar to SQL stored procedures and extensions such as we have made for sub-queries and aggregation, explained a couple of posts back.

Maybe the Billion Triples challenge produces some material that we can use for this. We need to go for spaces that are not easily reached with SQL, have distributed computing, federation, discovery, demand driven import of data and such like.

I'll write more about ways of making RDF shine in some future post.

There are two kinds of workloads: online and offline. Online is what must be performed in an interactive situation, without significant human-perceptible delay, i.e., within 500 ms. Anything else is offline.

Because this is how any online system is designed, this should be reflected in the benchmark. Ideally we would make two benchmarks.

02/05/2008 11:47 GMT Modified: 03/25/2008 14:43 GMT
LUBM results with Virtuoso 6.0

We have now run the LUBM benchmark on Virtuoso v6, with the same configuration as discussed last Friday.

We had a database of 8000 universities, and we ran 8 clients on slices of 100, 1000 and 8000 universities — same data but different sizes of working set.

 100 universities: 35.3 qps
1000 universities: 26.3 qps
8000 universities: 13.1 qps

The 100 universities slice is about the same as with v5.0.5 (35.3 vs 33.1 qps).
The 8000 universities set is almost 3x better (13.1 vs. 4.8 qps).

This comes from the fact that the v6 database takes half the space of the v5.0.5 one. Further, this is with 64-bit IDs for everything; if the v5.0.5 database used 64-bit IDs, the difference would be over 3x. This is worth something if it lets you get by with only 1 terabyte of RAM for the 100-billion-triple application, instead of 3 TB.

In a few more days, we'll give the results for Virtuoso v6 Cluster.

02/04/2008 09:58 GMT Modified: 08/28/2008 12:06 GMT
Latest LUBM Benchmark results for Virtuoso

We have now taken a close look at the query side of the LUBM benchmark, as promised a couple of blog posts ago.

We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database.

When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O-bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G of RAM with 2 x Xeon 5130.

We adapted some of the queries so that they do not run over the whole database. In terms of triples retrieved per second, the rate of 33 qps corresponds to about 330,000, with 4 cores at 2GHz, i.e., roughly 10,000 triples per query. This is a combination of random access, linear scans, and bitmap merge intersections; lookups for triples not found are not counted. The rate of random lookups alone, based on known G, S, P, O, without any query logic, is about 250,000 random lookups per core per second.

The article LUBM and Virtuoso gives the details.

In the process of going through the workload we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things.

Now, what has been said about RDF benchmarking previously still holds. What does it mean to do so many LUBM queries per second? What does this say about the capacity to run an online site off RDF data? Or about information integration? Not very much. But then this was not the aim of the authors either.

So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there.

In the immediate future, we have the general availability of Virtuoso Open Source 5.0.5 early next week. This comes with a LUBM test driver and a test suite running against the LUBM qualification database.

After this we will give some numbers for the cluster edition with LUBM and TPC-H.

02/01/2008 14:39 GMT Modified: 08/28/2008 12:06 GMT