Updated hardware improves LUBM 8000 load rate in Virtuoso 6
[Orri Erling]
We repeated the earlier LUBM 8000 experiment on a newer machine with 2 x Xeon 5520 and 72G of 1333 MHz memory, and then once again with the two machines (old and new) as a networked cluster. Otherwise the settings were the same.
The load rate on the new machine alone is now 160,739 triples-per-second; the two-machine cluster reaches 214,188.
|  | Virtuoso 6 (previous run) | Virtuoso 6 (new run) | Virtuoso 6 (newest run) |
| blades | 1 | 1 | 2 |
| processors | 2 x Xeon 5410 | 2 x Xeon 5520 | 2 x Xeon 5520 + 2 x Xeon 5410, with 1x1GigE interconnect |
| memory | 16G 667 MHz | 72G 1333 MHz | 72G 1333 MHz + 16G 667 MHz, respectively |
| reported load rate (triples-per-second) | 110,532 | 160,739 | 214,188 |
Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.
08/14/2009 19:01 GMT | Modified: 08/15/2009 15:27 GMT
Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000
[Orri Erling]
LUBM load speed still seems to be a metric quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000 (1,068 million triples) on the newest Virtuoso.
The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.
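For illustration, striping is configured in each partition's virtuoso.ini roughly as below. This is a sketch only; the segment size and file paths are placeholders, not the actual benchmark settings:
[Database]
Striping = 1

[Striping]
; one segment striped over the 6 disks; size and paths are illustrative
Segment1 = 10G, /disk1/lubm.db, /disk2/lubm.db, /disk3/lubm.db, /disk4/lubm.db, /disk5/lubm.db, /disk6/lubm.db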
The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk wait; at the end, it was around 700% with 25% disk wait. Here, 100% corresponds to one CPU core (or one disk) being constantly busy.
The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.
We do not think that LUBM loading is a realistic real-world benchmark, but since other people publish such numbers, so do we.
06/29/2009 12:12 GMT | Modified: 08/15/2009 16:06 GMT
Virtuoso RDF: A Getting Started Guide for the Developer
[Orri Erling]
It is a long-standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.
The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.
You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under Emacs (M-x shell) is best, though the open source isql utility has command line editing as well; the Emacs shell is convenient for cutting and pasting between the shell and files.
To get started, cd into binsrc/tests/lubm.
To verify that this works, you can do
./test_server.sh virtuoso-t
This runs the server against the LUBM queries and should report 45 tests passed. After this, we will go through the same steps one at a time.
Loading the Data
The file lubm-load.sql contains the commands for loading the LUBM single-university qualification database.
The data files themselves are in the lubm_8000 directory: 15 files in RDF/XML.
There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.
So now let's go through this procedure.
Start the server:
$ virtuoso-t -f &
The -f flag starts the server in foreground mode; the trailing & puts it in the background of the shell.
Now we connect to it with the isql utility.
$ isql 1111 dba dba
This gives a SQL> prompt. The default username and password are both dba.
When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.
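A minimal sketch of what such a procedure can look like (the actual definition in lubm-load.sql may differ in details; the base URI below is a placeholder):
create procedure load_lubm (in dir varchar)
{
  declare files any;
  declare i int;
  files := sys_dirlist (dir, 1);   -- list the plain files in the directory
  for (i := 0; i < length (files); i := i + 1)
    {
      if (files[i] like '%.owl')
        DB.DBA.RDF_LOAD_RDFXML (
          file_to_string (dir || '/' || files[i]),  -- the RDF/XML text
          'http://lubm.example.org/',               -- base URI; a placeholder
          'lubm' );                                 -- target graph
    }
  commit work;
}
;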
Next it calls this function for the lubm_8000 directory under the server's working directory.
sparql
CLEAR GRAPH <lubm>;
sparql
CLEAR GRAPH <inf>;
load_lubm ( server_root() || '/lubm_8000/' );
Then it verifies that the right number of triples is found in the <lubm> graph.
sparql
SELECT COUNT(*)
FROM <lubm>
WHERE { ?x ?y ?z } ;
The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.
Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.
sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
INSERT
INTO GRAPH <lubm>
{ ?x ub:subOrganizationOf ?z }
FROM <lubm>
WHERE { ?x ub:subOrganizationOf ?y .
?y ub:subOrganizationOf ?z .
};
Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.
DB.DBA.TTLP ( file_to_string ( 'inf.nt' ),
'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl',
'inf'
) ;
sparql
SELECT COUNT(*)
FROM <inf>
WHERE { ?x ?y ?z } ;
Then we declare that the triples in the <inf> graph can be used for inference at run time. A SPARQL query enables this by declaring that it uses the 'inft' rule set; for queries that do not, the declaration has no effect.
rdfs_rule_set ('inft', 'inf');
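For example, a query opts in to the rule set with the input:inference pragma. This is a sketch; the class queried is illustrative:
sparql
define input:inference 'inft'
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT COUNT(*)
FROM <lubm>
WHERE { ?x a ub:Student } ;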
This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.
checkpoint;
Now we are ready for querying.
Querying the Data
The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.
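To illustrate the first style: where run-time inference would simply ask for ?x a ub:Student with the rule set enabled, the open-coded variant spells out the subclasses as a UNION (a sketch; the actual queries in lubm.sql may differ):
sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT COUNT(*)
FROM <lubm>
WHERE { { ?x a ub:UndergraduateStudent } UNION { ?x a ub:GraduateStudent } } ;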
If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.
For example:
SQL> sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT *
FROM <lubm>
WHERE { ?x a ub:Publication .
?x ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>
};
VARCHAR
_______________________________________________________________________
http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5
6 Rows. -- 4 msec.
To stop the server, simply type shutdown; at the SQL> prompt.
If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like —
[HTTPServer]
ServerPort = 8421
ServerRoot = .
ServerThreads = 2
— to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).
Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
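For example, with curl (a sketch; the endpoint also accepts the standard SPARQL Protocol GET form with a query parameter):
$ curl 'http://localhost:8421/sparql' \
    --data-urlencode 'query=SELECT COUNT(*) FROM <lubm> WHERE { ?x ?y ?z }'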
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.
To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF.
To work with large volumes of data, you must give the server more memory in the configuration file and use row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction and will run out of rollback space. See the documentation for more.
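Putting it together, a minimal large-load sequence at the SQL> prompt looks like this (reusing the loader from above):
log_enable (2);    -- row-autocommit mode, as described above
load_lubm ( server_root() || '/lubm_8000/' );
checkpoint;        -- make the loaded state durable and truncate the transaction log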
12/17/2008 12:31 GMT | Modified: 12/17/2008 12:41 GMT
ISWC 2008: Some Questions
[Orri Erling]
Inference: Is it always forward chaining?
We got a number of questions about Virtuoso's inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.
Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.
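A sketch of such a single statement, using the transitive option on a subquery to materialize the closure of ub:subOrganizationOf (syntax per Virtuoso's transitivity extensions; graph, property, and option details here are illustrative):
sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
INSERT INTO GRAPH <lubm>
  { ?x ub:subOrganizationOf ?z }
FROM <lubm>
WHERE
  {
    { SELECT ?x ?z
      WHERE { ?x ub:subOrganizationOf ?z } }
    OPTION ( TRANSITIVE, T_DISTINCT, T_IN (?x), T_OUT (?z) ) .
  } ;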
For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.
We had a word with Ian Horrocks on the question. He noted that the community is often naive in equating a description of semantics with a description of an algorithm. The data need not always be blown up.
The advantage of not always materializing is that the working set stays in memory better. Once the working set no longer fits in memory, response times jump disproportionately. Also, if the data changes, is retracted, or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement: it can lead to a lot of consequences that are hard to retract. On the other hand, when running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if entailed subclasses and sub-properties are materialized rather than resolved at run time.
Genetic Algorithms for SPARQL?
Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach for their paper on using genetic algorithms to guess how variables in a SPARQL query ought to be instantiated. Prisoners of our "conventional wisdom" as we are, this might never have occurred to us.
Schema Last?
It is interesting to see how industry comes to the semantic web conferences talking about schema-last, while at the same time the traditional semantic web people stress enforcing schema constraints and making logics that perform more predictably and are friendlier to databases. Thus the extremes converge.
There is a point to schema-last. RDF is very good for getting a view of ad hoc or unknown data: one can just load it and look at what is there. Also, adding unforeseen optional properties or relations to the schema is easy and efficient. However, it seems that a really high-traffic online application would always benefit from having some application-specific data structures, which could also save considerably on hardware.
The divide between RDF and relational, application-oriented representation is not a sharp one. We have the capabilities in our RDB-to-RDF mapping; we just need to show this, and to support SPARUL and data loading over mapped data as well.
11/04/2008 15:54 GMT | Modified: 11/04/2008 14:36 GMT
ISWC 2008: The Scalable Knowledge Systems Workshop
[Orri Erling]
Mike Dean of BBN Technologies opened the Scalable Knowledge Systems Workshop with an invited talk. He reminded us of the facts of nature concerning the cost of distributed computing and of running out of memory for the working set. Developers in the semantic web field deplorably often ignore these facts, or alternatively recognize them and concede that they are unbeatable, that one just can't join across partitions.
I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible.
Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files.
Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that, especially with a memory-mapped file, this will probably have a better best case than a B-tree, but that it will also be less predictable under fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With a disk-based, memory-mapped, linked-list storage, we would see some improvement in this, while probably getting hit worse than now by fragmentation. Also, compaction on the fly would not be nearly as easy, and surely far less local, if there were pointers between pages. So my intuition is that trees are a safer bet with varying workloads, while linked lists can be faster in a query-dominated, in-memory situation.
Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.
Regardless of steady state, BSBM has two generally valid conclusions:
- mapping relational to RDF, where possible, is faster than triple storage; and
- the equivalent relational solution can be some 10x faster than the pure triples representation.
Mike Dean asked whether BSBM was a setup designed to make triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies, and therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged — take the Billion Triples Challenge data set for one.
Also, with BSBM, one should note that query optimization time plays a fairly large role, since most queries touch relatively little data. And even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL, since the difference there lies in compiling the query, especially as parameters are not used. So, Chris, since we both like to map, let's make a benchmark that shows mapping closer to native SQL.
Bridging the 10x Gap?
When we run Virtuoso relational against the Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra columns. Also, since the total number of joins is lower, the overhead of moving from one join to the next is likewise lower. This is simply a matter of the count of executed instructions.
A column store joins, in principle, just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local: the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of the control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row; with RDF, this is not a given. Also, the column store's row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.
There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible; we will see. For the ideal row-store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one from having special indices to support such a workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense; we just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables.
Scale, Where Next?
Concerning the future direction of the workshop, there were a few directions suggested. One of the more interesting ones was Mike Dean's suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.
11/03/2008 13:16 GMT | Modified: 11/03/2008 12:33 GMT