Orri Erling
Virtuoso - Are We Too Clever for Our Own Good? (updated)

"Physician, heal thyself," it is said. We profess to say what the messaging of the semantic web ought to be, but is our own perfect?

I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times.

I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer).

"It is not always clear what the main thrust is, we get the impression that you are spread too thin," said Sören Auer.

Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.

This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store.

We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there.

The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.

There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).

Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering.

The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.

This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story — for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?
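
To make this concrete, here is a sketch of the kind of query we mean, written against a hypothetical RDF rendering of TPC-H (the tpch: class and property names are illustrative, not our actual mapping) and using the aggregate and grouping extensions we have added to SPARQL; the exact extension syntax shown here is an approximation.

    -- revenue per customer nation over a hypothetical tpch: mapping (names illustrative)
    SPARQL
    PREFIX tpch: <http://example.com/tpch#>
    SELECT ?nation (SUM (?price * (1 - ?discount)) AS ?revenue)
    WHERE
      {
        ?cust  a tpch:Customer ;
               tpch:hasNation ?nation .
        ?order tpch:hasCustomer ?cust .
        ?item  tpch:partOfOrder ?order ;
               tpch:extendedPrice ?price ;
               tpch:discount ?discount .
      }
    GROUP BY ?nation
    ORDER BY DESC (?revenue);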

The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what tutorials already exist, adapting them, and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?

"You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid."

We should answer in multiple parts.

For general collateral, like web sites and documentation:

The web site gives a confused product image. For the Virtuoso product, we should divide at the top into

  • Data web and RDF - Host linked data, expose relational assets as linked data;
  • Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;
  • Web Services - access all the above over standard protocols, dynamic web pages, web hosting.

For each point, one simple statement; we all know what the above things mean.

Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale: that much more data or, in some cases, that many more requests per second. This too is clear.

As far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web-server-only situation. The fact that we have a web listener is important for some things, but our claim to fame does not rest on this.

Then for documentation and training materials: The documentation should be better. Specifically, it should have more of a how-to dimension, since nobody reads the whole thing anyhow. As for the online tutorials, the order of presentation should be different; they do not really reflect what is important at the present moment.

Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because they do not have enough references and do not explain what is obvious to us.

I think that the communications failure in this case is that we want to talk about end-to-end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about the query cost model alone: the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to the cost model, as these are important to the whole, but do not find them to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.

"Virtuoso is very big and very difficult"

One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.

This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.
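
For instance, once the server is up, both query languages are available over the same connection; run through the isql command line tool, a first sanity check is as simple as the two statements below (SYS_USERS is a system table present in every fresh database).

    -- plain SQL against the freshly created database
    SELECT COUNT (*) FROM DB.DBA.SYS_USERS;

    -- SPARQL over the same connection, by prefixing the statement with the SPARQL keyword
    SPARQL SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10;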

Now for the difficult side:

Use SQL and SPARQL; use stored procedures whenever there is server-side business logic. For some time-critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to — PHP or Java or anything else. For web services, simple is best. Stick to basics. "The engineer is one who can invent a simple thing." Use SQL statements rather than the admin UI.
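
As a minimal sketch of what server-side logic in Virtuoso/PL looks like (the table BLOG.DBA.POSTS is hypothetical, only there to show the shape of the language):

    -- collect the latest post titles; BLOG.DBA.POSTS (TITLE, TS) is a hypothetical table
    CREATE PROCEDURE BLOG_RECENT_TITLES (IN how_many INTEGER)
    {
      DECLARE titles ANY;
      titles := vector ();
      FOR (SELECT TITLE FROM BLOG.DBA.POSTS ORDER BY TS DESC) DO
        {
          IF (length (titles) >= how_many)
            RETURN titles;
          titles := vector_concat (titles, vector (TITLE));
        }
      RETURN titles;
    }
    ;
    -- procedures are then callable from plain SQL, e.g. as BLOG_RECENT_TITLES (5)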

Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered.

We should put this into a couple of use case oriented how-tos.

Also, we should create a network of "friendly local virtuoso geeks" for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there — please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.

"OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?"

Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.

If one really wants to do one's server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.

We are talking about such things with different parties at present.

"How webby are you? What is webby?"

"Webby means distributed, heterogeneous, open; not monolithic consolidation of everything."

We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice — the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.

Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.

"How does OpenLink see the next five years unfolding?"

Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.

The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.

Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.

Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.

With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.

For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.

This makes some things possible that were hard thus far.

On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.

Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.

Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.

Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.

# PermaLink Comments [0]
10/26/2008 12:15 GMT Modified: 10/27/2008 12:07 GMT
ESWC 2008

Yrjänä Rankka and I attended ESWC2008 on behalf of OpenLink.

We were invited at the last minute to give a Linked Open Data talk at Paolo Bouquet's Identity and Reference workshop. We also had a demo of SPARQL BI (PPT; other formats coming soon), our business intelligence extensions to SPARQL, as well as joining between relational data mapped to RDF and native RDF data. I was also speaking at the social networks panel chaired by Harry Halpin.

I have gathered a few impressions that I will share in the next few posts (1 - RDF Mapping, 2 - DARQ, 3 - voiD, 4 - Paradigmata). Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.

The voiD Graph

Linked Open Data was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for LOD this time around is voiD, the Vocabulary of Interlinked Datasets, a means to describe what in fact is inside the LOD cloud, how to join it with what and so forth. Big time important if there is to be a web of federatable data sources, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of Linked Data Planet in a couple of weeks.

Federating

Bastian Quilitz gave a talk about his DARQ, a federated version of Jena's ARQ.

Something like DARQ's optimization statistics should make their way into the SPARQL protocol as well as the voiD data set description.

We really need federation but more on this in a separate post.

XSPARQL

Axel Polleres et al had a paper about XSPARQL, a merge of XQuery and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the SPARQL protocol for this to be practical. We could do this. We'll have to see when.

Politically, using XQuery to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.

Panel: Will the Sem Web Rise to the Challenge of the Social Web?

The social web panel presented the question of whether the sem web was ready for prime time with data portability.

The main thrust was expressed in Harry Halpin's rousing closing words: "Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our semantic web, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now."

So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as OAuth and OpenID. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the "outdated pace" of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything.

I said that triples and vocabularies existed, that OpenLink had ODS (OpenLink Data Spaces, Community LinkedData) for managing one's data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today's prices, without replication. Count 16G RAM and a few cores per Gtriple so that one is not waiting for disk all the time.

The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.

So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!
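
This is why, in Virtuoso, sameAs expansion is a per-query option rather than something materialized at load time. As far as I recall the pragma is spelled as below; treat the exact spelling as an approximation and check the documentation.

    -- ask for owl:sameAs closure to be applied while answering this query only
    SPARQL
    DEFINE input:same-as "yes"
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?homepage
    WHERE
      {
        <http://dbpedia.org/resource/Berlin> foaf:homepage ?homepage .
      };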

As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, information overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.

Identity and Reference

For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by nominibus suis, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.

For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.

Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an open world. For an app-specific warehouse, sameAs can be resolved at load time.

There was naturally some apparent tension between the Occam camp of entity name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.

CYC and NLP

The semantic search workshop was interesting, especially CYC's presentation. CYC is, as it were, the grand old man of knowledge representation. Over the long term, I would like to have support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge knowledge base for helping in search type queries. If it is for transactions or financial reporting, then queries will be SQL and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We'll see.

I will also have a closer look at natural language one of these days, quite inevitably, since Zitgist (for example) is into entity disambiguation.

Scale

Garlik gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.

Garlik probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and Sindice/Zitgist as triples.

Conclusions

There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.

For the Virtuoso roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At Linked Data Planet, we will make more announcements.

# PermaLink Comments [0]
06/09/2008 13:49 GMT Modified: 06/11/2008 13:15 GMT
Virtuoso Cluster Preview

I wrote the basics of the Virtuoso clustering support over the past three weeks. It can now manage connections, decide where things go, do two phase commits, and insert and select data from tables partitioned over multiple Virtuoso instances. It works well enough to be measured; I will blog more about the measurements over the next two weeks.

I will in the following give a features preview of what will be in the Virtuoso clustering support when it is released in the fall of this year (2007).

Data Partitioning

A Virtuoso database consists of indices only, so that the row of a table is stored together with the primary key. Blobs are stored on separate pages when they do not fit inline within the row. With clustering, partitioning can be specified index by index. Partitioning means that values of specific columns are used for determining where the containing index entry will be stored. Virtuoso partitions by hash and allows specifying what parts of the partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string. This way, key compression gains are not lost by storing consecutive values on different partitions.

Once the partitioning is specified, we specify which set of cluster nodes stores this index.  Not every index has to be split evenly across all nodes.  Also, all nodes do not have to have equal slices of the partitioned index, accommodating differences in capacity between cluster nodes.
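
To give a feel for the declarations involved, partitioning is specified roughly as below; the statement partitions the RDF quad index by subject, masking out the low bits of the IRI ID so that consecutive IDs hash to the same partition. The exact syntax is an approximation from memory and may still change before release.

    -- hash-partition the RDF quad index on S, ignoring the low 8 bits of the ID
    ALTER INDEX RDF_QUAD ON DB.DBA.RDF_QUAD PARTITION (S INT (0hexffff00));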

Each Virtuoso instance can manage up to 32TB of data.  A cluster has no definite size limit.

Load Balancing and Fault Tolerance

When data is partitioned, an operation on the data goes where the data is.  This provides a certain natural parallelism but we will discuss this further below.

Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load.  Some data, such as database schema, is replicated on all nodes.  When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition.  If this is the case, updates go to all nodes and reads go to a randomly picked node from the group.

If one of the nodes in the group fails, operation can resume with the surviving node.  The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure and again at the time of the failed node rejoining the cluster but these are aborts as in the case of deadlock and lose no committed data.

Shared Nothing

The Virtuoso architecture does not require a SAN for disk sharing across nodes. This is reasonable since a few disks on a local controller can easily provide 300MB/s of read, and passing this over an interconnect fabric that also has to carry inter-node messages could saturate even a fast network.

Client View

A SQL or HTTP client can connect to any node of the cluster and get an identical view of all data with full transactional semantics.  DDL operations like table creation and package installation are limited to one node, though.

Applications such as ODS will run unmodified.  They are installed on all nodes with a single install command.  After this, the data partitioning must be declared, which is a one time operation to be done cluster by cluster.  The only application change is specifying the partitioning columns for each index.  The gain is optional redundant storage and capacity not limited to a single machine.  The penalty is that single operations may take a little longer when not all data is managed by the same process but then the parallel throughput is increased.  We note that the main ODS performance factor is web page logic and not database access.  Thus splitting the web server logic over multiple nodes gives basically linear scaling.

Parallel Query Execution

Message latency is the principal performance factor in a clustered database. Due to this, Virtuoso packs the maximum number of operations in a single message. For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the inner-loop retrievals are run in parallel. So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency. If each step of the join gets 10 rows, for a total of 100000 results, the message latency is not a significant factor and the cluster will clearly outperform a single node.

Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up even if doing a single transaction were faster on a single node.

Parallel SQL

There are SQL extensions for stored procedures allowing parallelizing operations.  For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned.  Transactional semantics are kept but error detection is deferred to the actual execution.

Transactions

Each transaction is owned by one node of the cluster, the node to which the client is connected.  When more than one node besides the owner of the transaction is updated, two phase commit is used.  This is transparent to the application code.  No external transaction monitor is required, the Virtuoso instances perform these functions internally.  There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction waiting information.

Since read transactions can operate without locks, reading the last committed state of uncommitted updated rows, waiting for locks is not very common.

Interconnect and Threading

Virtuoso uses TCP to connect between instances. A single instance can have multiple listeners at different network interfaces for cluster activity. The interfaces will be used in a round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface. Long messages, such as transfers of blobs, are done on a separate thread, thus allowing normal service on the cluster node while the transfer is proceeding.

We will have to test the performance of TCP over Infiniband to see if there is clear gain in going to a lower level interface like MPI.  The Virtuoso architecture is based on streams connecting cluster nodes point to point.  The design does not per se gain from remote DMA or other features provided by MPI.  Typically, messages are quite short, under 100K.  Flow control for transfer of blobs is however nice to have but can be written at the application level if needed.  We will get real data on the performance of different interconnects in the next weeks.

Deployment and Management

Configuring is quite simple: each process gets a copy of the same cluster configuration file, in which one line differs from host to host, telling the process which node it is. The database configuration files proper are individual per host, accommodating different file system layouts etc. Setting up a node requires copying the executable and two configuration files, no more. All functionality is contained in a single process. There are no installers to be run or such.

Changing the number or network interface of cluster nodes requires a cluster restart.  Changing data partitioning requires copying the data into a new table and renaming this over the old one.  This is time consuming and does not mix well with updates.  Splitting an existing cluster node requires no copying with repartitioning but shifting data between partitions does.

A consolidated status report shows the general state and level of intra-cluster traffic as count of messages and count of bytes.

Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical.

Present State and Next Developments

The basics are now in place.  Some code remains to be written for such things as distributed deadlock detection, 2-phase commit recovery cycle, management functions, etc.  Some SQL operations like text index, statistics sampling, and index intersection need special support, yet to be written.

The RDF capabilities are not specifically affected by clustering except in a couple of places.  Loading will be slightly revised to use larger batches of rows to minimize latency, for example.

There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins etc.  These will be added gradually.  These are however not really central to the first application of RDF storage but are quite important for business intelligence, for example.

We will run some benchmarks for comparing single host and clustered Virtuoso instances over the next weeks.  Some of this will be with real data, giving an estimate on when we can move some of the RDF data we presently host to the new platform.  We will benchmark against Oracle and DB2 later but first we get things to work and compare against ourselves.

We roughly expect a halving in space consumption and a significant increase in single query performance and linearly scaling parallel throughput through addition of cluster nodes.

The next update will be on this blog within two weeks.

# PermaLink Comments [0]
08/27/2007 09:44 GMT Modified: 04/25/2008 11:59 GMT
Ideas on RDF Store Benchmarking

This post presents some ideas and use cases for RDF store benchmarking.

Use Cases

  • Basic triple storage and retrieval. The LUBM benchmark captures many aspects of this.
  • Recursive rule application. The simpler cases of this are things like transitive closure.
  • Mapping of relational data to RDF. Since relational benchmarks are well established, as in the TPC benchmarks, the schemas and test data generation can come from there. The problem is that the D/H/R benchmarks consist of aggregates and grouping exclusively but SPARQL does not have these.

Benchmarking Triple Stores

An RDF benchmark suite should meet the following criteria:

  • Have a single scale factor.
  • Produce a single metric, queries per unit of time, for example. The metric should be concisely expressible, for example 10 qpsR at 100M, options 1, 2, 3. Due to the heterogeneous nature of the systems under test, the result's short form likely needs to specify the metric, scale and options included in the test.
  • Have optional parts, such as different degrees of inferencing and maybe language extensions such as full text, as this is a likely component of any social software.
  • Have a specification for a full disclosure report, TPC style, even though we can skip the auditing part in the interest of making it easy for vendors to publish results and be listed.
  • Have a subject domain where real data are readily available and which is broadly understood by the community. For example, SIOC data about on-line communities seems appropriate. Typical degree of connectedness, number of triples per person, etc., can be measured from real files.
  • Have a diverse enough workload. This should include initial bulk load of data, some adding of triples during the run and continuous query load.

The query load should illustrate the following types of operations:

  • Basic lookups, such as would be made for filling in a person's home page in a social networks app. List data of user plus names and emails of friends. Relatively short joins, unions, and optionals.
  • Graph operations like shortest path from individual to individual in a social network.
  • Selecting data with drill down, as in faceted browsing. For example, start with articles having tag t, see distinct tags of articles with tag t, select another tag t2 to see the distinct tags of articles with both t and t2, and so forth; a SPARQL sketch of one such step follows this list.
  • Retrieving all closely related nodes, as in composing a SIOC snapshot over a person's post in different communities, the recent activity report for a forum etc. These will be construct or describe queries. The coverage of describe is unclear, hence construct may be better.

If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to TPC-C. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.

Full Disclosure Report

The report contains basic TPC-like items such as:

  • Metric qps/scale/options
  • Software used, DBMS, RDF toolkit if separate
  • Hardware. Number, clock and type of CPUs per machine, number of machines in cluster, RAM per machine, disks per machine, manufacturer, price of hardware/software

These can go into a summary spreadsheet that is just like the TPC ones.

Additionally, the full report should include:

  • Configuration files for DBMS, web server, other components.
  • Parameters for test driver, i.e., number of clicks, how many concurrent clicks. The tester determines the degree of parallelism that gets the best throughput and should indicate this in the report. Making a graph of throughput as function of concurrent clients is a lot of work and maybe not necessary here.
  • Duration in real time. Since for any large database with a few G of working set the warm up time is easily 30 minutes, the warm up time should be mentioned but not included in the metric. The measured interval should not be less than 1h in duration and should reflect a "steady state," as defined in the TPC rules.
  • Source code of server side application logic. This can be inference rules, stored procedures, dynamic web pages or any other server side software-like thing that exists or is modified for the purpose of the test.
  • Specification of test driver. If there is a commonly used test driver, its type, parameters and version. If the test driver is custom, reference to its source code.
  • Database sizes. For a preallocated database of n G, how much was free after the initial load, how much after the test run? How many bytes per triple.
  • CPU/IO. This may not always be readily measurable but is interesting still. Maybe a realistic spec is listing the sum of CPU minutes across all  server machines and server processes. For IO, maybe the system totals from iostat before and after the full run, including load and warm-up. If the DBMS and RDF toolkits are separate, it is interesting to know the division of CPU time between them.

Test Drivers

OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads. For example, 10000 users with 100 threads, each user with its own state, so that they carry out their respective usage patterns independently, getting served as soon as the server is available, still having no more than m requests going at any time. The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart etc. This can be modified to browse a social network database and produce the desired query mix. This generates HTTP requests, hence would work against a SPARQL end point or any set of dynamic web pages.

The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.

This can be packaged as a separate open source download once the test spec is agreed upon.

For generating test data, a modification of the LUBM generator is probably the most convenient choice.

Benchmarking Relational to RDF Mapping

This area is somewhat more complex than triple storage.

At least the following factors enter into the evaluation: 

  • Degree of SPARQL compliance. For example, can one have a variable as predicate? Are there limits on optionals and unions?
  • Are the data being queried split over multiple RDBMS and joined between them?
  • Type of use case. Is this about navigational lookups or about statistics? OLTP or OLAP? It would be the former, as SPARQL does not really have aggregation. Still, many of the interesting queries are about comparing large data sets.

The rationale for mapping relational data to RDF is often data integration. Even in simple cases like the OpenLink Data Spaces applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.

A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.

A real world case is OpenLink's ongoing work for mapping WordPress, Mediawiki, phpBB, Drupal, and possibly other popular web applications into SIOC.

Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.

Another "enterprise style" scenario might be to take the TPC-C and TPC-D databases — after all both have products, customers and orders — and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.

Considering the times and the audience, the WordPress/Mediawiki scenario might be culturally more interesting and more fun to demo.

The test has two aspects: Throughput and coverage. I think these should be measured separately.

The throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."
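
Spelled out, such a query is on the order of the following; foaf:knows and sioc:has_creator stand in for "an author that I know", and the ex:tag property is again hypothetical.

    -- articles by authors I know, carrying both tags
    SPARQL
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX sioc: <http://rdfs.org/sioc/ns#>
    PREFIX ex:   <http://example.com/tags#>
    SELECT ?article ?author
    WHERE
      {
        <http://example.com/people/me> foaf:knows ?author .
        ?article sioc:has_creator ?author ;
                 ex:tag ex:t1 ;
                 ex:tag ex:t2 .
      };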

Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known at run time only, or if the graph is not given, we get a union of everything joined with another union of everything, and many of the joins between the terms of the different unions are identically empty but the software may not know this.

In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".
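
In SPARQL, that last query is essentially the one below; rendering "in the product ontology" with rdfs:isDefinedBy is one choice among several, and the prod: namespace is made up.

    -- everything called gizmo, with the predicate restricted only by its ontology
    SPARQL
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX prod: <http://example.com/product-ontology#>
    SELECT ?s ?p ?o
    WHERE
      {
        ?s rdfs:label "gizmo" .
        ?s ?p ?o .
        ?p rdfs:isDefinedBy prod: .
      };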

It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.

# PermaLink Comments [0]
11/21/2006 14:09 GMT Modified: 04/16/2008 16:53 GMT
More RDF scalability tests

We have lately been busy with RDF scalability. We work with the 8000 university LUBM data set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h.

The resulting database is 75G, 74 bytes per triple which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly filled pages. See Advances in Virtuoso RDF Triple Storage for an in-depth treatment of the subject.

The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range of the whole key. But before we go to so much trouble, we'll look at dropping a couple of critical sections from index random access. Also, some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still, we should not waste too much time on platform specifics. We'll see.

We just updated the Virtuoso Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.

We are also now applying the relational to RDF mapping discussed in Declarative SQL Schema to RDF Ontology Mapping to the ODS applications.

There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.

We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the cost model that there is now. A more systematic review of the cost model for SPARQL will come when we get to the queries.

We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the SQL engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.
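
As an illustration of the trade-off, with the LUBM univ-bench ontology a query for all students either relies on the entailed rdf:type triples being stored, or gets rewritten at run time into a union over the subclasses, roughly as below. The engine would generate this expansion internally rather than the user writing it out.

    -- run-time expansion of { ?x a ub:Student } over its subclasses,
    -- instead of storing the entailed type triples
    SPARQL
    PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
    SELECT ?x
    WHERE
      {
        { ?x a ub:Student }
        UNION
        { ?x a ub:UndergraduateStudent }
        UNION
        { ?x a ub:GraduateStudent }
      };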

# PermaLink Comments [0]
11/01/2006 19:26 GMT Modified: 04/16/2008 16:53 GMT
Recent Virtuoso Developments

We have been extensively working on virtual database refinements. There are many SQL cost model adjustments to better model distributed queries, and we now support direct access to Oracle and Informix statistics system tables. Thus, when you attach a table from one or the other, you automatically get up-to-date statistics. This helps Virtuoso optimize distributed queries. The documentation has also been updated accordingly, with a new section on distributed query optimization.

On the applications side, we have been keeping up with the SIOC RDF ontology developments. All ODS applications now make their data available as SIOC graphs for download and SPARQL query access.

What is most exciting however is our advance in mapping relational data into RDF. We now have a mapping language that makes arbitrary legacy data in Virtuoso or elsewhere in the relational world RDF query-able. We will put out a white paper on this in a few days.

Also we have some innovations in mind for optimizing the physical storage of RDF triples. We keep experimenting, now with our sights set on the high end of triple storage, towards billion-triple data sets. We are experimenting with a new, more space efficient index structure for better working set behavior. Next week will yield the first results.

# PermaLink Comments [0]
09/19/2006 11:59 GMT Modified: 04/16/2008 16:53 GMT
Virtuoso and ODS Update

We have released an update of Virtuoso Open Source Edition and the OpenLink Data Spaces suite.

This marks the coming of age of our RDF and SPARQL efforts. We have the new SQL cost model with SPARQL awareness, we have applications which present much of their data as SIOC, FOAF, ATOM OWL and other formats.

We continue refining these technologies. Our next roadmap item is mapping relational data into RDF and offering SPARQL access to relational data without data duplication. Expect a white paper about this soon.

# PermaLink Comments [0]
08/10/2006 11:06 GMT Modified: 04/16/2008 16:53 GMT
RDF Bulk Load and Virtuoso/PL Parallelism

We have been playing with the Wikipedia3 RDF data set, 48 million triples or so. We have for a long time foreseen the need for a special bulk loader for RDF, and this data set brought the need into immediate relevance.

So I wrote a generic parallel extension to Virtuoso/PL and SQL. This consists of a function for creating a queue that will feed async requests to be served on a thread pool of configurable size. Each of the worker threads has its own transaction, and the owner of the thread pool can look at or block for the return states of individual requests. This is a generic means for delegating work to async threads from Virtuoso/PL. Of course this can also be used at a lower level for parallelizing single SQL queries, for example aggregation of a large table or creating an index on a large table. Many applications, such as the ODS Feed Manager, will also benefit, since this makes it more convenient to schedule parallel downloads from news sources and the like. This extension will make its way into the release after next.
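
In rough terms, use from Virtuoso/PL looks like the sketch below. Since the extension is not released yet, take the function names (async_queue_allocate, aq_request, aq_wait_all) and their signatures as approximations of what will ship, and DB.DBA.LOAD_ONE_FILE as a hypothetical worker procedure.

    -- fan file loads out over a pool of worker threads, then wait for all of them
    CREATE PROCEDURE PARALLEL_LOAD (IN files ANY, IN n_threads INTEGER)
    {
      DECLARE aq ANY;
      DECLARE i INTEGER;
      aq := async_queue_allocate (n_threads);   -- pool of n_threads worker threads
      FOR (i := 0; i < length (files); i := i + 1)
        {
          -- each request runs DB.DBA.LOAD_ONE_FILE (a hypothetical loader) on a pool thread
          aq_request (aq, 'DB.DBA.LOAD_ONE_FILE', vector (aref (files, i)));
        }
      aq_wait_all (aq);                          -- block until every request has completed
    }
    ;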

But back to RDF. We presently have the primary key of the triple store as GSPO and a second index as PGOS. Using this mechanism, we will experiment with different multithreaded loading configurations. One thread translates from the IRI text representation to the IRI IDs, one thread may insert into the GSPO index, which is typically local and a few threads will share the inserting into the PGOS key. The latter key is inserted in random order, whereas the former is inserted mainly in ascending order when loading new data. In this way, we should be able to keep full load on several CPUs and even more disks.

It turns out that the new async queue plus thread pool construct is very handy for any pipeline or symmetric parallelization. When this is well tested, I will update the documents and maybe do a technical article about this.

Transactionality is not an issue in the bulk load situation. The graph being loaded will anyway be incomplete until it is loaded, other graphs will not be affected and no significant amount of locks will be held at any time by the bulk loader threads.

Also later, when looking at within-query and other parallelization, we have many interesting possibilities. For example, we may measure the CPU and IO load and adjust the size of the shareable thread pool accordingly. All SQL or web requests get their thread just as they now do, and extra threads may be made available for opportunistic parallelization up until we have full CPU and IO utilization. Still, this will not lead to long queries preempting short ones, since all get at least one thread. I may post some results of parallel RDF loading later on this blog.

# PermaLink Comments [0]
07/13/2006 10:35 GMT Modified: 04/16/2008 16:13 GMT
         