Virtuoso Data Space Bot
Burlington, United States

Virtuoso 7 Release

The quest of OpenLink Software is to bring flexibility, efficiency, and expressive power to people working with data. For the past several years, this has been focused on making graph data models viable for the enterprise. Flexibility in schema evolution is a central aspect of this, as is the ability to share identifiers across different information systems, i.e., giving things URIs instead of synthetic keys that are not interpretable outside of a particular application.

With Virtuoso 7, we dramatically improve the efficiency of all this. With databases in the billions of relations (also known as triples, or 3-tuples), we can fit about 3x as many relations in the same space (disk and RAM) as with Virtuoso 6. Single-threaded query speed is up to 3x better, plus there is intra-query parallelization even in single-server configurations. Graph data workloads are all about random lookups. With these, having data in RAM is all-important. With 3x space efficiency, you can run with 3x more data in the same space before starting to go to disk. In some benchmarks, this can make a 20x gain.

Virtuoso's scale-out support has also been fundamentally reworked, with much more parallelism and better deployment flexibility.

So, for graph data, Virtuoso 7 is a major step in the coming of age of the technology. Data keeps growing and time is getting scarcer, so we need more flexibility and more performance at the same time.

So, let’s talk about how we accomplish this. Column stores have been the trend in relational data warehousing for over a decade. With column stores comes vectored execution, i.e., running any operation on a large number of values at one time. Instead of running one operation on one value, then the next operation on the result, and so forth, you run the first operation on thousands or hundreds-of-thousands of values, then the next one on the results of this, and so on.
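To make the contrast concrete, here is a minimal Python sketch of the two execution styles. It is an illustration only, not Virtuoso's implementation: the function names, the toy operators, and the batch size are all assumptions made for the example.

```python
def tuple_at_a_time(values, ops):
    """Run the whole operator chain on one value before touching the next."""
    out = []
    for v in values:
        for op in ops:
            v = op(v)
        out.append(v)
    return out


def vectored(values, ops, batch_size=10_000):
    """Run each operator over a whole batch before moving to the next operator."""
    out = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        for op in ops:
            # One operator sweeps the entire batch; the per-operator setup
            # cost is paid once per batch rather than once per value.
            batch = [op(v) for v in batch]
        out.extend(batch)
    return out


# Hypothetical two-step pipeline: add one, then double.
ops = [lambda x: x + 1, lambda x: x * 2]
data = list(range(25_000))

result_scalar = tuple_at_a_time(data, ops)
result_vectored = vectored(data, ops)
```

Both functions compute the same result; only the loop nesting differs. In a real engine the inner batch loop is a tight compiled loop over an array, which is where the gain comes from.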

Column-wise storage brings space efficiency, since values in one column of a table tend to be alike -- whether repeating, sorted, within a specific range, or picked from a particular set of possible values. With graph data, where there are no columns as such, the situation is exactly the same -- just substitute the word predicate for column. Space efficiency brings speed -- first by keeping more of the data in memory; secondly by having less data travel between CPU and memory. Vectoring makes sure that data that are closely located get accessed in close temporal proximity, hence improving cache utilization. When there is no locality, there are a lot of operations pending at the same time, as things always get done on a set of values instead of on a single value. This is the crux of the science of columns and vectoring.
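As a small illustration of why a predicate "column" compresses well, the sketch below run-length encodes the predicate values of a few triples after sorting them predicate-first. Run-length encoding is just one common column compression scheme, chosen here for brevity; the text above does not say which encodings Virtuoso applies, and the sample triples are invented.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# Invented sample data: (subject, predicate, object)
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:carol", "foaf:knows", "ex:alice"),
]

# Order the triples predicate-first, as a predicate-leading index would.
by_predicate = sorted(triples, key=lambda t: (t[1], t[0], t[2]))
predicate_column = [p for _, p, _ in by_predicate]

encoded = rle_encode(predicate_column)
```

Five predicate values collapse into two runs here. With billions of triples and a modest number of distinct predicates, the same effect is far more pronounced.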

Of the prior work in column stores, Virtuoso may most resemble Vertica, well described in Daniel Abadi’s famous PhD thesis. Virtuoso itself is described in IEEE Data Engineering Bulletin, March 2012 (PDF). The first experiments in column store technology with Virtuoso were in 2009, published at the SemData workshop at VLDB 2010 in Singapore. We tried storing TPC-H as graph data and in relational tables, each with both rows and columns, and found that we could get a space utilization of 6 bytes per quad with the RDF-ization of TPC-H, as opposed to 27 bytes with the row-wise compressed RDF storage model. The row-wise compression itself is 3x more compact than a row-wise representation with no compression.
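The figures above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the numbers quoted in the text (6 bytes, 27 bytes, and the 3x row compression factor); the uncompressed figure and the one-billion-quad footprint are derived from them, not separately measured.

```python
# Figures quoted in the text
COLUMN_BYTES_PER_QUAD = 6
ROW_COMPRESSED_BYTES_PER_QUAD = 27
ROW_COMPRESSION_FACTOR = 3

# Derived: implied size of an uncompressed row-wise quad
row_uncompressed_bytes = ROW_COMPRESSED_BYTES_PER_QUAD * ROW_COMPRESSION_FACTOR

# Derived: how much more compact the column-wise layout is
column_vs_row_compressed = ROW_COMPRESSED_BYTES_PER_QUAD / COLUMN_BYTES_PER_QUAD
column_vs_row_uncompressed = row_uncompressed_bytes / COLUMN_BYTES_PER_QUAD


def footprint_gb(quads, bytes_per_quad):
    """Approximate storage footprint in gigabytes (10**9 bytes)."""
    return quads * bytes_per_quad / 1e9


one_billion = 1_000_000_000
column_gb = footprint_gb(one_billion, COLUMN_BYTES_PER_QUAD)
row_gb = footprint_gb(one_billion, ROW_COMPRESSED_BYTES_PER_QUAD)
```

So, on this data set, a billion quads that take roughly 27 GB in compressed rows would take roughly 6 GB in columns, which decides whether the working set fits in RAM on a given machine.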

Memory is the key to speed, and space efficiency is the key to memory. Performance comes from two factors: locality and parallelism. Both are addressed by column store technology. This made me a convert.

At this time, we also started the EU FP7 project, LOD2, most specifically working with Peter Boncz of CWI, the king of the column store, famous for MonetDB and VectorWise. This cooperation goes on within LOD2 and has extended to LDBC, an FP7 project for designing benchmarks for graph and RDF databases. Peter has given us a world of valuable insight and experience in all aspects of avant garde database, from adaptive techniques to query optimization and beyond. One recently published result is for a Virtuoso cluster at CWI, running analytics on 150 billion relations on CWI’s SciLens cluster.

The SQL relational table-oriented databases and property graph-oriented databases (Graph for short) are both rooted in relational database science. Graph management simply introduces extra challenges with regard to scalability. Hence, at OpenLink Software, having a good grounding in the best practices of relational columnar (or column-wise) database management technology is vital.

Virtuoso is more prominently known for high-performance RDF-based graph database technology, but the entirety of its SQL relational data management functionality (which is the foundation for graph store) is vectored, and even allows users to choose between row-wise and column-wise physical layouts, index by index.

It has been asked: is this a new NoSQL engine? Well, there isn’t really such a thing. There are of course database engines that do not have SQL support and it has become trendy to call them "NoSQL." So, in this space, Virtuoso is an engine that does support SQL, plus SPARQL, and is designed to do big joins and aggregation (i.e., analytics) and fast bulk load, as well as ACID transactions on small updates, all with column store space efficiency. It is not only for big scans, as people tend to think about column stores, since it can also be used in compact embedded form.

Virtuoso also delivers great parallelism and throughput in a scale-out setting, with no restrictions on transactions and no limits on joining. The base is in relational database science, but all the adaptations that RDF and graph workloads need are built-in, with core level support for run-time data-typing, URIs as native Reference types, user-defined custom data types, etc.

Now that the major milestone of releasing Virtuoso 7 (open source and commercial editions) has been reached, the next steps include enabling our current and future customers to attain increased agility from big (linked) open data exploits. Technically, it will also include continued participation in DBMS industry benchmarks, such as those from the TPC, and others under development via the Linked Data Benchmark Council (LDBC), plus other social-media-oriented challenges that arise in this exciting data access, integration, and management innovation continuum. Thus, continue to expect new optimization tricks to be introduced at frequent intervals through the open source development branch at GitHub, between major commercial releases.

05/13/2013 18:06 GMT-0500 Modified: 08/21/2015 14:17 GMT-0500
Developer Opportunities at OpenLink Software

If it is advanced database technology, you will get to do it with us.

We are looking for exceptional talent to implement some of the hardest stuff in the industry. This ranges from new approaches to query optimization; to parallel execution (both scale up and scale out); to elastic cloud deployments and self-managing, self-tuning, fault-tolerant databases. We are best known in the RDF world, but also have full SQL support, and the present work will serve both use cases equally.

We are best known in the realms of high-performance database connectivity middleware and massively-scalable Linked-Data-oriented graph-model DBMS technology.

We have the basics -- SQL and SPARQL, column store, vectored execution, cost based optimization, parallel execution (local and cluster), and so forth. In short, we have everything you would expect from a DBMS. We do transactions as well as analytics, but the greater challenges at present are on the analytics side.

You will be working with my team covering:

  • Adaptive query optimization -- interleaving execution and optimization, so as to always make the correct plan choices based on actual data characteristics

  • Self-managing cloud deployments for elastic big data -- clusters that can grow themselves and redistribute load, recover from failures, etc.

  • Developing and analyzing new benchmarks for RDF and graph databases

  • Embedding complex geospatial reasoning inside the database engine. We have the basic R-tree and the OGC geometry data types; now we need to go beyond this

  • Every type of SQL optimizer and execution engine trick that serves to optimize for TPC-H and TPC-DS.

What do I mean by really good? It boils down to being a smart and fast programmer. We have over the years talked to people, including many who have worked on DBMS programming, and found that they actually know next to nothing of database science. For example, they might not know what a hash join is. Or they might not know that interprocess latency is in the tens of microseconds even within one box, and that in that time one can do tens of index lookups. Or they might not know that blocking on a mutex kills.
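For readers who have not met it, a hash join can be sketched in a few lines of Python. This is the textbook in-memory version, with invented table and column names; a production engine would partition the data, vectorize the probes, and spill to disk, none of which is shown here.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dict rows using an in-memory hash table."""
    # Build phase: hash the (ideally smaller) input on its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: one hash lookup per row of the other input.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            joined.append({**match, **row})
    return joined


# Invented sample tables
customers = [{"c_id": 1, "name": "Ann"}, {"c_id": 2, "name": "Bo"}]
orders = [
    {"o_id": 10, "c_id": 1},
    {"o_id": 11, "c_id": 1},
    {"o_id": 12, "c_id": 3},  # no matching customer, so it is dropped
]

result = hash_join(customers, orders, "c_id", "c_id")
```

The build side is scanned once and the probe side once, so the cost grows with the sum of the input sizes rather than their product.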

If you do core database work, we want you to know how many CPU cache misses you will have in flight at any point of the algorithm, and how many clocks will be spent waiting for them at what points. Same for distributed execution: The only way a cluster can perform is having max messages with max payload per message in flight at all times.

These are things that can be learned. So I do not necessarily expect that you have in-depth experience of these, especially since most developer jobs are concerned with something else. You may have to unlearn the bad habit of putting interfaces where they do not belong, for example. Or to learn that if there is an interface, then it must pass as much data as possible in one go.
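The point about interfaces passing as much data as possible in one go can be illustrated with a toy example. The `CountingStore` class and its methods are hypothetical; the call counter stands in for whatever fixed cost each crossing of the interface carries (function call, lock, message, network round trip).

```python
class CountingStore:
    """Toy key-value store that counts how often its interface is crossed."""

    def __init__(self, data):
        self._data = data
        self.calls = 0

    def lookup(self, key):
        """One key per call: the fixed per-call cost is paid for every key."""
        self.calls += 1
        return self._data.get(key)

    def lookup_many(self, keys):
        """Many keys per call: the fixed per-call cost is paid once."""
        self.calls += 1
        return [self._data.get(k) for k in keys]


store = CountingStore({i: i * i for i in range(1000)})
keys = list(range(0, 1000, 10))  # 100 keys

one_by_one = [store.lookup(k) for k in keys]
calls_single = store.calls

store.calls = 0
batched = store.lookup_many(keys)
calls_batched = store.calls
```

The answers are identical, but the batched interface is crossed once instead of a hundred times. When each crossing costs tens of microseconds, as with interprocess messages, that difference dominates.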

Talent is the key. You need to be a self-starter with a passion for technology and have competitive drive. These can be found in many guises, so we place very few limits on the rest. If you show you can learn and code fast, we don't necessarily care about academic or career histories. You can be located anywhere in the world, and you can work from home. There may be some travel but not very much.

In the context of EU FP7 projects, we are working with some of the best minds in database, including Peter Boncz of CWI and VU Amsterdam (MonetDB, VectorWise) and Thomas Neumann of Technical University of Munich (RDF3X, HYPER). This is an extra guarantee that you will be working on the most relevant problems in database, informed by the results of the very best work to date.

For more background, please see the IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, Special Issue on Column Store Systems.

All articles and references therein are relevant for the job. Be sure to read the CWI work on run time optimization (ROX), cracking, and recycling. Do not miss the many papers on architecture-conscious, cache-optimized algorithms; see the VectorWise and MonetDB articles in the bulletin for extensive references.

If you are interested in an opportunity with us, we will ask you to do a little exercise in multithreaded, performance-critical coding, to be detailed in a blog post in a few days. If you have done similar work in research or industry, we can substitute the exercise with a suitable sample of this, but only if this is core database code.

There is a dual message: The challenges will be the toughest a very tough race can offer. On the other hand, I do not want to scare you away prematurely. Nobody knows this stuff, except for the handful of people who actually do core database work. So we are not limiting this call to this small crowd and will teach you on the job if you just come with an aptitude to think in algorithms and code fast. Experience has pros and cons so we do not put formal bounds on this. "Just out of high school" may be good enough, if you are otherwise exceptional. Prior work in RDF or semantic web is not a factor. Sponsorship of your M.Sc. or Ph.D. thesis, if the topic is in our line of work and implementation can be done in our environment, is a further possibility. Seasoned pros are also welcome and will know the nature of the gig from the reading list.

We are aiming to fill the position(s) between now and October.

Resumes and inquiries can be sent to Hugh Williams, hwilliams@openlinksw.com. We will contact applicants for interviews.

08/07/2012 13:21 GMT-0500
Virtuoso 6.2 brings New Features!

Virtuoso 6.2 introduces a large number of enhancements to areas including...

  • Linked Data Deployment
  • Linked Data Middleware
  • Data Virtualization
  • Dynamic Data Exchange & Data Replication
  • Security

Linked Data Deployment

Feature | Description | Benefit
Automatic Deployment | Linked Data Pages are now automatically published for every Virtuoso Data Object; users need only load their data into the RDF Quad Store. | Handcrafted URL-Rewrite Rules are no longer necessary.
HTTP Metadata Enhancements | HTTP Link: header is used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents. | Enables HTTP-oriented tools to work with such relationships and other metadata.
HTML Metadata Embedding | HTML resource <head /> and <link /> elements and their @rel attributes are used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents. | Enables HTML-oriented tools to work with such relationships and other metadata.
Hammer Stack Auto-Discovery Patterns | HTML resource <head /> section and <link /> elements, the HTTP Link: header, and XRD-based "host-meta" resources collectively provide structured metadata about Virtuoso hosts, associated Linked Data Spaces, and specific Data Items (Entities). | Enables humans and machines to easily distinguish between Descriptor Resources and their Subjects, irrespective of URI scheme.
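As an illustration of how a user agent might consume the HTTP Link: header mentioned above, here is a simplified Python parser. It is a sketch, not a full implementation of the Web Linking syntax (it does not handle commas inside quoted parameter values, for example). The example URLs and the use of the "describedby" relation are assumptions made for the example, not values taken from a Virtuoso response.

```python
import re

# One link-value: <target> followed by its parameters, up to the next comma.
_LINK = re.compile(r'<([^>]*)>([^,]*)')
# One parameter: ; name="quoted value"  or  ; name=token
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;\s]+))')


def parse_link_header(value):
    """Return a list of (target_uri, {param: value}) pairs."""
    links = []
    for target, params in _LINK.findall(value):
        attrs = {}
        for name, quoted, bare in _PARAM.findall(params):
            attrs[name.lower()] = quoted or bare
        links.append((target, attrs))
    return links


# Hypothetical header value, as a server might send for an entity URI.
header = (
    '<http://example.org/about/alice>; rel="describedby"; type="text/html", '
    '<http://example.org/data/alice.ttl>; rel="alternate"; type="text/turtle"'
)

links = parse_link_header(header)
descriptors = [uri for uri, attrs in links if attrs.get("rel") == "describedby"]
```

A client can then follow the descriptor URI to fetch the description of the subject, without having to guess at URL patterns.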

Linked Data Middleware

Feature | Description | Benefit
New Sponger Cartridges | New cartridges (data access and transformation drivers) for Twitter, Facebook, Amazon, eBay, LinkedIn, and others. | Enable users and user agents to deal with the Sponged data spaces as though they were named graphs in a quad store, or tables in an RDBMS.
New Descriptor Pages | HTML-based descriptor pages are automatically generated. | Descriptor subjects, and the constellation of navigable attribute-and-value pairs that constitute their descriptive representation, are clearly identified.
Automatic Subject Identifier Generation | De-referenceable data object identifiers are automatically created. | Removes tedium and risk of error associated with nuance-laced manual construction of identifiers.
Support for OData, JSON, RDFa | Additional data representation and serialization formats associated with Linked Data. | Increases flexibility and interoperability.

Data Virtualization

Feature | Description | Benefit
Materialized RDF Views | RDF Views over ODBC/JDBC Data Sources can now (optionally) keep the Quad Store in sync with the RDBMS data source. | Enables high-performance Faceted Browsing while remaining sensitive to changes in the RDBMS data sources.
CSV-to-RDF Transformation | Wizard-based generation of RDF Linked Data from CSV files. | Speeds deployment of data which may only exist in CSV form as Linked Data.
Transparent Data Access Binding | SPASQL (SPARQL Query Language integrated into SQL) is usable over ODBC, JDBC, ADO.NET, OLEDB, or XMLA connections. | Enables Desktop Productivity Tools to transparently work with any blend of RDBMS and RDF data sources.

Dynamic Data Exchange & Data Replication

Feature | Description | Benefit
Quad Store to Quad Store Replication | High-fidelity graph-data replication between one or more database instances. | Enables a wide variety of deployment topologies.
Delta Engine | Automated generation of deltas at the named-graph level, matching the transactional replication offered by the Virtuoso SQL engine. | Brings RDF replication on par with SQL replication.
PubSubHubbub Support | Deep integration within Quad Store as an optional mechanism for shipping deltas. | Enables push-based data replication across a variety of topologies.

Security

Feature | Description | Benefit
WebID support at the DBMS core | Use WebID protocol for low-level ACL-based protection of database objects (RDF or Relational) and Web Services. | Enables application of sophisticated security and data access policies to Web Services (e.g., SPARQL endpoint) and actual DBMS objects.
Webfinger | Supports using mailto: and acct: URIs in the context of WebID and other mechanisms, when domain holders have published necessary XRDS resources. | Enables more intuitive identification of people and organizations.
Fingerpoint | Similar to Webfinger but does not require XRDS resources; instead, it works directly with SPARQL endpoints exposed using auto-discovery patterns in the <head /> section of HTML documents. | Enables more intuitive identification of people and organizations.

09/22/2010 17:08 GMT-0500 Modified: 08/21/2015 14:43 GMT-0500
DataSpaces Bulletin: December issue now online!

The highly anticipated December 2008 issue of the DataSpaces Bulletin is now available!

This month's DataSpaces contains material of interest to the Virtuoso developer and UDA user community alike:

  1. Introduction to Virtuoso Universal Server (Cloud Edition).
  2. Links to Virtuoso and Linked Data mailing lists.
  3. UDA license management tips and tricks.
12/09/2008 13:21 GMT-0500 Modified: 12/09/2008 15:06 GMT-0500
IBM Flexes XML Muscle

Here is another article titled "IBM Flexes XML Muscle" that covers the same general theme: IBM's appreciation of Unified Storage.

As indicated in an earlier post: IBM is clearly validating what we have done with Virtuoso (as was the case initially with their Virtual / Federated DBMS initiative à la DB2 Integrator). Here is an excerpt from today's eWeek article supporting this position:

To achieve maximum XML performance, bolstered indexing attributes in the technology will enable advanced search functions and a higher degree of filtering. IBM is also adding support for XPath and XQuery data models. This will allow users to create views that involve SQL and XQuery by sending the protocol through DB2's query optimizer for a unified query plan.

Read on..

Virtuoso has been doing this since 2000; unfortunately a lot of

01/04/2005 12:19 GMT-0500 Modified: 06/22/2006 08:56 GMT-0500
What is the platform?

I came across an interesting piece by Adam Bosworth titled "What is the platform?"

10/05/2004 12:31 GMT-0500 Modified: 06/22/2006 08:56 GMT-0500
Enterprise Databases get a grip on XML

Databases get a grip on XML
From InfoWorld.

The next iteration of the SQL standard was supposed to arrive in 2003. But SQL standardization has always been a glacially slow process, so nobody should be surprised that SQL:2003, now known as SQL:200n, isn't ready yet. Even so, 2003 was a year in which XML-oriented data management, one of the areas addressed by the forthcoming standard, showed up on more and more developers' radar screens.

01/06/2004 18:17 GMT-0500 Modified: 06/22/2006 08:56 GMT-0500
Creating RSS Using SQLX

Here is a practical example of how to create RSS on the fly from SQL data sources leveraging Virtuoso 3.2's SQLX implementation.

This further illuminates the content of my earlier post on this subject.

11/11/2003 18:33 GMT-0500 Modified: 06/22/2006 08:56 GMT-0500
XML Development Hindered by Lack of Conformity to Data Connectivity Standards?

I've just read an

11/11/2003 18:14 GMT-0500 Modified: 06/22/2006 08:56 GMT-0500
Replace and defend -- Contd

Reading the Longhorn SDK docs is a disorienting experience. Everything's familiar but different. Consider these three examples:

[Full story: Replace and defend via Jon's Radio]

"Replace & Defend" is certainly a strategy that would have awakened the entire non-Microsoft developer world during the recent PDC event. I know these events are all about preaching to the choir (Windows-only developers), but as someone who has worked with Microsoft technologies as an ISV since the late 80's, there is something about this event's announcements that leaves me concerned.

Ironically these concerns aren't about the competitive aspects of their technology disruptions, but more along the lines of how

10/31/2003 15:58 GMT-0500 Modified: 06/22/2006 08:56 GMT-0500