Orri Erling

LDBC: A Socio-technical Perspective

(Originally posted to the LDBC blog.)

In recent days, cyberspace has seen some discussion concerning the relationship between the EU FP7 project LDBC (Linked Data Benchmark Council) and socio-technical considerations. It has been suggested that LDBC, to its own and the community's detriment, ignores socio-technical aspects.

LDBC, as research projects go, actually has an unusually large and, as of this early date, successful and thriving socio-technical side, namely the involvement of users and vendors alike. I will here discuss why, as far as the technical output of the project goes, socio-technical metrics are in fact out of scope. Then again, the degree to which the benefits potentially offered by LDBC outcomes are actually realized does depend strongly on community building, which is a social process.

One criticism of big data projects we sometimes encounter is that data without context is not useful. Further, one cannot simply assume that throwing several data sets together will yield meaning, as similar-looking things may carry different semantics; just think of seven different definitions of blood pressure.

In its initial user community meeting, LDBC was, according to its charter, focusing mostly on cases where the data is already in existence and of sufficient quality for the application at hand.

Michael Brodie, Chief Scientist at Verizon, is a well known advocate of focusing on the meaning of data, not only on processing performance. There is a piece on this matter by him, Peter Boncz, Chris Bizer, and myself in the SIGMOD Record: "The Meaningful Use of Big Data: Four Perspectives – Four Challenges".

I had a conversation with Michael at a DERI meeting a couple of years ago about measuring the total cost of technology adoption, including socio-technical aspects such as acceptance by users, the learning curves of various stakeholders, and whether one could in fact demonstrate an overall gain in productivity arising from semantic technologies. [in my words, paraphrased]

"Can one measure the effectiveness of different approaches to data integration?" asked I.

"Of course one can," answered Michael, "this only involves carrying out the same task with two different technologies, two different teams and then doing a double blind test with users. However, this never happens. Nobody does this because doing the task even once in a large organization is enormously costly and nobody will even seriously consider doubling the expense."

LDBC does in fact intend to address technical aspects of data integration, i.e., schema conversion, entity resolution, and the like. The socio-technical aspects of this (whether one should integrate in the first place, whether the integration result adds value, whether it violates privacy or security concerns, whether users will understand the result, what the learning curves are, etc.) are simply too diverse and too domain-dependent for a general-purpose metric to be developed, at least not within the time and budget constraints of the project. Further, adding a large human element to the experimental setting (e.g., how skilled the developers are, how well the stakeholders can explain their needs, how often these needs change, etc.) would lead to experiments that are so expensive to carry out, and whose results have so many unquantifiable factors, that they would constitute an insuperable barrier to adoption.

Experience demonstrates that even agreeing on the relative importance of quantifiable metrics of database performance is hard enough. Overreaching would compromise the project's ability to deliver its core value. Let us next talk about this.

It is only a natural part of the political landscape that the EC's research funding choices are criticized by some members of the public. Some criticism is about the emphasis on big data. Big data is a fact on the ground, and research and industry need to deal with it. Of course, there have been and will be critics of technology in general on moral or philosophical grounds. Instead of opening this topic, I will refer you to an article by Michael Brodie. In a world where big data is a given, lowering the entry threshold for big data applications, thus making them available not only to government agencies and the largest businesses, seems ethical to me, as per Brodie's checklist. LDBC will contribute to this by driving greater availability, better performance, and lower cost for these technologies.

Once we accept that big data is there and is important, we arrive at the issue of deriving actionable meaning from it. A prerequisite of deriving actionable meaning from big data is the ability to flexibly process this data. LDBC is about creating metrics for this. The prerequisites for flexibly working with data are fairly independent of the specific use case, while the criteria of meaning, let alone actionable analysis, are very domain specific. Therefore, in order to provide the greatest service to the broadest constituency, LDBC focuses on measuring that which is most generic, yet will underlie any decision support or other data processing deployment that involves RDF or graph data.

I would say that LDBC is an exceptionally effective use of taxpayer money. LDBC will produce metrics that will drive technological innovation for years to come. The total money spent towards pursuing goals set forth by LDBC is likely to vastly exceed the budget of LDBC. Only think of the person-centuries or even millennia that have gone into optimizing for TPC-C and TPC-H. The vast majority of the money spent for these pursuits is paid by industry, not by research funding. It is spent worldwide, not in Europe alone.

Thus, if LDBC is successful, a limited amount of EC research money will influence how much greater product development budgets are spent in the future. This multiplier effect applies of course to highly successful research outcomes in general but is especially clear with LDBC.

European research funding has played a significant role in creating the foundations of the RDF/Linked Data scene. LDBC is a continuation of this policy; however, the focus has now shifted to reflect the greater maturity of the technology. LDBC is now about making the RDF and graph database sectors into mature industries whose products can predictably tackle the challenges out there.

12/03/2012 16:23 GMT
LDBC - the Linked Data Benchmark Council

(This posting was inadvertently delayed from the time of its writing, 2012-11-21.)

The Linked Data Benchmark Council (LDBC) project is officially starting now.

This represents a serious effort towards making relevant and well-thought-out metrics for RDF and graph databases, and towards defining protocols for measuring and publishing well-documented, reproducible results. It also entails creating a TPC analog for the graph and RDF domains.

The project brings together leading vendors, with OpenLink and Ontotext representing the RDF side, and Neo Technology and Sparsity Technologies representing the graph database side. Peter Boncz, of MonetDB and Vectorwise fame, is the technical director. The Technical University of Munich participates with Thomas Neumann, known for RDF-3X and HyPer. Universitat Politècnica de Catalunya coordinates the project and brings strong academic expertise in graph databases, also representing its Sparsity Technologies spinoff. FORTH (Foundation for Research and Technology - Hellas) of Crete contributes expertise in data integration and provenance. STI Innsbruck participates in community building and outreach.

The consortium has a second-to-none understanding of benchmarking and has sufficient time allotted to the task to produce world-class work, comparable to the TPC benchmarks. Nothing of the sort has to date been realized in the RDF or graph space.

History demonstrates that whenever something sufficiently important starts getting systematically measured, the metric improves. The early days of the TPC saw a 40-fold increase in transaction processing speed. TPC-H continues, after 18 years, to be widely used as a basis for quantifying advances in analytics databases.

A serious initiative for well-thought-out benchmarks for guiding the emerging RDF and graph database markets is nothing short of a necessary precondition for the emergence of a serious market with several vendors offering mutually comparable products.

Benchmarks are only as good as their credibility and adoption. For this reason, LDBC has been in touch with all the graph and RDF vendors we could find, and has received a positive statement of intent from most, indicating that they would participate in an LDBC organization and contribute to shaping benchmarks.

There is further a Technical User Community, with its initial meeting this week, where present-day end users of RDF and graph databases will voice their wishes for benchmark development. Thus benchmarks will be grounded in use cases contributed by real users.

With these elements in place we have every reason to expect relevant benchmarks with broad adoption, with all the benefits this entails.

11/28/2012 18:06 GMT
LDBC Technical User Community Meeting

The LDBC Technical User Community (TUC) had its initial meeting in Barcelona last week.

First we wish to thank the many end user organizations that were present. This clearly validates the project's mission and demonstrates that there is acute awareness of the need for better metrics in the field. In the following, I will summarize the requirements that were brought forth.

  • Scale out - There was near unanimity among users that even if present workloads could be handled on single servers, a scale-out growth path was highly desirable. Some applications, on the other hand, were scale-out based from the get-go. Even when not actually used, a scale-out capability is felt to be insurance against future need.

  • Making limits explicit - How far can this technology go? Benchmarks need to demonstrate at what scales the products being considered work best, and where they will grind to a halt. Also, the impact of scale-out on performance needs to be made clear. The cost of solutions at different scales must be made explicit.

    Many of these requirements will be met by simply following TPC practices. Now, vendors cannot be expected to publish numbers for cases where their products fail, but they do have incentives for publishing numbers on large data, and at least giving a price/performance point that exceeds most user needs.

  • Fault tolerance and operational characteristics - Present-day benchmarks (e.g., the TPC ones) hardly address the operational aspects that most enterprise deployments will encounter. This was already stated by Michael Stonebraker at the first TPC performance-evaluation workshop at VLDB in Lyon some years back. Users want to know the price/performance impact of making systems fault tolerant, and wish to have metrics for things like backup and bulk load under online conditions. A need to operate across multiple geographies was present in more than one use case, requiring a degree of asynchronous replication such as log shipping.

  • Update-intensive workloads - Contrary to what one might think, RDF use is not primarily load-once-plus-lookup. Freshness of data creates value, and databases, even if they are warehouses in character, need to be kept up to date much better than by periodic reload alone. Online updates may be small, as when refreshing news feeds or web crawls, where the unit of update is small but updates are many; but they may also involve replacing reference data sets of hundreds of millions of triples. The latter requirement exceeds what is practical in a single transaction. ACID was generally desired, with some interest also in eventual consistency. We did not get use cases with much repeatable read (e.g., updating account balances), but rather atomic and durable replacement of sets of statements.

  • Inference - Class and property hierarchies were common, followed by use of transitivity. owl:sameAs was not in much use, being considered too dangerous: a single statement can have a huge effect and produce unpredictable sets of properties for instances, for which applications are not prepared. Beyond these, the wishes for inference, with use cases ranging from medicine to forensics, fell outside the OWL domain. These typically involved probability scores adding up the joint occurrence of complex criteria with some numeric computation (e.g., time intervals, geography, etc.).

    As materialization of forward closure is the prevalent mode of implementing inference in RDF, users wished to have a measure of its cost in space and time, especially under online-update loads.

  • Text, XML, and Geospatial - There is no online application that does not have text search. In publishing, this is hardly ever provided by an RDF store, even if there is one in the mix. Even so, there is an understandable desire to consolidate systems, i.e., to not have an XML database for content and a separate RDF database for metadata. Also, many applications have a geospatial element. One wish was to combine XPath/XQuery with SPARQL, and it was implied that query optimization should create good plans under these conditions.

    There was extensive discussion especially on benchmarking full-text. Such a benchmark would need to address the quality of relevance ranking. Doing new work in this space is clearly out of scope for LDBC, but an IR benchmark could be reused as an add-on to provide a quality score. The performance score would come from the LDBC side of the benchmark. Now, many of the applications of text (e.g., news) might not even sort on text match score, but rather by time. Also if the text search is applied to metadata like labels or URI strings, the quality of a match is a non-issue, as there is no document context.

  • Data integration - Almost all applications had some element of data integration. Indeed, if one uses RDF in the first place, the motivation usually has to do with schema flexibility. Having a relational schema for everything is often seen as too hard to maintain and as leading to too much development time before an initial version of an application or an answer to a business question. Data integration is everywhere but stays elusive for benchmarking: every time it is different, and most vendors present do not offer products for this specific need. Many ideas were presented, including using SPARQL for entity resolution and for checking the consistency of an integration result.

A central issue of benchmark design is having an understandable metric. People cannot make sense of more than a few figures. The TPC practice of throughput at scale and price per unit of throughput at scale is a successful example. However, it may be difficult to agree on relative weights of components if a metric is an aggregate of too many things. Also, if a benchmark has too many optional parts, metrics easily become too complicated. On the other hand, requiring too many features (e.g. XML, full text, geospatial) restricts the number of possible participants.

To stimulate innovation, a benchmark needs to be difficult but restricted to a specific domain. TPC-H is a good example, favoring specialized systems built for analytics alone. To be a predictor of total cost and performance in a complex application, a benchmark must include much more functionality, and will favor general purpose systems that do many things but are not necessarily outstanding in any single aspect.

After 1-1/2 days with users, the project team met to discuss actual benchmark task forces to be started. The conclusion was that work would initially proceed around two use cases: publishing, and social networks. The present use of RDF by the BBC and the Press Association provides the background scenario for the publishing benchmark, and the work carried out around the Social Intelligence Benchmark (SIB) in LOD2 will provide a starting point for the social network benchmark. Additionally, user scenarios from the DEX graph database user base will help shape the SN workload.

A data integration task force needs more clarification, but work in this direction is in progress.

In practice, driving progress needs well-focused benchmarks with special trick questions intended to stress specific aspects of a database engine. Providing an overall perspective on cost and online operations needs a broad mix of features to be covered.

These needs will be reconciled by having many metrics inside a single use case, i.e., a social network data set can be used for transactional updates, for lookup queries, for graph analytics, and for TPC-H style business intelligence questions, especially if integrated with another more-relational dataset. Thus there will be a mix of metrics, from transactions to analytics, with single and multiuser workloads. Whether these are packaged as separate benchmarks, or as optional sections of one, remains to be seen.
11/27/2012 23:17 GMT
Developer Recruitment Exercise

The specification of the exercise referred to in the previous post may be found below.

Questions on the exercise can be sent to the email specified in the previous post. I may schedule a phone call to answer questions based on the initial email contact.

We seek to have all applicants complete the exercise before October 1.

General

The exercise consists of implementing a part of the TPC-C workload in memory, in C or C++. TPC-C is the long-time industry standard benchmark for transaction processing performance. We use this as a starting point for an exercise for assessing developer skill level in writing heavily multithreaded, performance-critical code.

The application performs a series of transactions against an in-memory database, encountering lock contention and occasional deadlocks. The application needs to provide atomicity, consistency, and isolation for transactions. The task consists of writing the low-level data structures for storing the memory-resident database and for managing concurrency, including lock queueing, deadlock detection, and commit/rollback. Solutions are evaluated based on their actual measured multithreaded performance on commodity servers, e.g., 8- or 12-core Intel Xeons.

OpenLink provides the code for data generation and driving the test. This is part of the TPC-C kit in Virtuoso Open Source. The task is to replace the SQL API calls with equivalent in-process function calls against the in-memory database developed as part of the exercise.

Rules

We are aware that the best solution to the problem may be running transactions single-threaded against in-memory hash tables without any concurrency control. The application data may be partitioned so that a single transaction can in most cases be assigned to a partition, which it then has to itself for the few microseconds it takes to do its job. For this exercise, this solution is explicitly ruled out. The application must demonstrate shared access to data, with a transaction holding multiple concurrent locks and being liable to deadlock.

TPC-C can be written so as to avoid deadlocks by always locking in a certain order. This is also expressly prohibited; specifically, the stock rows of a new order transaction must be locked in the order they are specified in the invocation. In application terms this makes no sense, but for purposes of the exercise it serves as a natural source of deadlocks.
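
To make the rule concrete, here is a minimal C++ sketch of locking stock rows strictly in the order given by the caller. All names are illustrative and the per-row std::mutex map is only a stand-in for the lock manager the exercise asks for, which must queue waiters and detect deadlocks.

#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

// Toy stand-in for a row lock manager: one mutex per stock row, keyed by
// (warehouse, item).  std::mutex only illustrates lock ORDER here; it cannot
// detect the deadlocks the exercise requires you to detect and recover from.
static std::map<uint64_t, std::mutex> stock_locks;
static std::mutex map_mtx;                       // protects the map itself, not the rows

static std::mutex &stock_lock(int w_id, int i_id)
{
    std::lock_guard<std::mutex> g(map_mtx);
    return stock_locks[((uint64_t)(uint32_t)w_id << 32) | (uint32_t)i_id];
}

// Lock the stock rows in the order they appear in the invocation (item IDs
// assumed distinct).  Two threads passing the same items in opposite orders
// can deadlock here, which is exactly what the exercise wants to provoke.
// Sorting ol_i_ids first would avoid the deadlocks and is forbidden.
void lock_stock_in_given_order(int w_id, const std::vector<int> &ol_i_ids)
{
    for (int i_id : ol_i_ids)
        stock_lock(w_id, i_id).lock();
    /* ... perform the new order work ... */
    for (int i_id : ol_i_ids)
        stock_lock(w_id, i_id).unlock();
}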

Parameters

The application needs to offer an interactive or scripted interface (command line is OK) which provides the following operations:

  • Clear and initialize a database of n warehouses.

  • Run n threads, each doing m new order transactions. Each thread has a home warehouse and occasionally accesses other warehouses' data. Each run reports the real time elapsed and the number of retries arising from deadlocks.

  • Check the consistency between the stock, orders, and order_line data structures.

  • Report system status such as clocks spent waiting for specific mutexes. This is supplied as part of the OpenLink library used by the data generator.

Data Structures

The transactions are written as C functions. The data is represented as C structs, and tree indices or hash tables are used for value-based access to the structures by key. The application has no persistent storage. The structures reference each other by the key values as in the database, so no direct pointers. The key values are to be translated into pointers with a hash table or other index-like structure.
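
A minimal sketch of this layout, with abbreviated field lists and illustrative names (the real tables follow the TPC-C schema): rows are plain structs, rows refer to each other by key values, and a hash index translates a packed key into a row pointer.

#include <cstdint>
#include <unordered_map>

// Abbreviated rows; keys and money are integers, per the rules above.
struct District {
    int32_t d_w_id;
    int32_t d_id;
    int64_t d_next_o_id;    // source of new O_ID values
    int64_t d_ytd;
};

struct Order {
    int32_t o_w_id;         // refers to the warehouse/district by key, not by pointer
    int32_t o_d_id;
    int64_t o_id;
    int32_t o_c_id;
    int32_t o_ol_cnt;
};

// Composite key packed into one integer for hashing.
static inline uint64_t dist_key(int32_t w_id, int32_t d_id)
{
    return ((uint64_t)(uint32_t)w_id << 16) | (uint32_t)d_id;
}

// Key-to-pointer index.  In the submission this must be thread safe and
// partitioned (see Points to Note); a bare unordered_map only shows the shape.
static std::unordered_map<uint64_t, District *> district_index;

District *find_district(int32_t w_id, int32_t d_id)
{
    auto it = district_index.find(dist_key(w_id, d_id));
    return it == district_index.end() ? nullptr : it->second;
}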

The application must be thread-safe, and transactions must be able to roll back. Transactions will sometimes wait for each other in updating shared resources such as stock or district or warehouse balances. The application must be written so as to implement fine-grained locking, and each transaction must be able to hold multiple locks. The application must be able to detect deadlocks. For deadlock recovery, it is acceptable to abort the transaction that detects the deadlock.

C++ template libraries may be used but one must pay attention to their efficiency.

The new order transaction is the only required transaction.

All numbers can be represented as integers. This holds for key columns as well as for monetary amounts.

All index structures (e.g., hash tables) in the application must be thread safe, so that an insert is safe under concurrent access or concurrent inserts. This also holds for index structures of tables which do not receive inserts in the test (e.g., item, customer, stock).

A sequence object must not be used for assigning new values to the O_ID column of ORDERS. These values must come from the D_NEXT_O_ID column of the DISTRICT table. If a new order transaction rolls back, its update of D_NEXT_O_ID is also rolled back. This causes O_ID values to always be consecutive within a district.
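
A sketch of how the O_ID assignment and its rollback might look, assuming the transaction already holds the district's row lock and keeps it until commit or rollback, and keeps a simple undo log. All names are illustrative.

#include <cstdint>
#include <functional>
#include <vector>

struct District {
    int64_t d_next_o_id = 3001;    // after initial population, per TPC-C
    /* ... other columns ... */
};

// A transaction records undo actions; rollback replays them in reverse.
struct Txn {
    std::vector<std::function<void()>> undo;
    void rollback()
    {
        for (auto it = undo.rbegin(); it != undo.rend(); ++it)
            (*it)();
        undo.clear();
    }
};

// Caller must hold the district row lock until commit/rollback, so no other
// transaction can increment D_NEXT_O_ID in between; otherwise the undo entry
// below could clobber a later value.
int64_t assign_o_id(District &d, Txn &txn)
{
    int64_t o_id = d.d_next_o_id;           // O_ID comes from the district row,
    d.d_next_o_id = o_id + 1;               // never from a global sequence
    txn.undo.push_back([&d, o_id] { d.d_next_o_id = o_id; });
    return o_id;
}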

TPC-C Functionality

The application must implement the TPC-C new order transaction in full. This must not avoid deadlocks by ordering locking on stock rows. See the rules section.

The transaction must have the semantics specified in TPC-C, except for durability.

Supporting Files

The test driver calling the transaction procedures is in tpccodbc.c. This can be reused so as to call the transaction procedure in process instead of the ODBC exec.

The user interface may be a command line menu with run options for different numbers of transactions with different thread counts and an option for integrity check.

The integrity check consists of verifying S_ORDER_CNT against the orders and checking that max(O_ID) and D_NEXT_O_ID match within each district.
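
A sketch of the district/order part of that check, following the standard TPC-C consistency condition that D_NEXT_O_ID - 1 equals max(O_ID) within every district. The maps here are placeholders for however the submission actually stores its data.

#include <cstdint>
#include <map>
#include <utility>

using DistKey = std::pair<int32_t, int32_t>;     // (w_id, d_id)

// d_next: D_NEXT_O_ID per district; max_oid: highest O_ID found among that
// district's orders (0 if it has none, which should not happen after load).
bool check_o_id_consistency(const std::map<DistKey, int64_t> &d_next,
                            const std::map<DistKey, int64_t> &max_oid)
{
    bool ok = true;
    for (const auto &d : d_next) {
        auto it = max_oid.find(d.first);
        int64_t max_id = (it == max_oid.end()) ? 0 : it->second;
        if (d.second != max_id + 1)
            ok = false;                          // report d.first in a real check
    }
    return ok;
}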

Running the application should produce statistics such as CPU%, cumulative time spent waiting for locks, etc. The rdtsc instruction can be used to get clock counts for timing.
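
For example, a wait counter could be fed with rdtsc deltas along these lines (gcc/clang on x86; names are illustrative, and cycle counts are per-core and unserialized, so treat them as rough statistics):

#include <atomic>
#include <cstdint>
#include <x86intrin.h>          // __rdtsc

static std::atomic<uint64_t> lock_wait_cycles{0};   // one such counter per statistic

// Wrap any blocking call, e.g. time_wait([&]{ mtx.lock(); });
template <typename F>
void time_wait(F &&blocking_call)
{
    uint64_t t0 = __rdtsc();
    blocking_call();
    lock_wait_cycles.fetch_add(__rdtsc() - t0, std::memory_order_relaxed);
}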

Points to Note

This section summarizes some of the design patterns and coding tricks we expect to see in a solution to the exercise. These may seem self-evident to some, but experience indicates that this is not universally so.

  • The TPC-C transaction profile for new order specifies the semantics of the operation. The order of locking is left to the implementation as long as the semantics are preserved. The application will be tested with many clients on the same warehouse, running as fast as they can, so lock contention is expected. Therefore, the transaction should be written so as to acquire the most contended locks as late as possible. No locks need be acquired for the item table, since none of the transactions update it.

  • For implementing locks, using a mutex to serialize access to application resources is not enough. Many locks will be acquired by each transaction, in an unpredictable order. Unless explicit queueing for locks is implemented with deadlock detection, the application will not work.

  • If waiting for a mutex causes the operating system to stop a thread, even when there are cores free, the latency is multiple microseconds, even if the mutex is released by its owner on the next cycle after the waiting thread is suspended. This will destroy any benefit from parallelism unless one is very careful. Programmers do not seem to instinctively know this.

Therefore any structure to which access must be serialized (e.g. hash tables, locks, etc.) needs to be protected by a mutex but must be partitioned so that there are tens or hundreds of mutexes depending on which section of the structure one is accessing.

Submissions that protect a hash table or other index-like structure for a whole application table with a single mutex or rw lock will be discarded off the bat.

Even while using many mutexes, one must hold them for a minimum of time. When accessing a hash table, do the invariant parts first; acquire the mutex after that. For example, if you calculate the hash number after acquiring the mutex for the hash table, the submission will be rejected.
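
A minimal sketch of such a striped index (stripe count and hash function are illustrative): the hash is computed before any mutex is touched, the hash selects one of many stripes, and only that stripe's mutex is held, briefly, for the actual lookup or insert.

#include <cstdint>
#include <mutex>
#include <unordered_map>

template <typename V, size_t STRIPES = 64>
class StripedIndex {
    struct Stripe {
        std::mutex mtx;
        std::unordered_map<uint64_t, V> map;
    };
    Stripe stripes_[STRIPES];

    static uint64_t hash(uint64_t key)            // invariant work, no lock needed
    {
        key ^= key >> 33; key *= 0xff51afd7ed558ccdULL; key ^= key >> 33;
        return key;
    }

public:
    void insert(uint64_t key, const V &val)
    {
        uint64_t h = hash(key);                   // 1. hash first
        Stripe &s = stripes_[h % STRIPES];        // 2. pick the stripe
        std::lock_guard<std::mutex> g(s.mtx);     // 3. lock only that stripe, briefly
        s.map.emplace(key, val);
    }

    bool find(uint64_t key, V &out)
    {
        uint64_t h = hash(key);
        Stripe &s = stripes_[h % STRIPES];
        std::lock_guard<std::mutex> g(s.mtx);
        auto it = s.map.find(key);
        if (it == s.map.end()) return false;
        out = it->second;
        return true;
    }
};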

The TPC-C application has some local and some scattered access. Orders are local, and stock and item lines are scattered. When doing scattered memory accesses, the program should be written so that the CPU will, from a single thread, have multiple concurrent cache misses in flight at all times. So, when accessing 10 stock lines, calculate the hash numbers first; then access the memory, deferring any branches based on the accessed values. In this way, out of order execution will miss the CPU cache for many independent addresses in parallel. One can use the gcc __builtin_prefetch primitive, or simply write the program so as to have mutually data-independent memory accesses in close proximity.
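
A sketch of that two-phase access pattern for the stock lines of a new order (lookup_bucket stands in for whatever index resolves a precomputed hash to a row; names are illustrative): phase one only computes addresses and issues prefetches, phase two does the dependent work and branching.

#include <cstddef>
#include <cstdint>

struct StockRow { int64_t s_quantity; /* ... */ };

// Placeholder for the index lookup that turns a precomputed hash number into
// a row address without touching the row itself.
StockRow *lookup_bucket(uint64_t hash_no);

void update_stock_quantities(const uint64_t *hashes, const int *ol_qty,
                             StockRow **rows, size_t n)
{
    for (size_t i = 0; i < n; i++) {          // phase 1: addresses and prefetches only
        rows[i] = lookup_bucket(hashes[i]);
        __builtin_prefetch(rows[i]);          // start the cache miss, do not wait on it
    }
    for (size_t i = 0; i < n; i++) {          // phase 2: dependent reads and branches
        int64_t q = rows[i]->s_quantity - ol_qty[i];
        rows[i]->s_quantity = (q >= 10) ? q : q + 91;   // TPC-C stock decrement rule
    }
}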

For detecting deadlocks, a global transaction wait graph may have to be maintained. This will need to be maintained in a serialized manner. If many threads access this, the accesses must be serialized on a global mutex. This may be very bad if the deadlock detection takes a long time. Alternately, the wait graph may be maintained on another thread. The thread will get notices of waits and transacts from worker threads with some delay. Having spotted a cycle, it may kill one or another party. This will require some inter-thread communication. The submission may address this matter in any number of ways.

However, just acquiring a lock without waiting must not involve taking a global mutex. Going to wait will have to, if only for queueing a notice to a monitor thread. Using a socket-to-self might appear to circumvent this, but the communication stack has mutexes inside, so this is no better.
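
One possible shape for such a wait graph, assuming exclusive locks only, so that each blocked transaction waits for exactly one holder. Transaction IDs and the overall structure are illustrative, and the uncontended lock path never calls into this.

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Global wait-for graph, touched only when a transaction is about to block.
class WaitGraph {
    std::mutex mtx_;
    std::unordered_map<uint64_t, uint64_t> waits_for_;   // waiter -> holder

public:
    // Returns false if adding the edge would close a cycle; in that case the
    // caller (the transaction that detected the deadlock) should abort.
    bool add_wait(uint64_t waiter, uint64_t holder)
    {
        std::lock_guard<std::mutex> g(mtx_);
        uint64_t t = holder;
        for (size_t hops = 0; hops < waits_for_.size() + 1; hops++) {
            if (t == waiter)
                return false;                  // following the chain reached us: deadlock
            auto it = waits_for_.find(t);
            if (it == waits_for_.end())
                break;                         // chain ends, no cycle
            t = it->second;
        }
        waits_for_[waiter] = holder;
        return true;
    }

    void remove_wait(uint64_t waiter)          // called when the lock is granted
    {
        std::lock_guard<std::mutex> g(mtx_);
        waits_for_.erase(waiter);
    }
};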

Evaluation Criteria

The exercise will be evaluated based on the run time performance, especially multicore scalability of the result.

Extra points are not given for implementing interfaces or for being object oriented. Interfaces, templates, and objects are not forbidden as such, but their cost must not exceed the difference between getting an address from a virtual table and calling a function directly.

The locking implementation must be correct. It can be limited to exclusive locks and need not support isolation other than repeatable read. Running the application must demonstrate deadlocks and working recovery from these.

Code and Libraries To Be Used

The TPC-C data generator and test driver are in the Virtuoso Open Source distribution, in the files binsrc/tests/tpcc*.c and files included from these. You can do the exercise in the same directory and just alter the files or the make script. The application is standalone and has no other relation to the Virtuoso code. The libsrc/Thread threading wrappers may be used. If not using these, make a wrapper similar to mutex_enter when MTX_METER is defined, so that it counts the waits and the clocks spent waiting. Also provide a report, like that of mutex_stat(), for mutex wait frequency and duration.
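
If rolling your own, a metered mutex in that spirit might look like the following sketch; the names and report format are placeholders, not the Virtuoso API. Counters are only updated while the mutex is held, and the uncontended fast path costs just a try_lock.

#include <cstdint>
#include <cstdio>
#include <mutex>
#include <x86intrin.h>    // __rdtsc, gcc/clang on x86

class MeteredMutex {
    std::mutex mtx_;
    uint64_t   waits_ = 0;         // contended acquisitions
    uint64_t   wait_cycles_ = 0;   // rdtsc cycles spent blocked
    const char *name_;

public:
    explicit MeteredMutex(const char *name) : name_(name) {}

    void enter()
    {
        if (mtx_.try_lock())
            return;                          // uncontended fast path, nothing counted
        uint64_t t0 = __rdtsc();
        mtx_.lock();                         // contended: block and measure
        wait_cycles_ += __rdtsc() - t0;      // counters protected by the mutex itself
        waits_ += 1;
    }

    void leave() { mtx_.unlock(); }

    void report() const                      // crude analogue of mutex_stat();
    {                                        // reads without locking, statistics only
        std::printf("%-20s waits=%llu cycles=%llu\n", name_,
                    (unsigned long long)waits_, (unsigned long long)wait_cycles_);
    }
};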

08/16/2012 15:26 GMT
Developer Opportunities at OpenLink Software

If it is advanced database technology, you will get to do it with us.

We are looking for exceptional talent to implement some of the hardest stuff in the industry. This ranges from new approaches to query optimization; to parallel execution (both scale up and scale out); to elastic cloud deployments and self-managing, self-tuning, fault-tolerant databases. We are most familiar to the RDF world, but also have full SQL support, and the present work will serve both use cases equally.

We are best known in the realms of high-performance database connectivity middleware and massively-scalable Linked-Data-oriented graph-model DBMS technology.

We have the basics -- SQL and SPARQL, column store, vectored execution, cost based optimization, parallel execution (local and cluster), and so forth. In short, we have everything you would expect from a DBMS. We do transactions as well as analytics, but the greater challenges at present are on the analytics side.

You will be working with my team covering:

  • Adaptive query optimization -- interleaving execution and optimization, so as to always make the correct plan choices based on actual data characteristics

  • Self-managing cloud deployments for elastic big data -- clusters that can grow themselves and redistribute load, recover from failures, etc.

  • Developing and analyzing new benchmarks for RDF and graph databases

  • Embedding complex geospatial reasoning inside the database engine. We have the basic R-tree and the OGC geometry data types; now we need to go beyond this

  • Every type of SQL optimizer and execution engine trick that serves to optimize for TPC-H and TPC-DS.

What do I mean by really good? It boils down to being a smart and fast programmer. We have over the years talked to people, including many who have worked on DBMS programming, and found that they actually know next to nothing of database science. For example, they might not know what a hash join is. Or they might not know that interprocess latency is in the tens of microseconds even within one box, and that in that time one can do tens of index lookups. Or they might not know that blocking on a mutex kills.

If you do core database work, we want you to know how many CPU cache misses you will have in flight at any point of the algorithm, and how many clocks will be spent waiting for them at what points. Same for distributed execution: The only way a cluster can perform is having max messages with max payload per message in flight at all times.

These are things that can be learned. So I do not necessarily expect that you have in-depth experience of these, especially since most developer jobs are concerned with something else. You may have to unlearn the bad habit of putting interfaces where they do not belong, for example. Or to learn that if there is an interface, then it must pass as much data as possible in one go.

Talent is the key. You need to be a self-starter with a passion for technology and have competitive drive. These can be found in many guises, so we place very few limits on the rest. If you show you can learn and code fast, we don't necessarily care about academic or career histories. You can be located anywhere in the world, and you can work from home. There may be some travel but not very much.

In the context of EU FP7 projects, we are working with some of the best minds in database, including Peter Boncz of CWI and VU Amsterdam (MonetDB, VectorWise) and Thomas Neumann of the Technical University of Munich (RDF-3X, HyPer). This is an extra guarantee that you will be working on the most relevant problems in database, informed by the results of the very best work to date.

For more background, please see the IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, Special Issue on Column Store Systems.

All articles and references therein are relevant for the job. Be sure to read the CWI work on run time optimization (ROX), cracking, and recycling. Do not miss the many papers on architecture-conscious, cache-optimized algorithms; see the VectorWise and MonetDB articles in the bulletin for extensive references.

If you are interested in an opportunity with us, we will ask you to do a little exercise in multithreaded, performance-critical coding, to be detailed in a blog post in a few days. If you have done similar work in research or industry, we can substitute the exercise with a suitable sample of this, but only if this is core database code.

There is a dual message: The challenges will be the toughest a very tough race can offer. On the other hand, I do not want to scare you away prematurely. Nobody knows this stuff, except for the handful of people who actually do core database work. So we are not limiting this call to this small crowd and will teach you on the job if you just come with an aptitude to think in algorithms and code fast. Experience has pros and cons so we do not put formal bounds on this. "Just out of high school" may be good enough, if you are otherwise exceptional. Prior work in RDF or semantic web is not a factor. Sponsorship of your M.Sc. or Ph.D. thesis, if the topic is in our line of work and implementation can be done in our environment, is a further possibility. Seasoned pros are also welcome and will know the nature of the gig from the reading list.

We are aiming to fill the position(s) between now and October.

Resumes and inquiries can be sent to Hugh Williams, hwilliams@openlinksw.com. We will contact applicants for interviews.

08/07/2012 13:21 GMT
IEEE publication of "Virtuoso, a Hybrid RDBMS/Graph Column Store"

My article, Virtuoso, a Hybrid RDBMS/Graph Column Store (PDF), can be found in Volume 35, Number 1, March 2012 (PDF) of the Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (also known as the IEEE Data Engineering Bulletin).

Abstract:

We discuss applying column store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics in the context of the OpenLink Virtuoso DBMS. In so doing, we need to obtain the excellent memory efficiency, locality and bulk read throughput that are the hallmark of column stores while retaining low-latency random reads and updates, under serializable isolation.

DBLP BibTeX Record 'journals/debu/Erling12' (XML)

@article{DBLP:journals/debu/Erling12,
  author    = {Orri Erling},
  title     = {Virtuoso, a Hybrid RDBMS/Graph Column Store},
  journal   = {IEEE Data Eng. Bull.},
  volume    = {35},
  number    = {1},
  year      = {2012},
  pages     = {3-8},
  ee        = {http://sites.computer.org/debull/A12mar/vicol.pdf},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

 
04/23/2012 10:55 GMT
ICDE 2012 (post 6 of 6) - Science Data Panel

Michael Stonebraker chaired a panel on the future of science data at ICDE 2012 last week. Other participants were Jeremy Kepner from MIT Lincoln Labs, Anastasia Ailamaki from EPFL, and Alex Szalay from Johns Hopkins University.

This is the thrust of what was said, noted from memory. My comments follow after the synopsis.

Jeremy Kepner: When Java was new, we saw it as the coming thing and figured that in HPC we should find space for it. When MapReduce and Hadoop came along, we saw them as a sea change in parallel programming models. This was so simple that literally anybody could write parallel algorithms, which was not the case with MPI. Even parallel distributed arrays are harder. So MapReduce was a game changer, together with the cloud, where anybody can get a cluster. Hardly a week passes without me having to explain to somebody in government what MapReduce and Hadoop are about.

We have a lot of arrays and a custom database for them. But the arrays are sparse so this is in fact a triple store. Our users like to work in MATLAB, and any data management must run together with that.

Of course, MapReduce is not a real scheduler, and Hadoop is not a real file system. For deployment, we must integrate real schedulers and make HDFS look like a file system to applications. The abstraction of a file system is something people like. Being able to skip the time-consuming data-ingestion step of a database is an advantage of file-based paradigms like Hadoop. If this is enhanced with the right scheduling features, it can be a good component in the HPC toolbox.

Michael Stonebraker: Users of the data use math packages like R, MATLAB, SAS, SPSS, or similar. If business intelligence is about AVG, MIN, MAX, COUNT, and GROUP BY, science applications are much more diverse in their analytics. All science algorithms have an inner loop that resembles linear algebra operations like matrix multiplication. Data is more often than not a large array. There are some graphs in biology and chemistry, but the world is primarily rectangular. Relational databases can emulate sparse arrays but are 20x slower than a custom-made array database for dense arrays. And I will not finish without picking on MapReduce: I know of 2000-node MapReduce clusters. The work they do is maybe that of a 100-node parallel database. So if 2000 nodes is what you want to operate, be my guest.

Science database is a zero billion dollar business. We do not expect to make money from the science market with SciDB, which by now works and has commercial services supplied by Paradigm 4, while the code itself is open source, which is a must for the science community. The real business opportunity is in the analytics needed by insurance and financial services in general, which are next to identical with the science use cases SciDB tackles. This makes the vendors pay attention.

Alex Szalay: The way astronomy is done today is through surveys: a telescope scans the sky and produces data. We have now operated the Sloan Sky Survey for 10 years and kept the data online. We have all the data, and complete query logs, available for anyone interested. When we set out to do this with Jim Gray, everybody found it a crazy idea, but it has worked out.

Anastasia Ailamaki: We do not use SciDB. We find a lot of spatial use cases. Researchers need access to simulation results which are usually over a spatial model, as in earthquake and brain simulations. Off-the-shelf techniques like R-trees do not work -- the objects overlap too much -- so we have made our own spatial indexing. We make custom software when it is necessary, and are not tied to vendors. In geospatial applications, we create meshes of different shapes -- tetrahedra or cubes for earthquakes, cylinders for the brain -- and index these in a spatial index. Since an R-tree is inefficient when objects overlap this much, we just find one object; then, because there is reachability from an object to its neighbors, we use this to get all the objects in the area of interest.

* * *

This is obviously a diverse field. Probably the message that we can synthesize out of this is that flexibility and parallel programming models are what we need to pay attention to. There is a need to go beyond what one can do in SQL while continuing to stay close to the data. Also, allowing for plug-in data types and index structures may be useful; we sometimes get requests for such anyway.

The continuing argument around MapReduce and Hadoop is a lasting feature of the landscape. A parallel DB will beat MapReduce any day at joining across partitions; the problem is to overcome the mindset that sees Hadoop as the always-first answer to anything parallel. People will likely have to fail with this before they do anything else. For us, the matter is about having database-resident logic for extract-transform-load (ETL) that can do data-integration type-transformations and maybe iterative graph algorithms that constantly join across partitions, better than a MapReduce job, while still allowing application logic to be written in Java. Teaching sem-web-heads to write SQL procedures and to know about join order, join type, and partition locality, has proven to be difficult. People do not understand latency, whether in client-server or cluster settings. This is why they do not see the point of stored procedures or of shipping functions to data. This sounds like a terrible indictment, like saying that people do not understand why rivers flow downhill. Yet, it is true. This is also why MapReduce is maybe the only parallel programming paradigm that can be successfully deployed in the absence of this understanding, since it is actually quite latency-tolerant, not having any synchronous cross-partition operations except for the succession of the map and reduce steps themselves.

Maybe it is so that the database guys see MapReduce as an insult to their intelligence and the rest of the world sees it as the only understandable way of running grep and sed (Unix commands for string search/replace) in parallel, with the super bonus of letting you reshuffle the outputs so that you can compare everything to everything else, which grep alone never let you do.

* * *

Making a database that does not need data loading seems a nice idea, and CWI has actually done something in this direction in "Here are my Data Files. Here are my Queries. Where are my Results?" However, there is another product called Algebra Data that claims to take in data without loading and to optimize storage based on access. We do not have immediate plans in this direction. Bulk load is already quite fast (100 GB of TPC-H loads in 70 minutes or so), but faster is always possible.

04/17/2012 15:36 GMT Modified: 04/19/2012 16:44 GMT
ICDE 2012 (post 5 of 6) - Graphs

There were quite a few talks about graphs at ICDE 2012. Neither the representation of graphs, nor the differences between RDF and generic graph models, entered much into the discussion. On the other hand, graph similarity search and related topics were addressed a fair bit.

Graph DB and RDF/Linked Data are distinct, if neighboring, disciplines. On the one hand, graph problems predate Linked Data, and the RDF/Linked Data world is a web artifact, which graphs as such are not, so a slightly different cultural derivation also keeps these apart. Besides, graphs may imply schema-first, whereas Linked Data basically cannot. Another differentiation might be derived from edges not really being first-class citizens in RDF, except through reification, for which the RDF reification vocabulary is miserably inadequate, as pointed out before.

RDF is being driven by the web-style publishing of Linked Open Data (LOD), with some standardization and uptake by publishers; Graph DB is not standardized but driven by diverse graph-analytics use cases.

There is no necessary reason why these could not converge, but it will be indefinitely long before any standards come to cover this, so best not hold one's breath. Communities are jealous of their borders, so if the neighbor does something similar one tends to emphasize the differences and not the commonalities.

So for some things, one could warehouse the original RDF of the web microformats and LOD, and then ETL into some other graph model for specific tasks, or just do these in RDF. Of course, then RDF systems need to offer suitable capabilities. These seem to be about very fast edge traversal within a rather local working set, and about accommodating large, iteratively-updated intermediate results, e.g., edge weights.

Judging by the benchmarks paper at the GDM workshop (Benchmarking traversal operations over graph databases (Slidedeck (ppt), paper (pdf)); Marek Ciglan, Alex Averbuch, and Ladislav Hluchy), the state of benchmarking in graph databases is even worse than in RDF, where the state is bad enough. The paper's premise was flawed from the start, using application logic to do joins instead of doing them in the DBMS. In this way, latency comes to dominate, and only the most blatant differences are seen. There is nothing like this style of benchmarking to make an industry look bad. The supercomputer Graph 500 benchmark, on the other hand, lets the contestants make their own implementations on a diversity of architectures, with random traversal as well as loading and generating large intermediate results. It is somewhat limited, but still broader than the graph database benchmarks paper at the GDM workshop.

Returning to graphs, there were some papers on similarity search and clique detection. As players in this space, beyond just RDF, we might as well consider implementing necessary features for efficient expression of such problems. The algorithms discussed were expressed in procedural code against memory-based data structures; there is usually no query language or parallel/distributed processing involved.

MapReduce has become the default way in which people would tackle such problems at scale; in fact, people do not consider anything else, as far as I can tell. Well, they certainly do not consider MPI for example as a first choice. The parallel array things in Fortran do not at first sight seem very graphy, so this is likely not something that crosses one's mind either.

We should try some of the similarity search and clustering in SQL with a parallel programming model. We have excellent expression-evaluation speed from vectoring and unrestricted recursion between partitions, and no file system latencies like MapReduce. The initial test case will be some of the linking/data-integration/mapping workloads in LOD2.

Having some sort-of-agreed-upon benchmark for these workloads would make this more worthwhile. Again, we will see what emerges.

04/17/2012 15:35 GMT Modified: 04/19/2012 16:43 GMT
ICDE 2012 (post 4 of 6) - Graph Data Management Workshop

I gave an invited talk ("Virtuoso 7 - Column Store and Adaptive Techniques for Graph" (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012.

Bryan Thompson of Systap (Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable length column in only the O but could of course have one in also S and even in P and G. The column store would still compress the same as long as reified values did not occur. These values on the other hand would be unlikely to compress very well but run length and dictionary would always work.

So, we could do it like Bigdata®, or we could add a "quad ID" column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG->R.

Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of a reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one, and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter and a thing to discredit the sem web for anyone coming from databases.

I heard from Bryan that the new W3C RDF WG had declared provenance out of scope, unfortunately. The word on the street, on the other hand, is that provenance is increasingly found to be an issue. This is confirmed by the active work of the W3C Provenance Working Group.

04/17/2012 15:34 GMT
ICDE 2012 (post 3 of 6) - What Is Timely LOD Search Worth?

There was a talk (Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces (Slides (PDF)); Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira, Axel Polleres and Manfred Hauswirth) at the Data Engineering Meets the Semantic Web (DESWEB) workshop at ICDE last week about the problems of caching LOD, whether attempted by Sindice or OpenLink's LOD Cloud Cache. The conclusion was that OpenLink covered a bit more of the test data sets, and that Sindice was perhaps more up to date on the ones it covered, but that neither did particularly well. The data sets were random graphs of user FOAF profiles and the like, collected from some Billion Triples data set; thus not data likely to have commercial value, except perhaps in huge quantities for advertising, and even then click streams and the like are much more valuable.

Being involved with at least one of these, and being in the audience, I felt obligated to comment. The fact is, neither OpenLink's LOD Cloud Cache nor Sindice is a business, and there is not a business model which could justify keeping them timely on the web crawls they contain. Doing so is easy enough, if there is a good enough reason.

The talk did make a couple of worthwhile points: The data does change; and if one queries entities, one encounters large variation in change-frequency across entities and their attributes.

The authors suggested having a piece of middleware decide what can safely be retrieved from a copy and what has to be retrieved from the source. Not much is in fact known about the change frequency of the data, except that it changes, as the authors pointed out.

The crux of the matter is that the thing that ought to know this best is the query processor at the LOD warehouse. For client-side middleware to split the query, it needs access to statistics that it must get from the warehouse or keep by itself. Of course, in concrete application scenarios, you go to the source if you ask about the weather or traffic jams, and otherwise go to the warehouse based on application-level knowledge.

But for actual business intelligence, one needs histories, so a search engine with only the present is not so interesting. At any rate, refreshing the data should leave a trail of past states. Exposing this for online query would just triple the price, so we forget about that for now. Just keeping an append-only table of history is not too much of a problem. One may make extracts from this table into a relational form for specific business questions. There is no point doing such analytics in RDF itself. One would have to just try to see if there is anything remotely exploitable in such histories. Making a history table is easy enough. Maybe I will add one.

Let us now see what it would take to operate a web crawl cache that would be properly provisioned, kept fresh, and managed. We base this on the Sindice crawl sizes and our experiments on these; the non-web-crawl LOD Cloud Cache is not included.

From previous experience we know the sizing: 5 Gt (billion triples) per 144 GB RAM. Today's best price point is 24-DIMM E5 boards, i.e., 192 GB RAM, or 6.67 Gt. A unit like that (8 TB HDD, 0.5 TB SSD, 192 GB RAM, 12-core E5, InfiniBand) costs about $6800.

The Sindice crawl is now about 20Gt, so $28K of gear (768GB RAM) is enough. Let us count this 4 times: 2x for anticipated growth; and 2x for running two copies -- one for online, and one for batch jobs. This is 3TB RAM. Power is 16x500W = 8KW, which we could round to 80A at 110V. Colocation comes to $500 for the space, and $1200 per month for power; make it $2500 per month with traffic included.

At this rate, 3 year TCO is $120K + ( 36 * $2.5K ) = $210K. This takes one person half time to operate, so this is another $50K per year.

We do not count software development in this, except some scripting that should be included in the yearly $50K DBA bill.

Under what circumstances is such a thing profitable? Or can such a thing be seen as a marketing demo, to be paid for by license or service sales?

A third party can operate a system of this sort, but then the cost will be dominated by software licenses if running on Virtuoso cluster.

For comparison, the TB at EC2 costs ((( 16 * $2 ) * 24 ) * 31 ) = $23,808 per month. With reserved instances, it is ( 16 * ( $2192 + ((( 0.7 * 24 ) * 365 ) * 3 ))) / 36 = $8938 per month for a 3 year term. Counting at 3TB, the 3 year TCO is $965K at EC2. AWS has volume discounts, but they start higher than this; the ( 3 * ( 16 * $2K )) = $96K reserved host premium is under $250K. So if you do not even exceed their first volume discount threshold, it does not look likely you can cut a special deal with AWS.

(The AWS prices are calculated with the high memory instances, approximately 64GB usable RAM each. The slightly better CC2 instance is a bit more expensive.)

Yet another experiment to make is whether a system as outlined will even run at anywhere close to the performance of physical equipment. This is uncertain; clouds are not for speed, based on what we have seen. They make the most sense when the monthly bill is negligible in relation to the cost of a couple of days of human time.

04/17/2012 15:33 GMT Modified: 04/19/2012 16:43 GMT