Last week the RDF and graph DB benchmarking project, LDBC, had its 3rd Technical User Community meeting in London, held in collaboration with the GraphConnect event. The meeting marked the official launch of the LDBC non-profit company, the successor to the present EU FP7 project.

The meeting was very well attended, and most of the new advisory board took part: Xavier Lopez from Oracle, Luis Ceze from the University of Washington, and Abraham Bernstein of the University of Zurich were present. Jans Aasman of Franz, Inc., and Karl Huppler, former chairman of the TPC, could not attend but have signed up as advisory board members.

We had great talks by the new board members and invited graph and RDF DB users.

Nuno Carvalho of Fujitsu Labs presented on the Fujitsu RDF use cases and benchmarking requirements, based around streaming analytics on time-series data. The technology platform is diverse, with anything from RDF stores to HBase. The challenge is integration. I pointed out that with the Virtuoso column store, you can now efficiently host time-series data alongside RDF. Sure, a relational format is more efficient for time series, but it can be co-located with RDF, and queries can join between the two (see the sketch below). This is especially so given the stellar bulk-load speed we measured with the TPC-H dataset.
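To make the co-location point concrete, here is a minimal Python sketch of such a cross-model join over ODBC. The DSN, credentials, table, and graph IRI are invented for illustration; the SPASQL derived-table form (a SPARQL query used as a SQL table) is a Virtuoso feature, but the exact syntax should be checked against the server version at hand.

```python
# Sketch: join a relational time-series table with RDF sensor metadata
# in one Virtuoso query. DSN, credentials, table, and IRIs are invented.
import pyodbc

conn = pyodbc.connect("DSN=VirtuosoLocal;UID=dba;PWD=dba")  # hypothetical DSN
cur = conn.cursor()

# The SPARQL text inside FROM is a SPASQL derived table; its variables
# (?s, ?sid) surface as SQL columns of the alias "meta".
cur.execute("""
    SELECT meta.s, AVG(r.reading_value)
      FROM readings r,
           (SPARQL SELECT ?s ?sid
              FROM <http://example.org/sensors>
             WHERE { ?s a <http://example.org/Sensor> ;
                       <http://example.org/sensorId> ?sid }) AS meta
     WHERE r.sensor_id = meta.sid
     GROUP BY meta.s
""")
for sensor_iri, avg_value in cur.fetchall():
    print(sensor_iri, avg_value)
```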

Luis Ceze of the University of Washington presented Grappa, a C++ graph programming framework that, in his words, is like the Cray XMT (later YarcData) in software. The idea is to divide a graph algorithm into small executable steps, millions in number, and to schedule and switch between these very efficiently, building latency tolerance into every step of the application. Commodity interconnects like InfiniBand deliver poor throughput with small messages, but millions of mini work units offer endless opportunities for combining messages, so overall throughput stays good (see the sketch after this paragraph). We know the same from all the Virtuoso scale-out work. Luis is presently working on GraphBench, a research project at the University of Washington funded by Oracle for graph algorithm benchmarking. The major interest for LDBC is in having a library of common graph analytics as a starting point. Once these exist, the data generation can evolve further so as to create challenges for the algorithms. One issue that came up is the question of validating graph algorithm results: unlike with SQL queries, there is not necessarily a single correct answer. If the algorithm to use and the number of iterations to run are not fully specified, response times will vary widely, and random walks will in any case create variation between consecutive runs.
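As a rough illustration of the combining principle, and emphatically not of Grappa's actual API, here is a Python sketch: updates destined for the same node are buffered and flushed in batches, so a network that is bad at small messages only ever sees large ones. All names are invented.

```python
# Conceptual sketch of message combining: many tiny work units emit
# single updates; the combiner turns them into a few large transfers.
from collections import defaultdict

BATCH = 1024  # flush threshold; a real system tunes this to the network

class Combiner:
    def __init__(self, send_batch):
        self.buffers = defaultdict(list)   # destination node -> pending messages
        self.send_batch = send_batch       # transport callback, e.g. an IB send

    def send(self, dest, msg):
        buf = self.buffers[dest]
        buf.append(msg)
        if len(buf) >= BATCH:              # combine many small messages into one
            self.send_batch(dest, buf)
            self.buffers[dest] = []

    def flush(self):
        for dest, buf in self.buffers.items():
            if buf:
                self.send_batch(dest, buf)
        self.buffers.clear()

# Usage: millions of mini work units would call send(); here, a toy run.
c = Combiner(lambda dest, batch: print(f"node {dest}: {len(batch)} msgs"))
for v in range(5000):
    c.send(v % 4, ("visit", v))
c.flush()
```

A real framework would add flow control and overlap the flushes with computation; the point here is only the batching.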

Abraham Bernstein presented his work on the Signal/Collect graph programming framework and its applications in fraud detection. He also talked about the EU FP7 project ViSTA-TV, which does massive stream processing around the real-time behavior of internet TV users. Again, Abraham gave very direct suggestions for what to include in the LDBC graph analytics workload.

Andreas Both of Unister presented on RDF ontology-driven applications in an e-commerce context. Unister is Germany’s leading e-commerce portal operator, with a large number of properties ranging from travel to most B2C segments. The RDF use cases are many, extending in principle down to final content delivery, but high online demand often calls for specialized solutions such as bit-field intersections for combining conditions (sketched below). Sufficiently advanced database technology may offer this too, but that is not a given. Selecting travel destinations based on attributes like sports opportunities, culture, etc., can be compiled into efficient query plans, but this also requires perfect query plans for short queries. I expect to learn more about this when visiting on site. There is clear input for LDBC in these workloads.
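For illustration, here is what the bit-field trick looks like in miniature, with invented attributes and destinations: each attribute holds a bitmap with one bit per destination, and combining conditions is a bitwise AND over the bitmaps.

```python
# Toy bit-field intersection: one bit per travel destination, one
# bitmap per attribute; all data is invented for illustration.
destinations = ["Alps", "Barcelona", "Rome", "Ibiza"]

attr_bits = {
    "sports":  0b0011,   # Alps, Barcelona
    "culture": 0b0110,   # Barcelona, Rome
    "beach":   0b1010,   # Barcelona, Ibiza
}

def matching(*attrs):
    bits = ~0                              # start with all bits set
    for a in attrs:
        bits &= attr_bits[a]               # intersect one condition at a time
    return [d for i, d in enumerate(destinations) if bits >> i & 1]

print(matching("sports", "culture"))       # ['Barcelona']
```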

There were three talks on semantic applications in cultural heritage. Robina Clayphan of Europeana talked about this pan-European digital museum and library and about the Europeana Data Model (EDM). C.E. Ore of the University of Oslo talked about the CIDOC CRM (Conceptual Reference Model) ontology (ISO standard 21127:2006) and its role in representing cultural, historic, and archaeological information. Atanas Kiryakov of Ontotext gave a talk on a possible benchmark around CIDOC CRM reasoning. In the present LDBC work, RDF inference plays a minor role, but reasoning would be emphasized in this CRM workload, where the inference needed revolves around abbreviating unions of many traversal paths of different lengths between modeled objects (illustrated below). The data is not very large, but the ontology has a lot of detail. This is still not the elusive use case that would really require the full complexity of OWL. We will first see how the semantic publishing benchmark work led by Ontotext in LDBC plays out. There is enough work there in any case.
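To show the flavor of the inference involved, here is a toy Python sketch with an invented mini-ontology (not actual CIDOC CRM classes): one derived relation abbreviates the union of several traversal paths of different lengths.

```python
# Toy data: a find, its find site, a region, and an event, as triples.
triples = {
    ("find1", "foundAt", "site1"),
    ("site1", "locatedIn", "regionA"),
    ("event1", "tookPlaceAt", "regionA"),
    ("event1", "involved", "find1"),
}

def objects(s, p):
    return [o for (s2, p2, o) in triples if s2 == s and p2 == p]

def related_places(thing):
    """Derived relation: union of foundAt, foundAt/locatedIn, and
    the two-step path through events involving the thing."""
    places = set(objects(thing, "foundAt"))
    for site in objects(thing, "foundAt"):
        places.update(objects(site, "locatedIn"))
    for (e, p, o) in triples:
        if p == "involved" and o == thing:
            places.update(objects(e, "tookPlaceAt"))
    return places

print(sorted(related_places("find1")))   # ['regionA', 'site1']
```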

The most concrete result was that the graph analytics part of the LDBC agenda is starting to take shape. The LDBC organization is being formed, and its processes and policies are being defined. I visited Thomas Neumann’s group in Munich just prior to the TUC meeting to work on this. Nowadays Peter Boncz, who was recently awarded the Humboldt Prize, goes to Munich on a weekly basis, so Munich is the favored destination for much LDBC-related work.

The first workload of the Social Network Benchmark is taking shape, and there is good progress also on the Semantic Publishing Benchmark. I will give more commentary on these workloads in a future post, now that the initial drafts from the respective task forces are out.