I will here take a break from core database and talk a bit about EU policies for research funding.
I had lunch with Stefan Manegold of CWI last week, where we talked about where European research should go. Stefan is involved in RETHINK big, a European research project for compiling policy advice regarding big data for EC funding agencies. As part of this, he is interviewing various stakeholders such as end user organizations and developers of technology.
RETHINK big wants to come up with a research agenda primarily for hardware, anything from faster networks to greener data centers. CWI represents software expertise in the consortium.
So, we went through a regular questionnaire about how we see the landscape. I will summarize this below, as this is anyway informative.
My own core competence is in core database functionality, specifically in high performance query processing, scale-out, and managing schema-less data. Most of the Virtuoso installed base is in the RDF space, but most potential applications are in fact outside of this niche.
The life sciences vertical is the one in which I have the most application insight, from going to Open PHACTS meetings and holding extensive conversations with domain specialists. We have users in many other verticals, from manufacturing to financial services, but there I do not have as much exposure to the actual applications.
Having said this, the challenges throughout tend to be in diversity of data. Every researcher has their MySQL database or spreadsheet, and there may not even be a top level catalogue of everything. Data formats are diverse. Some people use linked data (most commonly RDF) as a top level metadata format. The application data, such as gene sequences or microarray assays, reside in their native file formats and there is little point in RDF-izing these.
There are also public data resources that are published in RDF serializations as vendor-neutral, self-describing format. Having everything as triples, without a priori schema, makes things easier to integrate and in some cases easier to describe and query.
So, the challenge is in the labor intensive nature of data integration. Data comes with different levels of quantity and quality, from hand-curated to NLP extractions. Querying in the single- or double-digit terabyte range with RDF is quite possible, as we have shown many times on this blog, but most use cases do not even go that far. Anyway, what we see on the field is primarily a data diversity game. The scenario is data integration; the technology we provide is database. The data transformation proper, data cleansing, units of measure, entity de-duplication, and such core data-integration functions are performed using diverse, user-specific means.
Jerven Bolleman of the Swiss Institute of Bioinformatics is a user of ours with whom we have long standing discussions on the virtues of federated data and querying. I advised Stefan to go talk to him; he has fresh views about the volume challenges with unexpected usage patterns. Designing for performance is tough if the usage pattern is out of the blue, like correlating air humidity on the day of measurement with the presence of some genomic patterns. Building a warehouse just for that might not be the preferred choice, so the problem field is not exhausted. Generally, I’d go for warehousing though.
OK. Even a fast network is a network. A set of processes on a single shared-memory box is also a kind of network. InfiniBand is maybe half the throughput and 3x the latency of single threaded interprocess communication within one box. The operative word is latency. Making large systems always involves a network or something very much like one in large scale-up scenarios.
On the software side, next to nobody understands latency and contention; yet these are the one core factor in any pursuit of scalability. Because of this situation, paradigms like MapReduce and bulk synchronous parallel (BSP) processing have become popular because these take the communication out of the program flow, so the programmer cannot muck this up, as otherwise would happen with the inevitability of destiny. Of course, our beloved SQL or declarative query in general does give scalability in many tasks without programmer participation. Datalog has also been used as a means of shipping computation around, as in the the work of Hellerstein.
There are no easy solutions. We have built scale-out conscious, vectorized extensions to SQL procedures where one can express complex parallel, distributed flows, but people do not use or understand these. These are very useful, even indispensable, but only on the inside, not as a programmer-facing construct. MapReduce and BSP are the limit of what a development culture will absorb. MapReduce and BSP do not hide the fact of distributed processing. What about things that do? Parallel, partitioned extensions to Fortran arrays? Functional languages? I think that all the obvious aids to parallel/distributed programming have been conceived of. No silver bullet; just hard work. And above all the discernment of what paradigm fits what problem. Since these are always changing, there is no finite set of rules, and no substitute for understanding and insight, and the latter are vanishingly scarce. "Paradigmatism," i.e., the belief that one particular programming model is a panacea outside of its original niche, is a common source of complexity and inefficiency. This is a common form of enthusiastic naïveté.
If you look at power efficiency, the clusters that are the easiest to program consist of relatively few high power machines and a fast network. A typical node size is 16+ cores and 256G or more RAM. Amazon has these in entirely workable configurations, as documented earlier on this blog. The leading edge in power efficiency is in larger number of smaller units, which makes life again harder. This exacerbates latency and forces one to partition the data more often, whereas one can play with replication of key parts of data more freely if the node size is larger.
One very specific item where research might help without having to rebuild the hardware stack would be better, lower-latency exposure of networks to software. Lightweight threads and user-space access, bypassing slow protocol stacks, etc. MPI has some of this, but maybe more could be done.
So, I will take a cluster of such 16-core, 256GB machines on a faster network, over a cluster of 1024 x 4G mobile phones connected via USB. Very selfish and unecological, but one has to stay alive and life is tough enough as is.
The transition from capex to opex may be approaching maturity, as there have been workable cloud configurations for the past couple of years. The EC2 from way back, with at best a 4 core 16G VM and a horrible network for $2/hr, is long gone. It remains the case that 4 months of 24x7 rent in the cloud equals the purchase price of physical hardware. So, for this to be economical long-term at scale, the average utilization should be about 10% of the peak, and peaks should not be on for more than 10% of the time.
So, database software should be rented by the hour. A 100-150% markup for the $2.80 a large EC2 instance costs would be reasonable. Consider that 70% of the cost in TPC benchmarks is database software.
There will be different pricing models combining different up-front and per-usage costs, just as there are for clouds now. If the platform business goes that way and the market accepts this, then systems software will follow. Price/performance quotes should probably be expressed as speed/price/hour instead of speed/price.
The above is rather uncontroversial but there is no harm restating these facts. Reinforce often.
This is a harder question. There is some European business in wide area and mobile infrastructures. Competing against Huawei will keep them busy. Intel and Mellanox will continue making faster networks regardless of European policies. Intel will continue building denser compute nodes, e.g., integrated Knight’s Corner with dual IB network and 16G fast RAM on chip. Clouds will continue making these available on demand once the technology is in mass production.
What’s the next big innovation? Neuromorphic computing? Quantum computing? Maybe. For now, I’d just do more engineering along the core competence discussed above, with emphasis on good marketing and scalable execution. By this I mean trained people who know something about deployment. There is a huge training gap. In the would-be "Age of Data," knowledge of how things actually work and scale is near-absent. I have offered to do some courses on this to partners and public alike, but I need somebody to drive this show; I have other things to do.
I have been to many, many project review meetings, mostly as a project partner but also as reviewer. For the past year, the EC has used an innovation questionnaire at the end of the meetings. It is quite vague, and I don’t think it delivers much actionable intelligence.
What would deliver this would be a venture capital type activity, with well-developed networks and active participation in developing a business. The EC is not now set up to perform this role, though. But the EC is a fairly large and wealthy entity, so it could invest some money via this type of channel. Also there should be higher individual incentives and rewards for speed and excellence. Getting the next Horizon 2020 research grant may be good, but better exists. The grants are competitive enough and the calls are not bad; they follow the times.
In the projects I have seen, productization does get some attention, e.g., the LOD2 stack, but it is not something that is really ongoing or with dedicated commercial backing. It may also be that there is no market to justify such dedicated backing. Much of the RDF work has been "me, too" — let’s do what the real database and data integration people do, but let’s just do this with triples. Innovation? Well, I took the best of the real DB world and adapted this to RDF, which did produce a competent piece of work with broad applicability, extending outside RDF. Is there better than this? Well, some of the data integration work (e.g., LIMES) is not bad, and it might be picked up by some of the players that do this sort of thing in the broader world, e.g., Informatica, the DI suites of big DB vendors, Tamr, etc. I would not know if this in fact adds value to the non-RDF equivalents; I do not know the field well enough, but there could be a possibility.
The recent emphasis for benchmarking, spearheaded by Stefano Bertolo is good, as exemplified by the LDBC FP7. There should probably be one or two projects of this sort going at all times. These make challenges known and are an effective means of guiding research, with a large multiplier: Once a benchmark gets adopted, infinitely more work goes into solving the problem than in stating it in the first place.
The aims and calls are good. The execution by projects is variable. For 1% of excellence, there apparently must be 99% of so-and-so, but this is just a fact of life and not specific to this context. The projects are rather diffuse. There is not a single outcome that gets all the effort. In this, the level of engagement of participants is less and focus is much more scattered than in startups. A really hungry, go-getter mood is mostly absent. I am a believer in core competence. Well, most people will agree that core competence is nice. But the projects I have seen do not drive for it hard enough.
It is hard to say exactly what kinds of incentives could be offered to encourage truly exceptional work. The American startup scene does offer high rewards and something of this could be transplanted into the EC project world. I would not know exactly what form this could take, though.
About this entry:
Author: Virtuoso Data Space Bot
Published: 06/29/2015 15:36 GMT
Comment Status: 0 Comments