Michael Stonebraker chaired a panel on the future of science data at ICDE 2012 last week. Other participants were Jeremy Kepner from MIT Lincoln Labs, Anastasia Ailamaki from EPFL, and Alex Szalay from Johns Hopkins University.
This is the thrust of what was said, noted from memory. My comments follow after the synopsis.
Jeremy Kepner: We have a lot of arrays and a custom database for them. But the arrays are sparse, so this is in fact a triple store. Our users like to work in MATLAB, and any data management must run together with that.

Of course, MapReduce is not a real scheduler, and Hadoop is not a real file system. For deployment, we must integrate real schedulers and make HDFS look like a file system to applications. The abstraction of a file system is something people like. Being able to skip the time-consuming data-ingestion step that a database requires is an advantage of file-based paradigms like Hadoop. If enhanced with the right scheduling features, this can be a good component in the HPC toolbox.
Michael Stonebraker: Users of the data use math packages like R, MATLAB, SAS, SPSS, or similar. If business intelligence is about AVG, MIN, MAX, COUNT, and GROUP BY, science applications are much more diverse in their analytics. All science algorithms have an inner loop that resembles linear algebra operations like matrix multiplication. Data is more often than not a large array. There are some graphs in biology and chemistry, but the world is primarily rectangular. Relational databases can emulate sparse arrays but are 20x slower than a custom-made array database for dense arrays. And I will not finish without picking on MapReduce: I know of 2000-node MapReduce clusters. The work they do is maybe that of a 100-node parallel database. So if 2000 nodes is what you want to operate, be my guest.
Science database is a zero-billion-dollar business. We do not expect to make money from the science market with SciDB, which by now works and has commercial services supplied by Paradigm4, while the code itself is open source, which is a must for the science community. The real business opportunity is in the analytics needed by insurance and financial services in general, which are next to identical to the science use cases SciDB tackles. This makes the vendors pay attention.
Alex Szalay: The way astronomy is done today is through surveys: a telescope scans the sky and produces data. We have now operated the Sloan Digital Sky Survey for 10 years and kept the data online. We have all the data, and complete query logs, available for anyone interested. When we set out to do this with Jim Gray, everybody found it a crazy idea, but it has worked out.
Anastasia Ailamaki: We do not use SciDB. We find a lot of spatial use cases. Researchers need access to simulation results, which are usually over a spatial model, as in earthquake and brain simulations. Off-the-shelf techniques like R-trees do not work -- the objects overlap too much -- so we have made our own spatial indexing. We make custom software when it is necessary, and are not tied to vendors. In geospatial applications, we can create meshes of different shapes -- like tetrahedra or cubes for earthquakes, and cylinders for the brain -- and index these in a geospatial index. Since an R-tree is inefficient when objects overlap this much, we use the index only to find one object; then, because neighboring objects are reachable from one another, we follow the mesh to get all the objects in the area of interest.
* * *
This is obviously a diverse field. Probably the message that we can synthesize out of this is that flexibility and parallel programming models are what we need to pay attention to. There is a need to go beyond what one can do in SQL while continuing to stay close to the data. Also, allowing for plug-in data types and index structures may be useful; we sometimes get requests for such anyway.
The continuing argument around MapReduce and Hadoop is a lasting feature of the landscape. A parallel DB will beat MapReduce any day at joining across partitions; the problem is to overcome the mindset that sees Hadoop as the always-first answer to anything parallel. People will likely have to fail with this before they try anything else. For us, the matter is about having database-resident logic for extract-transform-load (ETL) that can do data-integration-style transformations, and maybe iterative graph algorithms that constantly join across partitions, better than a MapReduce job, while still allowing application logic to be written in Java.

Teaching sem-web-heads to write SQL procedures and to know about join order, join type, and partition locality has proven difficult. People do not understand latency, whether in client-server or cluster settings. This is why they do not see the point of stored procedures or of shipping functions to data. This sounds like a terrible indictment, like saying that people do not understand why rivers flow downhill. Yet it is true. This is also why MapReduce is maybe the only parallel programming paradigm that can be successfully deployed in the absence of this understanding: it is actually quite latency-tolerant, having no synchronous cross-partition operations except for the succession of the map and reduce steps themselves.
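To make the latency point concrete, here is a minimal sketch, using a hypothetical orders table and Virtuoso-style procedure syntax, of the difference between row-at-a-time client logic and shipping the function to the data:

```sql
-- Client-side pattern: one round trip per row; total time is dominated by
-- row_count * network latency, no matter how fast the server is.
--   FOR each id IN (SELECT id FROM orders WHERE status = 'stale') DO
--     UPDATE orders SET status = 'archived' WHERE id = :id;  -- one round trip each
--
-- Server-side pattern: the same work wrapped in a stored procedure.
CREATE PROCEDURE archive_stale ()
{
  UPDATE orders SET status = 'archived' WHERE status = 'stale';
}
;

-- A single round trip, regardless of row count; the logic runs next to the data.
archive_stale ();
```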
Maybe it is so that the database guys see MapReduce as an insult to their intelligence and the rest of the world sees it as the only understandable way of running grep and sed (Unix commands for string search/replace) in parallel, with the super bonus of letting you reshuffle the outputs so that you can compare everything to everything else, which grep alone never let you do.
Making a database that does not need data loading seems a nice idea, and CWI has actually done something in this direction in "Here are my Data Files. Here are my Queries. Where are my Results?" There is also a product called Algebra Data that claims to take in data without loading and to optimize storage based on access patterns. We do not have immediate plans in this direction. Bulk load is already quite fast (100GB of TPC-H loads in 70 minutes or so), but faster is always possible.
There were quite a few talks about graphs at ICDE 2012. Neither the representations of graphs nor the differences between RDF and generic graph models entered much into the discussion. On the other hand, graph similarity search and related topics were addressed a fair bit.
Graph DB and RDF/Linked Data are distinct, if neighboring, disciplines. On one hand, graph problems predate Linked Data; on the other, the RDF/Linked Data world is a web artifact, which graphs as such are not, so a slightly different cultural derivation also keeps them apart. Besides, graphs may imply schema-first, whereas Linked Data basically cannot. Another differentiation comes from edges not really being first-class citizens in RDF, except via reification -- and at that, the RDF reification vocabulary is miserably inadequate, as pointed out before.
RDF is being driven by the web-style publishing of Linked Open Data (LOD), with some standardization and uptake by publishers; Graph DB is not standardized but driven by diverse graph-analytics use cases.
There is no necessary reason why these could not converge, but it will be indefinitely long before any standards come to cover this, so best not hold one's breath. Communities are jealous of their borders, so if the neighbor does something similar one tends to emphasize the differences and not the commonalities.
So for some things, one could warehouse the original RDF of the web microformats and LOD, and then ETL into some other graph model for specific tasks, or just do these in RDF. Of course, then RDF systems need to offer suitable capabilities. These seem to be about very fast edge traversal within a rather local working set, and about accommodating large, iteratively-updated intermediate results, e.g., edge weights.
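As a concrete illustration of what large, iteratively-updated intermediate results could look like on the SQL side, here is a minimal sketch assuming a hypothetical edge table ETL'd out of RDF; table and column names are illustrative only:

```sql
-- Edges extracted from RDF into a plain relational form.
CREATE TABLE edge (src BIGINT, dst BIGINT, weight REAL, PRIMARY KEY (src, dst));

-- Per-node scores, recomputed on each iteration (a PageRank-style update).
CREATE TABLE node_rank      (node BIGINT PRIMARY KEY, score REAL);
CREATE TABLE node_rank_next (node BIGINT PRIMARY KEY, score REAL);

-- One iteration: the full set of new scores is a large intermediate result,
-- produced by a join plus aggregation running inside the database.
DELETE FROM node_rank_next;
INSERT INTO node_rank_next (node, score)
SELECT e.dst, 0.15 + 0.85 * SUM (r.score * e.weight)
  FROM edge e JOIN node_rank r ON r.node = e.src
 GROUP BY e.dst;
-- Swap node_rank and node_rank_next (or copy back) and repeat until convergence.
```

In a partitioned setup, the join of edge.src to node_rank.node is exactly the kind of constant cross-partition joining mentioned above.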
Judging by the benchmarks paper at the GDM workshop (Benchmarking traversal operations over graph databases (Slidedeck (ppt), paper (pdf)); Marek Ciglan, Alex Averbuch, and Ladislav Hluchy), the state of benchmarking in graph databases is even worse than in RDF, where the state is bad enough. The paper's premise was flawed from the start, using application logic to do JOINs instead of doing them in the DBMS. In this way, latency comes to dominate, and only the most blatant differences are seen. There is nothing like this style of benchmarking to make an industry look bad. The supercomputer Graph 500 benchmark, on the other hand, lets the contestants make their own implementations on a diversity of architectures, with random traversal as well as loading and generating large intermediate results. It is somewhat limited, but still broader than the graph database benchmarks paper at the GDM workshop.
Returning to graphs, there were some papers on similarity search and clique detection. As players in this space, beyond just RDF, we might as well consider implementing necessary features for efficient expression of such problems. The algorithms discussed were expressed in procedural code against memory-based data structures; there is usually no query language or parallel/distributed processing involved.
MapReduce has become the default way people tackle such problems at scale; in fact, people do not consider anything else, as far as I can tell. They certainly do not consider MPI, for example, as a first choice. The parallel array facilities in Fortran do not at first sight seem very graphy, so this is likely not something that crosses one's mind either.
We should try some of the similarity search and clustering in SQL with a parallel programming model. We have excellent expression-evaluation speed from vectoring and unrestricted recursion between partitions, and no file system latencies like MapReduce. The initial test case will be some of the linking/data-integration/mapping workloads in LOD2.
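For a flavor of what the similarity-search side might look like in plain SQL, here is a minimal sketch; the feature table, its tokenization, and the threshold are all hypothetical:

```sql
-- One row per (entity, token), where tokens might come from labels or
-- attribute values gathered during a LOD2-style linking job.
CREATE TABLE feat (ent BIGINT, tok BIGINT, PRIMARY KEY (tok, ent));

-- Candidate entity pairs sharing at least 3 tokens; the self-join on tok is
-- what a partitioned engine can parallelize, with no map/reduce file I/O.
SELECT a.ent AS ent1, b.ent AS ent2, COUNT (*) AS overlap
  FROM feat a JOIN feat b ON a.tok = b.tok AND a.ent < b.ent
 GROUP BY a.ent, b.ent
HAVING COUNT (*) >= 3;
-- A full Jaccard score would additionally divide by the size of the union of
-- the two entities' token sets.
```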
Having some sort-of-agreed-upon benchmark for these workloads would make this more worthwhile. Again, we will see what emerges.
I gave an invited talk ("Virtuoso 7 - Column Store and Adaptive Techniques for Graph" (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012.
Bryan Thompson of Systap (the Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done: introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations.

The reification packs all the fields of a triple (or quad) into a single string, and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable-length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable-length column only in the O, but could of course have one also in S, and even in P and G. The column store would still compress the same as long as reified values did not occur. The reified values themselves would be unlikely to compress very well, but run-length and dictionary compression would always work.
So, we could do it like Bigdata®, or we could add a "quad ID" column to one of the indices, to give a reification ID to quads. Again, there is no penalty in a column store if you do not access the column. Or we could make an extra table of PSOG->R (sketched below).
Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified, and on whether one wishes to SELECT based on parts of a reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one, and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter, and a thing to discredit the sem web in the eyes of anyone from the database world.
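As a sketch of the PSOG->R option above, the side table could look roughly like this; names are illustrative, and the O column uses a Virtuoso-style ANY type since objects are variable-length:

```sql
-- A reification ID (R) per quad, keyed on the quad itself; only quads that
-- actually get reified ever appear here.
CREATE TABLE rdf_reif (
  p BIGINT,
  s BIGINT,
  o ANY,       -- variable-length, as in the O column of the quad table
  g BIGINT,
  r BIGINT,    -- the reification ID handed out for this quad
  PRIMARY KEY (p, s, o, g)
);
-- Statements about the statement are then ordinary quads with R as their subject.
```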
I heard from Bryan that the new W3C RDF WG had declared provenance out of scope, unfortunately. The word on the street, on the other hand, is that provenance is increasingly found to be an issue. This is confirmed by the active work of the W3C Provenance Working Group.
There was a talk at the Data Engineering Meets the Semantic Web (DESWEB) workshop at ICDE last week (Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces (Slides (PDF)); Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira, Axel Polleres, and Manfred Hauswirth) about the problems of caching LOD, whether attempted by Sindice or OpenLink's LOD Cloud Cache. The conclusion was that OpenLink covered a bit more of the test data sets, that Sindice was maybe more up to date on the ones it did cover, and that neither did very well. The data sets were random graphs of user FOAF profiles and such, collected from some Billion Triples data set -- thus not data likely to have commercial value, except maybe in huge quantities for advertising, and even there click streams and the like are much more valuable.
Being involved with at least one of these, and being in the audience, I felt obligated to comment. The fact is, neither OpenLink's LOD Cloud Cache nor Sindice is a business, and there is no business model that would justify keeping them current on the web crawls they contain. Doing so is easy enough, if there is a good enough reason.
The talk did make a couple of worthwhile points: The data does change; and if one queries entities, one encounters large variation in change-frequency across entities and their attributes.
The authors suggested having a piece of middleware decide what can safely be retrieved from a copy and what has to be retrieved from the source. Not much is in fact known about the change frequency of the data, except that it changes, as the authors pointed out.
The crux of the matter is that the thing that ought to know this best is the query processor at the LOD warehouse. For client-side middleware to split the query, it needs access to statistics that it must get from the warehouse or keep by itself. Of course, in concrete application scenarios, you go to the source if you ask about the weather or traffic jams, and otherwise go to the warehouse based on application-level knowledge.
But for actual business intelligence one needs histories, so a search engine with only the present is not so interesting. At any rate, refreshing the data should leave a trail of past states. Exposing this for online query would just triple the price, so we forget about that for now. Just keeping an append-only table of history is not too much of a problem. One may make extracts from this table into a relational form for specific business questions; there is no point doing such analytics in RDF itself. One would just have to try and see whether there is anything remotely exploitable in such histories. Making a history table is easy enough. Maybe I will add one.
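A minimal sketch of such an append-only history table, with illustrative names and a per-crawl timestamp; the real key and types would follow the quad store's internal IDs:

```sql
CREATE TABLE quad_history (
  g    BIGINT,       -- graph, i.e., the crawled source
  s    BIGINT,
  p    BIGINT,
  o    ANY,          -- variable-length object, as in the quad table
  seen DATETIME,     -- crawl timestamp at which this state was observed
  op   CHAR (1),     -- 'I' for inserted, 'D' for deleted, relative to the previous crawl
  PRIMARY KEY (g, s, p, o, seen)
);
-- Refresh appends only the delta against the previous crawl. Online queries keep
-- going against the current quads, so the history costs disk, not working set.
```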
Let us now see what it would take to operate a web crawl cache that would be properly provisioned, kept fresh, and managed. We base this on the Sindice crawl sizes and our experiments on these; the non-web-crawl LOD Cloud Cache is not included.
From previous experience we know the sizing: 144GB RAM per 5Gt (billion triples). Today's best price point is on 24-DIMM E5 boards, so 192GB RAM, or 6.67Gt per box. A unit like that (8TB HDD, 0.5TB SSD, 192GB RAM, 12-core E5, InfiniBand) costs about $6800.
The Sindice crawl is now about 20Gt, so $28K of gear (four such units, 768GB RAM) is enough. Let us count this 4 times: 2x for anticipated growth, and 2x for running two copies -- one for online use, one for batch jobs. This makes 16 units and 3TB of RAM. Power is 16 x 500W = 8kW, which we can round to 80A at 110V. Colocation comes to $500 per month for the space and $1200 per month for power; make it $2500 per month with traffic included.
At this rate, 3 year TCO is $120K + ( 36 * $2.5K ) = $210K. This takes one person half time to operate, so this is another $50K per year.
We do not count software development in this, except some scripting that should be included in the yearly $50K DBA bill.
Under what circumstances is such a thing profitable? Or can such a thing be seen as a marketing demo, to be paid for by license or service sales?
A third party can operate a system of this sort, but then the cost will be dominated by software licenses if running on Virtuoso cluster.
For comparison, a TB of RAM at EC2 costs 16 instances x $2/hour x 24 hours x 31 days = $23,808 per month on demand. With 3-year reserved instances ($2192 upfront plus roughly $0.70/hour), 16 instances work out to about $8938 per month over the term. Counting at 3TB, the 3-year TCO is $965K at EC2. AWS has volume discounts, but they start higher than this: the reserved upfront premium of 3 x 16 x $2K = $96K is under the $250K first discount threshold. So if you do not even reach their first volume-discount tier, it does not look likely that you can cut a special deal with AWS.
(The AWS prices are calculated with the high memory instances, approximately 64GB usable RAM each. The slightly better CC2 instance is a bit more expensive.)
Yet another experiment would be to see whether a system as outlined, if run on EC2, would come anywhere close to the performance of physical equipment. This is uncertain; clouds are not built for speed, based on what we have seen. They make the most sense when the monthly bill is negligible in relation to the cost of a couple of days of human time.
We have played around with LOD data sets and Virtuoso Column Store for the past several months. I will here give a few numbers and comment on some different platform comparisons that we have made. The answer at the end of this is how to size a system for often-changing web-style data. The conclusion is a data-to-RAM ratio that gives an acceptable working set without driving the price up by forcing 100% RAM residence.
The experiment is loading Sindice web crawls. The platform is 2 x Xeon 5520 with 144GB RAM. The initial load rate is 200-180Kt/s, dropping to 100Kt/s at 5Gt because of I/O. The system is Virtuoso Column Store configured to run as 4 processes and 32 partitions, all on the same box. After 5Gt we just see more I/O, and going further is not relevant; one runs CPU-bound or not at all.

We use 4 Crucial SSDs in the setup. The hot structures, like the RDF quad indices, are on SSD, and the cold ones are on hard disk. A cold structure is a write-only index, such as the dictionary of literals (ID to literal).

For bulk load, SSDs turn out not to be particularly useful. For a cold start, on the other hand, SSDs cut the warmup time of 144GB of RAM from over half an hour to a couple of minutes. It is possible that Intel SSDs would also help with bulk load, but this has not been tried. The SSD problem during bulk load is that these drives do not write very fast, and while writes are queued, read latency goes up; so under a constant write load, the SSD's famous instantaneous random read no longer applies.
The fragment considered in the example is 4.95Gt: 8.1M pages' worth of quads, 12.7M pages of literals and IRIs, and 4.71M pages of full-text index. A page is 8KB. The files on disk contain empty pages, but these do not matter since they do not take up RAM. The quad indices take 13.4 bytes/quad; the row-wise equivalent used to be 38 or so bytes/quad with similar data. Two-thirds of the IRI and literal string data could benefit from column-wise stream compression. (This was not used, but if it were, we could count on a 50% drop in size for the data affected; so instead of 12.7M pages we could maybe get 8.5M on a good day. This could be worth doing, but is not a priority.) The system was configured to have 12M database pages in RAM, so a little under half the database pages of the set fit in RAM at one time; one cannot call this a memory-only setup. Given how non-local this data is, this is about as far as one can lean on secondary storage without it becoming an over-2x slowdown. In practice we are talking about under 1% of the rows accessed coming from secondary storage, but that alone means half the throughput.
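For reference, the 13.4 bytes/quad figure and the "a little under half in RAM" statement follow directly from the page counts above (8KB pages):

$$\frac{8.1\times10^{6}\ \text{pages}\times 8192\ \text{B}}{4.95\times10^{9}\ \text{quads}} \approx 13.4\ \text{B/quad}, \qquad \frac{12\text{M pages in RAM}}{(8.1+12.7+4.71)\text{M pages}} \approx 47\%.$$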
We note that this data set represents the worst that we have seen. It has 129M distinct graphs, i.e., about 38 triples per graph. Regular data like the synthetic benchmark sets takes half the space per quad. This is about a third of a Sindice crawl; the other two-thirds look the same, as far as we looked.
So if you are interested in hosting data like this, you can budget 144GB RAM for every 5Gt. Do not try it with anything less. Budgeting double this is wise, so that you have space to cook the data; this is important since in order to do things with it, one needs to at least copy things for materializing transformations.
If you are budget-constrained and hosting very regular content like UniProt, you can budget maybe 144GB RAM for every 10Gt.
As for CPU, it does not matter as much, as long as you do not go to disk. Just for load speed, DBpedia loads in 300s on a cluster of eight (8) dual AMD 2378 boxes at 2.6GHz (8 cores per host, so 64 cores in the cluster), and in 945s on one (1) dual Xeon 5520 box at 2.26GHz (8 cores in the host). Intel makes much better CPUs, as we see. Both scenarios are 100% in RAM. For even more regular data, the load rates are a bit higher: 1.3Mt/s for the AMD cluster, and 300Kt/s for the Xeon host.
The interconnect for the AMD cluster is 1 x gigE but this does not matter for load. For CPU-bound cross-partition JOINs, 1 or 2 x gigE is insufficient; 4 x gigE might barely make it; InfiniBand should be safe. When running cross-partition JOINs, a single 8-core Xeon box generates about 300MB/s of interconnect traffic; a gigE connection can maybe take 50MB/s with some luck.
Intel E5 does not appear dramatically better than Nehalem, but we will see in a while, when we make measurements on real equipment. Prior to the E5 release, we tried Amazon EC2 CC2 ("Cluster Compute Eight Extra Large Instance" -- 2 x 8-core E5, 2.66GHz). The results were inconclusive; it never did more than 1.9x better than a Xeon 5520, even on a pure CPU-bound test (a recursive Fibonacci function in SQL: no cache misses, no I/O). With a database JOIN, 1.3x better is the best we saw. But this must be the fault of Amazon and not of E5.
We also tried AMD "Magny-Cours", but with 32 cores against 8 it never did over 2x better -- more like 1.4x often enough -- and single-thread speed was 50% worse, so not a good buy. We did not find a Bulldozer to try, and did not feel like buying one, since the reviews did not promise more per-core speed than the Magny-Cours.
It seems that especially with Column Store, we are truly CPU-bound and not memory-latency- or bandwidth-bound. This is based on the observation that a Xeon 5620 with 2 of 3 memory channels populated loads BSBM data only 10% faster than the same with 1 of 3 channels populated, with CPU affinity set on a dual socket system.
So, if you have a choice between a $2K processor (E5-2690) and a $600 processor (E5-2630), buy the cheaper one and get RAM with the money saved.
$1440 buys 128GB in $90 8GB DIMMs. Then buy E5 boards with 24 DIMM slots -- one board for every 7Gt of web crawl data. If your software licenses are priced per core, getting higher-clock 4-core E5s might make sense.
While on the subject of bytes and quads/triples, we note that Bigdata®'s recent announcement says up to 50 billion triples per single server. Franz loaded at a good 800+ Kt/s rate, up to a trillion triples. From the spec one is led to think that this was with less than full CPU but still with highly local data, considering that at roughly 1.5 bytes of RAM per triple, anything less local would hit very heavy I/O. Their statement to the effect that the data was LUBM-like corroborates this, so we are not talking about exactly the same thing.
So if you compare the claims, I am talking about running CPU-bound on the worst data there is. Franz and Bigdata® do not specify, so it is hard to compare. LOD2 should in principle publish actual metrics with at least Bigdata®; Franz is not participating in these races.
We may publish some more detailed measurements with more varied configurations later. The thing to remember is minimum 144GB RAM for every 5Gt of web crawls, if you want to load and refresh in RAM.
LOD2's database contributions are, on one hand, Virtuoso Column Store and Elastic Cluster, and on the other, the demonstration and proof from CWI that indeed all of the relational innovations for which CWI is well known apply to graph/RDF data as well.
The value is unquestionable both to Virtuoso users in the short-term, and to the state of science and to all RDF users and vendors in the mid-term.
The LOD2 claim of "linking the universe" (my words) will be tested soon enough, after we first put the universe in a bucket.
This refers to a real-time quad store of Sindice crawls, plus a warehouse of the LOD data sets.
This effort raises a few questions that I will treat in a number of posts to follow.
What is done now is under-provisioned and not kept up to date. We are talking about all the RDF on the web, in near real time, with arbitrary queries. This is very far from the "billion triples" data sets or vertical portals, both of which are easy by comparison.