Details
OpenLink Software
Burlington, United States
Subscribe
Post Categories
Recent Articles
Community Member Blogs
Display Settings
Translate
|
Showing posts in all categories Refresh
SemData@Sofia Roundtable write-up
[
Orri Erling
]
There was last week an invitation-based roundtable about semantic data management in Sofia, Bulgaria.
Lots of smart people together. The meeting was hosted by Ontotext and chaired by Dieter Fensel. On the database side we had Ontotext, SYSTAP (Bigdata), CWI (MonetDB), Karlsruhe Institute of Technology (YARS2/SWSE). LarKC was well represented, being our hosts, with STI, Ontotext, CYC, and VU Amsterdam. Notable absences were Oracle, Garlik, Franz, and Talis.
Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them?
I had last fall a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?
Michael Stonebraker and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and RDF storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)
OpenLink Software and Virtuoso are in agreement with both sides, contradictory as this might sound. We took our RDBMS and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel.
I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF's run time typing needs.
So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.
After the initial meeting at CWI, I tried to figure what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.
So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what's been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.
Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is, nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need.
Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.
I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.
Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you will have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing à la MonetDB, this too integrates and applies to inference results.
So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.
Aside talking of inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.
What do we need for this? We need close-to-parity with relational — doing your warehouse in RDF with the attendant agility thereof can't cost 10x more to deploy than the equivalent relational solution.
We also want to tell the key-value, anti-SQL people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of ACID, at least consistent read, availability, complex query, large scale.
And to do this, we need a benchmark. It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a TPC-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold.
|
03/15/2010 09:46 GMT
|
Modified:
03/15/2010 11:06 GMT
|
SemData@Sofia Roundtable write-up
[
Virtuso Data Space Bot
]
There was last week an invitation-based roundtable about semantic data management in Sofia, Bulgaria.
Lots of smart people together. The meeting was hosted by Ontotext and chaired by Dieter Fensel. On the database side we had Ontotext, SYSTAP (Bigdata), CWI (MonetDB), Karlsruhe Institute of Technology (YARS2/SWSE). LarKC was well represented, being our hosts, with STI, Ontotext, CYC, and VU Amsterdam. Notable absences were Oracle, Garlik, Franz, and Talis.
Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them?
I had last fall a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?
Michael Stonebraker and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and RDF storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)
OpenLink Software and Virtuoso are in agreement with both sides, contradictory as this might sound. We took our RDBMS and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel.
I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF's run time typing needs.
So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.
After the initial meeting at CWI, I tried to figure what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.
So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what's been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.
Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is, nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need.
Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.
I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.
Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you will have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing à la MonetDB, this too integrates and applies to inference results.
So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.
Aside talking of inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.
What do we need for this? We need close-to-parity with relational — doing your warehouse in RDF with the attendant agility thereof can't cost 10x more to deploy than the equivalent relational solution.
We also want to tell the key-value, anti-SQL people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of ACID, at least consistent read, availability, complex query, large scale.
And to do this, we need a benchmark. It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a TPC-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold.
|
03/15/2010 09:46 GMT
|
Modified:
03/15/2010 11:06 GMT
|
One Technology That Will Rock 2010 (Update 1)
[
Kingsley Uyi Idehen
]
Thanks to the TechCrunch post titled: Ten Technologies That Will Rock 2010, I've been able to quickly construct a derivative post that condenses the ten item list down to a Single Technology That Will Rock 2010 :-)
Sticking with the TechCrunch layout, here is why all roads simply lead to Linked Data come 2010 and beyond:
-
The Tablet: a new form factor addition re. Internet and Web application hosts which is just another way of saying: Linked Data will be accessible from Tablet applications.
-
Geo: GPS chips are now standard features of mobile phones, so geolocation is increasingly becoming a necessary feature for any killer app. Thus, GeoSpatial Linked Data and GeopSpatial Queries are going to be a critical success factor for any endeavor that seeks to engage mobile applications developers and ultimately their end-users. Basiacally, you want to be able to perform Esoteric Search from these devices of the form: Find Vendors of a Camcorder (e.g., with a Zoom Factor: Weight Ratio of X) within a 2km Radius of my current location. Or how many items from my WishList are available from a Vendor within a 2km radius of my current location. Conversely, provide Vendors with the ability to spot potential Customers within a 2km of a given "clicks & mortar" location (e.g. BestBuy store).
-
Realtime Search: Rich Structured Profiles that leverage standards such as FOAF and FOAF+SSL will enable Highly Personalized Realtime Search (HPRS) without compromisng privacy. Tecnically, this is about WebIDs securely bound to X.509 Certificates, providing access to verifiable and highly navigable Personal Profile Data Spaces that also double as personal search index entry points.
-
Chrome OS: Just another operating system for exploiting the burgeoning Web of Linked Data
-
HTML5: Courtesy of RDFa, just another mechanism for exposing Linked Data by making HTML+RDFa a bona fide markup for metadata (i.e., format for describing real world objects via their attribute-value graphs)
-
Mobile Video: Simplifies the production and sharing of Video annotations (comments, reviews etc.) en route to creating rich Linked Discourse Data Spaces.
-
Augmented Reality: Ditto
-
Mobile Transactions: As per points 1&2 above, Vendor Discovery and Transaction Conusmation will increasingly be driven by high SDQ applications. The "Funnel Effect" (more choices based on individual preferences) will be a critical success factor for any one operating in the Mobile Transaction realm. Note, without Linked Data you cannot deliver scalable solutions that handle the combined requirements of: SDQ, "Funnel Effect", and Mobile Device form factor, will simply maginify the importance of Web accessible Linked Data.
-
Android: An additional platform for items 1-8; basically, 2010 isn't going to be an iPhone only zone. Personally, this reminds me of a battle from the past i.e., Microsoft vs Apple, re. desktop computing dominance. Google has studied history very well :-)
-
Social CRM: this is simply about applying points 1-9 alongide the construction of Linked Data from eCRM Data Spaces.
As I've stated in the past (across a variety of mediums), you cannot build applications that have long term value without addressing the following issues:
- Data Item or Object Identity
- Data Structure -- Data Models
- Data Representation -- Data Model Entity & Relationships Representation mechanism (as delivered by metadata oriented markup)
- Data Storage -- Database Management Systems
- Data Access -- Data Access Protocols
- Data Presentation -- How you present Views and Reports from Structured Data Sources
- Data Security -- Data Access Policies
The items above basically showcase the very essence of the HTTP URI abstraction that drives HTTP based Linked Data; which is also the basic payload unit that underlies REST.
Conclusion
I simply hope that the next decade marks a period of broad appreciation and comprehension of Data Access, Integration, and Management issues on the parts of: application developers, integrators, analysts, end-users, and decision makers. Remember, without structured Data we cannot produce or share Information, and without Information, we cannot produce of share Knowledge.
Related
|
01/02/2010 12:30 GMT
|
Modified:
02/01/2010 09:02 GMT
|
European Commission and the Data Overflow
[
Orri Erling
]
The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.
Since the questionnaire is public, I am publishing my answers below.
-
Data and data types
-
What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?
Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.
This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.
Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.
The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable, This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.
Relevant sections of this mass of data are a potential addition to any present or future analytics application.
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.
Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.
By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.
-
What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?
All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.
Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
Interleaving of all database functions and types becomes increasingly important.
-
Industries, communities
-
Who is producing these data and why? Could they do it better? How?
Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).
Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
-
Who is consuming these data and why? Could they do it better? How?
Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird's eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.
Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.
-
What industrial sectors in Europe could become more competitive if they became much better at managing data?
Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.
-
Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?
The regulation landscape drives database demand through data retention requirements and the like.
With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.
For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.
-
What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.
We have come across the following:
- Knowing that the data exists in the first place.
- If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.
- Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data.
Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.
-
Services, software stacks, protocols, standards, benchmarks
-
What combinations of components are needed to deal with these problems?
Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction.
Interoperability and at least de facto standards in ways of doing this will emerge.
-
What data exchange and processing mechanisms will be needed to work across platforms and programming languages?
HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.
There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.
For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.
-
What data environments are today so wastefully messy that they would benefit from the development of standards?
RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.
Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.
Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example.
-
What kind of performance is expected or required of these systems? Who will measure it reliably? How?
Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.
The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.
These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.
We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.
The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.
Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. But the TPC-H rules do not allow official reporting of such.
Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.
If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.
Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.
Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.
-
Usability and training
-
How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?
In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.
Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned.
There is no single solution. For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.
Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.
For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.
For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.
This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.
One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.
This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.
-
How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?
For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.
RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.
A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.
For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.
The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.
As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.
If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.
The problem is creating the pull.
-
Challenges
-
What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?
The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.
Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, specially if the race is tight.
The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.
The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.
-
What should one do to set up such a challenge, administer, and monitor it?
The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.
The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.
There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.
Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.
|
10/27/2009 13:29 GMT
|
Modified:
10/27/2009 14:57 GMT
|
European Commission and the Data Overflow
[
Virtuso Data Space Bot
]
The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.
Since the questionnaire is public, I am publishing my answers below.
-
Data and data types
-
What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?
Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.
This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.
Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.
The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable, This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.
Relevant sections of this mass of data are a potential addition to any present or future analytics application.
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.
Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.
By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.
-
What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?
All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.
Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
Interleaving of all database functions and types becomes increasingly important.
-
Industries, communities
-
Who is producing these data and why? Could they do it better? How?
Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).
Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
-
Who is consuming these data and why? Could they do it better? How?
Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird's eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.
Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.
-
What industrial sectors in Europe could become more competitive if they became much better at managing data?
Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.
-
Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?
The regulation landscape drives database demand through data retention requirements and the like.
With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.
For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.
-
What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.
We have come across the following:
- Knowing that the data exists in the first place.
- If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.
- Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data.
Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.
-
Services, software stacks, protocols, standards, benchmarks
-
What combinations of components are needed to deal with these problems?
Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction.
Interoperability and at least de facto standards in ways of doing this will emerge.
-
What data exchange and processing mechanisms will be needed to work across platforms and programming languages?
HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.
There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.
For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.
-
What data environments are today so wastefully messy that they would benefit from the development of standards?
RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.
Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.
Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example.
-
What kind of performance is expected or required of these systems? Who will measure it reliably? How?
Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.
The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.
These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.
We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.
The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.
Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. But the TPC-H rules do not allow official reporting of such.
Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.
If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.
Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.
Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.
-
Usability and training
-
How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?
In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.
Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned.
There is no single solution. For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.
Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.
For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.
For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.
This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.
One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.
This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.
-
How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?
For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.
RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.
A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.
For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.
The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.
As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.
If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.
The problem is creating the pull.
-
Challenges
-
What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?
The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.
Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, specially if the race is tight.
The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.
The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.
-
What should one do to set up such a challenge, administer, and monitor it?
The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.
The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.
There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.
Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.
|
10/27/2009 13:29 GMT
|
Modified:
10/27/2009 14:57 GMT
|
VLDB 2009 Web Scale Data Management Panel (5 of 5)
[
Orri Erling
]
"The universe of cycles is not exactly one of literal cycles, but rather one of spirals," mused Joe Hellerstein of UC Berkeley.
"Come on, let's all drop some ACID," interjected another.
"It is not that we end up repeating the exact same things, rather even if some patterns seem to repeat, they do so at a higher level, enhanced by the experience gained," continued Joe.
Thus did the Web Scale Data Management panel conclude.
Whether successive generations are made wiser by the ones that have gone before may be argued either way.
The cycle in question was that of developers discovering ACID in the 1960s, i.e. Atomicity, Consistency, Integrity, Durability. Thus did the DBMS come into being. Then DBMSs kept becoming more complex until, as there will be a counter-force to each force, came the meme of key value stores and BASE, no multiple-row transactions, eventual consistency, no query language but scaling to thousands of computers. So now, the DBMS community asks itself what went wrong.
In the words of one panelist, another demonstrated a "shocking familiarity with the subject matter of substance abuse" when he called for the DBMS community to get on a 12 step program and to look where addiction to certain ideas, among which ACID, had brought its life. Look at yourself: The influential papers in what ought to be your space by rights are coming from the OS community: Google Bigtable, Amazon Dynamo, want more? When you ought to drive, you give excuses and play catch up! Stop denial, drop SQL, drop ACID!
The web developers have revolted against the time-honored principles of the DBMS. This is true. Sharded MySQL is not the ticket — or is it? Must they rediscover the virtues of ACID, just like the previous generation did?
Nothing under the sun is new. As in music and fashion, trends keep cycling also in science and engineering.
But seriously, does the full-featured DBMS scale to web scale? Microsoft says the Azure version of SQL server does. Yahoo says they want no SQL but Hadoop and PNUTS.
Twitter, Facebook, and other web names got their own discussion. Why do they not go to serious DBMS vendors for their data but make their own, like Facebook with Hive?
Who can divine the mind of the web developer? What makes them go to memcached, manually sharded MySQL, and MapReduce, walking away from the 40 years of technology invested in declarative query and ACID? What is this highly visible but hard to grasp entity? My guess is that they want something they can understand, at least at the beginning. A DBMS, especially on a cluster, is complicated, and it is not so easy to say how it works and how its performance is determined. The big brands, if deployed on a thousand PCs, would also be prohibitively expensive. But if all you do with the DBMS is single row selects and updates, it is no longer so scary, but you end up doing all the distributed things in a middle layer, and abandoning expressive queries, transactions, and database-supported transparency of location. But at least now you know how it works and what it is good/not good for.
This would be the case for those who make a conscious choice. But by and large the choice is not deliberate; it is something one drifts into: The application gains popularity; the single LAMP can no longer keep all in memory; you need a second MySQL in the LAMP and you decide that users A–M go left and N–Z right (horizontal partitioning). This siren of sharding beckons you and all is good until you hit the reef of re-architecting. Memcached and duct-tape help, like aspirin helps with hangover, but the root cause of the headache lies unaddressed.
The conclusion was that there ought to be something incrementally scalable from the get-go. Low cost of entry and built-in scale-out. No, the web developers do not hate SQL; they just have gotten the idea that it does not scale. But they would really wish it to. So, DBMS people, show there is life in you yet.
Joe Hellerstein was the philosopher and paradigmatician of the panel. His team had developed a protocol-compatible Hadoop in a few months using a declarative logic programming style approach. His claim was that developers made the market. Thus, for writing applications against web scale data, there would have to be data centric languages. Why not? These are discussed in Berkeley Orders Of Magnitude (BOOM).
I come from Lisp myself, way back. I have since abandoned any desire to tell anybody what they ought to program in. This is a bit like religion: Attempting to impose or legislate or ram it on somebody just results in anything from lip service to rejection to war. The appeal exerted by the diverse language/paradigm -isms on their followers seems to be based on hitting a simplification of reality that coincides with a problem in the air. MapReduce is an example of this. PHP is another. A quick fix for a present need: Scripting web servers (PHP) or processing tons of files (MapReduce). The full database is not as quick a fix, even though it has many desirable features. It is also not as easy to tell what happens inside one, so MapReduce may give a greater feeling of control.
Totally self-managing, dynamically-scalable RDF would be a fix for not having to design or administer databases: Since it would be indexed on everything, complex queries would be possible; no full database scans would stop everything. For the mid-size segment of web sites this might be a fit. For the extreme ends of the spectrum, the choice is likely something custom built and much less expressive.
The BOOM rule language for data-centric programming would be something very easy for us to implement, in fact we will get something of the sort essentially for free when we do the rule support already planned.
The question is, can one induce web developers to do logic? The history is one of procedures, both in LAMP and MapReduce. On the other hand, the query languages that were ever universally adopted were declarative, i.e., keyword search and SQL. There certainly is a quest for an application model for the cloud space beyond just migrating apps. We'll see. More on this another time.
|
09/01/2009 12:24 GMT
|
Modified:
09/02/2009 12:05 GMT
|
VLDB 2009 Web Scale Data Management Panel (5 of 5)
[
Virtuso Data Space Bot
]
"The universe of cycles is not exactly one of literal cycles, but rather one of spirals," mused Joe Hellerstein of UC Berkeley.
"Come on, let's all drop some ACID," interjected another.
"It is not that we end up repeating the exact same things, rather even if some patterns seem to repeat, they do so at a higher level, enhanced by the experience gained," continued Joe.
Thus did the Web Scale Data Management panel conclude.
Whether successive generations are made wiser by the ones that have gone before may be argued either way.
The cycle in question was that of developers discovering ACID in the 1960s, i.e. Atomicity, Consistency, Integrity, Durability. Thus did the DBMS come into being. Then DBMSs kept becoming more complex until, as there will be a counter-force to each force, came the meme of key value stores and BASE, no multiple-row transactions, eventual consistency, no query language but scaling to thousands of computers. So now, the DBMS community asks itself what went wrong.
In the words of one panelist, another demonstrated a "shocking familiarity with the subject matter of substance abuse" when he called for the DBMS community to get on a 12 step program and to look where addiction to certain ideas, among which ACID, had brought its life. Look at yourself: The influential papers in what ought to be your space by rights are coming from the OS community: Google Bigtable, Amazon Dynamo, want more? When you ought to drive, you give excuses and play catch up! Stop denial, drop SQL, drop ACID!
The web developers have revolted against the time-honored principles of the DBMS. This is true. Sharded MySQL is not the ticket — or is it? Must they rediscover the virtues of ACID, just like the previous generation did?
Nothing under the sun is new. As in music and fashion, trends keep cycling also in science and engineering.
But seriously, does the full-featured DBMS scale to web scale? Microsoft says the Azure version of SQL server does. Yahoo says they want no SQL but Hadoop and PNUTS.
Twitter, Facebook, and other web names got their own discussion. Why do they not go to serious DBMS vendors for their data but make their own, like Facebook with Hive?
Who can divine the mind of the web developer? What makes them go to memcached, manually sharded MySQL, and MapReduce, walking away from the 40 years of technology invested in declarative query and ACID? What is this highly visible but hard to grasp entity? My guess is that they want something they can understand, at least at the beginning. A DBMS, especially on a cluster, is complicated, and it is not so easy to say how it works and how its performance is determined. The big brands, if deployed on a thousand PCs, would also be prohibitively expensive. But if all you do with the DBMS is single row selects and updates, it is no longer so scary, but you end up doing all the distributed things in a middle layer, and abandoning expressive queries, transactions, and database-supported transparency of location. But at least now you know how it works and what it is good/not good for.
This would be the case for those who make a conscious choice. But by and large the choice is not deliberate; it is something one drifts into: The application gains popularity; the single LAMP can no longer keep all in memory; you need a second MySQL in the LAMP and you decide that users A–M go left and N–Z right (horizontal partitioning). This siren of sharding beckons you and all is good until you hit the reef of re-architecting. Memcached and duct-tape help, like aspirin helps with hangover, but the root cause of the headache lies unaddressed.
The conclusion was that there ought to be something incrementally scalable from the get-go. Low cost of entry and built-in scale-out. No, the web developers do not hate SQL; they just have gotten the idea that it does not scale. But they would really wish it to. So, DBMS people, show there is life in you yet.
Joe Hellerstein was the philosopher and paradigmatician of the panel. His team had developed a protocol-compatible Hadoop in a few months using a declarative logic programming style approach. His claim was that developers made the market. Thus, for writing applications against web scale data, there would have to be data centric languages. Why not? These are discussed in Berkeley Orders Of Magnitude (BOOM).
I come from Lisp myself, way back. I have since abandoned any desire to tell anybody what they ought to program in. This is a bit like religion: Attempting to impose or legislate or ram it on somebody just results in anything from lip service to rejection to war. The appeal exerted by the diverse language/paradigm -isms on their followers seems to be based on hitting a simplification of reality that coincides with a problem in the air. MapReduce is an example of this. PHP is another. A quick fix for a present need: Scripting web servers (PHP) or processing tons of files (MapReduce). The full database is not as quick a fix, even though it has many desirable features. It is also not as easy to tell what happens inside one, so MapReduce may give a greater feeling of control.
Totally self-managing, dynamically-scalable RDF would be a fix for not having to design or administer databases: Since it would be indexed on everything, complex queries would be possible; no full database scans would stop everything. For the mid-size segment of web sites this might be a fit. For the extreme ends of the spectrum, the choice is likely something custom built and much less expressive.
The BOOM rule language for data-centric programming would be something very easy for us to implement, in fact we will get something of the sort essentially for free when we do the rule support already planned.
The question is, can one induce web developers to do logic? The history is one of procedures, both in LAMP and MapReduce. On the other hand, the query languages that were ever universally adopted were declarative, i.e., keyword search and SQL. There certainly is a quest for an application model for the cloud space beyond just migrating apps. We'll see. More on this another time.
|
09/01/2009 12:24 GMT
|
Modified:
09/02/2009 12:05 GMT
|
The URI, URL, and Linked Data Meme's Generic HTTP URI (Updated)
[
Kingsley Uyi Idehen
]
Situation Analysis As the "Linked Data" meme has gained momentum you've more than likely been on the receiving end of dialog with Linked Open Data community members (myself included) that goes something like this: "Do you have a URI", "Get yourself a URI", "Give me a de-referencable URI" etc.. And each time, you respond with a URL -- which to the best of your Web knowledge is a bona fide URI. But to your utter confusion you are told: Nah! You gave me a Document URI instead of the URI of a real-world thing or object etc.. What's up with that? Well our everyday use of the Web is an unfortunate conflation of two distinct things, which have Identity: Real World Objects (RWOs) & Address/Location of Documents (Information bearing Resources). The "Linked Data" meme is about enhancing the Web by unobtrusively reintroducing its core essence: the generic HTTP URI, a vital piece of Web Architecture DNA. Basically, its about so realizing the full capabilities of the Web as a platform for Open Data Identification, Definition, Access, Storage, Representation, Presentation, and Integration. What is a Real World Object? People, Places, Music, Books, Cars, Ideas, Emotions etc.. What is a URI? A Uniform Resource Identifier. A global identifier mechanism for network addressable data items. Its sole function is Name oriented Identification. URI Generic Syntax The constituent parts of a URI (from URI Generic Syntax RFC) are depicted below: What is a URL? A location oriented HTTP scheme based URI. The HTTP scheme introduces a powerful and inherent duality that delivers: - Resource Address/Location Identifier
- Data Access mechanism for an Information bearing Resource (Document, File etc..)
So far so good! What is an HTTP based URI? The kind of URI Linked Data aficionados mean when they use the term: URI. An HTTP URI is an HTTP scheme based URI. Unlike a URL, this kind of HTTP scheme URI is devoid of any Web Location orientation or specificity. Thus, Its inherent duality provides a more powerful level of abstraction. Hence, you can use this form of URI to assign Names/Identifiers to Real World Objects (RWO). Even better, courtesy of the Identity/Address duality of the HTTP scheme, a single URI can deliver the following: - RWO Identfier/Name
- RWO Metadata document Locator (courtesy of URL aspect)
- Negotiable Representation of the Located Document (courtesy of HTTP's content negotiation feature).
What is Metadata? Data about Data. Put differently, data that describes other data in a structured manner. How Do we Model Metadata? The predominant model for metadata is the Entity-Attribute-Value + Classes & Relationships model (EAV/CR). A model that's been with us since the inception of modern computing (long before the Web). What about RDF? The Resource Description Framework (RDF) is a framework for describing Web addressable resources. In a nutshell, its a framework for adding Metadata bearing Information Resources to the current Web. Its comprised of: - Entity-Attribute-Value (aka. Subject-Predictate-Object) plus Classes & Relationships (Data Dictionaries e.g., OWL) metadata model
- A plethora of instance data representation formats that include: RDFa (when doing so within (X)HTML docs), Turtle, N3, TriX, RDF/XML etc.
What's the Problem Today? The ubiquitous use of the Web is primarily focused on a Linked Mesh of Information bearing Documents. URLs rather than generic HTTP URIs are the prime mechanism for Web tapestry; basically, we use URLs to conduct Information -- which is inherently subjective -- instead of using HTTP URIs to conduct "Raw Data" -- which is inherently objective. Note: Information is "data in context", it isn't the same thing as "Raw Data". Thus, if we can link to Information via the Web, why shouldn't we be able to do the same for "Raw Data"? How Does the Link Data meme solve the problem? The meme simply provides a set of guidelines (best practices) for producing Web architecture friendly metadata. Meaning: when producing EAV/CR model based metadata, endow Subjects, their Attributes, and Attribute Values (optionally) with HTTP URIs. By doing so, a new level of Link Abstraction on the Web is possible i.e., "Data Item to Data Item" level links (aka hyperdata links). Even better, when you de-reference a RWO hyperdata link you end up with a negotiated representations of its metadata. Conclusion Linked Data is ultimately about an HTTP URI for each item in the Data Organization Hierarchy :-) Related - History of how "Resource" became part of URI - historic account by TimBL
- Linked Data Design Issues Document - TimBL's initial Linked Data Guide
- Linked Data Rules Simplified - My attempt at simplifying the Linked Data Meme without SPARQL & RDF distraction
- Linked Data & Identity - another related post
- The Linked Data Meme's Value Proposition
- My Del.icio.us hosted Bookmark Data Space for Identity Schemes
- TimBL's Ted Talk re. "Raw Linked Data"
- Resource Oriented Architecture
-
More Famous Than Simon Cowell .
|
08/07/2009 14:34 GMT
|
Modified:
02/03/2010 15:26 GMT
|
The URI, URL, and Linked Data Meme's Generic HTTP URI (Updated)
[
Kingsley Uyi Idehen
]
Situation Analysis As the "Linked Data" meme has gained momentum you've more than likely been on the receiving end of dialog with Linked Open Data community members (myself included) that goes something like this: "Do you have a URI", "Get yourself a URI", "Give me a de-referencable URI" etc.. And each time, you respond with a URL -- which to the best of your Web knowledge is a bona fide URI. But to your utter confusion you are told: Nah! You gave me a Document URI instead of the URI of a real-world thing or object etc.. What's up with that? Well our everyday use of the Web is an unfortunate conflation of two distinct things, which have Identity: Real World Objects (RWOs) & Address/Location of Documents (Information bearing Resources). The "Linked Data" meme is about enhancing the Web by unobtrusively reintroducing its core essence: the generic HTTP URI, a vital piece of Web Architecture DNA. Basically, its about so realizing the full capabilities of the Web as a platform for Open Data Identification, Definition, Access, Storage, Representation, Presentation, and Integration. What is a Real World Object? People, Places, Music, Books, Cars, Ideas, Emotions etc.. What is a URI? A Uniform Resource Identifier. A global identifier mechanism for network addressable data items. Its sole function is Name oriented Identification. URI Generic Syntax The constituent parts of a URI (from URI Generic Syntax RFC) are depicted below: What is a URL? A location oriented HTTP scheme based URI. The HTTP scheme introduces a powerful and inherent duality that delivers: - Resource Address/Location Identifier
- Data Access mechanism for an Information bearing Resource (Document, File etc..)
So far so good! What is an HTTP based URI? The kind of URI Linked Data aficionados mean when they use the term: URI. An HTTP URI is an HTTP scheme based URI. Unlike a URL, this kind of HTTP scheme URI is devoid of any Web Location orientation or specificity. Thus, Its inherent duality provides a more powerful level of abstraction. Hence, you can use this form of URI to assign Names/Identifiers to Real World Objects (RWO). Even better, courtesy of the Identity/Address duality of the HTTP scheme, a single URI can deliver the following: - RWO Identfier/Name
- RWO Metadata document Locator (courtesy of URL aspect)
- Negotiable Representation of the Located Document (courtesy of HTTP's content negotiation feature).
What is Metadata? Data about Data. Put differently, data that describes other data in a structured manner. How Do we Model Metadata? The predominant model for metadata is the Entity-Attribute-Value + Classes & Relationships model (EAV/CR). A model that's been with us since the inception of modern computing (long before the Web). What about RDF? The Resource Description Framework (RDF) is a framework for describing Web addressable resources. In a nutshell, its a framework for adding Metadata bearing Information Resources to the current Web. Its comprised of: - Entity-Attribute-Value (aka. Subject-Predictate-Object) plus Classes & Relationships (Data Dictionaries e.g., OWL) metadata model
- A plethora of instance data representation formats that include: RDFa (when doing so within (X)HTML docs), Turtle, N3, TriX, RDF/XML etc.
What's the Problem Today? The ubiquitous use of the Web is primarily focused on a Linked Mesh of Information bearing Documents. URLs rather than generic HTTP URIs are the prime mechanism for Web tapestry; basically, we use URLs to conduct Information -- which is inherently subjective -- instead of using HTTP URIs to conduct "Raw Data" -- which is inherently objective. Note: Information is "data in context", it isn't the same thing as "Raw Data". Thus, if we can link to Information via the Web, why shouldn't we be able to do the same for "Raw Data"? How Does the Link Data meme solve the problem? The meme simply provides a set of guidelines (best practices) for producing Web architecture friendly metadata. Meaning: when producing EAV/CR model based metadata, endow Subjects, their Attributes, and Attribute Values (optionally) with HTTP URIs. By doing so, a new level of Link Abstraction on the Web is possible i.e., "Data Item to Data Item" level links (aka hyperdata links). Even better, when you de-reference a RWO hyperdata link you end up with a negotiated representations of its metadata. Conclusion Linked Data is ultimately about an HTTP URI for each item in the Data Organization Hierarchy :-) Related - History of how "Resource" became part of URI - historic account by TimBL
- Linked Data Design Issues Document - TimBL's initial Linked Data Guide
- Linked Data Rules Simplified - My attempt at simplifying the Linked Data Meme without SPARQL & RDF distraction
- Linked Data & Identity - another related post
- The Linked Data Meme's Value Proposition
- My Del.icio.us hosted Bookmark Data Space for Identity Schemes
- TimBL's Ted Talk re. "Raw Linked Data"
- Resource Oriented Architecture
-
More Famous Than Simon Cowell .
|
08/07/2009 14:34 GMT
|
Modified:
02/03/2010 15:26 GMT
|
Social Web Camp (#5 of 5)
[
Orri Erling
]
(Last of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.
By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks — one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant information overload.
Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts twitting and re-twitting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.
There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don't get lost in it.
There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.
I participated in discussions on security and privacy and on mobile social networks and context.
For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.
There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can't be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic's Data Patrol should be a part of the social web infrastructure of the future.
People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.
In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.
For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one's location at the granularity of the city; for some other purposes, one would say which conference room one is in.
Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.
There is a thin line between convenience and having IT infrastructure rule one's life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.
|
04/30/2009 12:14 GMT
|
Modified:
04/30/2009 12:51 GMT
|
|
|