Details
OpenLink Software
Burlington, United States
European Commission and the Data Overflow
[Orri Erling]
The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.
Since the questionnaire is public, I am publishing my answers below.
-
Data and data types
-
What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?
Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.
This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.
Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.
The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.
Relevant sections of this mass of data are a potential addition to any present or future analytics application.
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation in the coming years even more than the continued classical warehouse growth.
Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.
By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.
-
What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?
All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.
Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
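As a rough illustration of the sunset query above, here is a minimal Python sketch of evaluating all the conditions in one place, next to the data. Everything in it is an assumption made for illustration: the `Photo` record, the bounding box for Ibiza, and especially `looks_like_sunset`, which is a toy stand-in for a real image classifier. A DBMS would do this with indexes and a query optimizer rather than a list scan.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Photo:
    owner: str
    taken_on: date
    lat: float
    lon: float
    megapixels: float
    pixels: tuple  # tiny stand-in for the image blob: (r, g, b) triples


def in_box(photo, south, west, north, east):
    """Geospatial condition: a plain bounding-box test."""
    return south <= photo.lat <= north and west <= photo.lon <= east


def looks_like_sunset(pixels):
    """Toy classifier (hypothetical): more than half the pixels are red-dominant."""
    warm = sum(1 for r, g, b in pixels if r > g and r > b)
    return warm * 2 > len(pixels)


def sunset_photos(photos, friends, box, start, end, min_megapixels):
    """Evaluate every condition where the data lives; return only the matches."""
    return [
        p for p in photos
        if p.owner in friends                 # social graph
        and start <= p.taken_on <= end        # temporal
        and in_box(p, *box)                   # geospatial
        and p.megapixels > min_megapixels     # scalar attribute
        and looks_like_sunset(p.pixels)       # blob test, reached only by survivors
    ]


# Hypothetical sample data; the box is only a rough approximation of Ibiza.
IBIZA = (38.8, 1.1, 39.2, 1.7)  # south, west, north, east
photos = [
    Photo("ana", date(2009, 7, 14), 38.98, 1.43, 21.0,
          ((250, 120, 40), (240, 90, 60), (30, 40, 200))),
    Photo("ana", date(2009, 7, 15), 38.98, 1.43, 21.0,
          ((20, 40, 200), (10, 200, 30), (30, 40, 200))),
    Photo("bob", date(2009, 7, 14), 38.98, 1.43, 24.0,
          ((250, 120, 40),) * 3),
]
matches = sunset_photos(photos, {"ana"}, IBIZA,
                        date(2009, 6, 1), date(2009, 8, 31), 20)
```

Because `and` short-circuits, the expensive blob test runs only on photos that already passed the cheap conditions, and only the matching records leave the function. That is the point of taking the operation to the data.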
Interleaving of all database functions and types becomes increasingly important.
-
Industries, communities
-
Who is producing these data and why? Could they do it better? How?
Right now, projects such as Bio2RDF, Neurocommons, and DBpedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBpedia URIs).
Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
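To make the idea of "giving everything a URI" concrete, the following is a minimal sketch of turning one relational row into N-Triples-style statements. It is not the mapping language the W3C group is working on; the base URI, the URI layout, and the function names are all assumptions chosen for illustration.

```python
from urllib.parse import quote

BASE = "http://example.com/"  # hypothetical base URI
XSD = "http://www.w3.org/2001/XMLSchema#"


def uri(table, key):
    """Mint a URI for a row from its table name and primary key."""
    return f"<{BASE}{table}/{quote(str(key), safe='')}>"


def literal(value):
    """Render a Python value as an N-Triples literal."""
    if isinstance(value, bool):
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int):
        return f'"{value}"^^<{XSD}integer>'
    text = str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{text}"'


def row_to_triples(table, key_column, row, foreign_keys=None):
    """Map one row to triples; foreign keys become links to other rows' URIs."""
    foreign_keys = foreign_keys or {}
    subject = uri(table, row[key_column])
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject; NULLs yield no triple
        predicate = f"<{BASE}schema/{table}#{column}>"
        if column in foreign_keys:
            obj = uri(foreign_keys[column], value)
        else:
            obj = literal(value)
        triples.append(f"{subject} {predicate} {obj} .")
    return triples


# Hypothetical support ticket that references a CRM customer.
ticket = {"id": 7, "customer_id": 42, "subject": 'Printer "jams"'}
triples = row_to_triples("ticket", "id", ticket, {"customer_id": "customer"})
```

The foreign key is what does the integrating: the ticket now points at the same customer URI that the CRM data would use, so the two can be joined without a shared schema.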
-
Who is consuming these data and why? Could they do it better? How?
Consumers are various. The greatest need is for tools that summarize complex data and give a bird's eye view of what data is available in the first place. Consuming the data is hindered by the user not necessarily even knowing what data exists. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.
Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.
-
What industrial sectors in Europe could become more competitive if they became much better at managing data?
Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.
-
Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?
The regulation landscape drives database demand through data retention requirements and the like.
With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.
For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
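One way to picture a policy-based, on-the-fly anonymizing view over schema-last data is the sketch below. It assumes statements are stored as (subject, predicate, object, graph) tuples and that each graph carries a classification label; the labels, the salt handling, and all names are hypothetical. Note also that a salted hash is pseudonymization, not real anonymization, so this only illustrates where such a filter would sit.

```python
import hashlib


def pseudonym(uri, salt):
    """Replace a URI with a stable pseudonym: same input, same output."""
    digest = hashlib.sha256((salt + uri).encode("utf-8")).hexdigest()[:12]
    return f"urn:anon:{digest}"


def policy_view(quads, graph_labels, readable, may_identify, salt, identifying):
    """Yield only the triples this user may see, pseudonymized where required."""
    for s, p, o, g in quads:
        label = graph_labels.get(g, "restricted")  # unknown graphs default to restricted
        if label not in readable:
            continue                               # graph-level access filter
        if label != "public" and not may_identify:
            if p in identifying:
                continue                           # drop directly identifying attributes
            s = pseudonym(s, salt)                 # joins on the subject still work
        yield (s, p, o)


# Hypothetical data: one clinical graph and one public catalog graph.
quads = [
    ("http://example.org/patient/1", "name", "Alice Example", "clinic"),
    ("http://example.org/patient/1", "diagnosis", "J45", "clinic"),
    ("http://example.org/patient/1", "age_band", "30-39", "clinic"),
    ("http://example.org/drug/9", "label", "Salbutamol", "catalog"),
]
labels = {"clinic": "medical", "catalog": "public"}

researcher = list(policy_view(quads, labels, {"public", "medical"},
                              False, "s3cret", {"name"}))
visitor = list(policy_view(quads, labels, {"public"},
                           False, "s3cret", {"name"}))
```

The researcher can still correlate diagnosis with age band, since both statements share one pseudonym, but never sees the name; the visitor sees only the public catalog.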
More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.
-
What are the main practical problems identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.
We have come across the following:
- Knowing that the data exists in the first place.
- If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.
- Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data.
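The kind of one-off alignment code the last item refers to might look like the following sketch, which resamples an instrument's time series onto the timestamps of the map snapshots by linear interpolation. Linear interpolation is an assumption here, as are the numeric timestamps and the sample values; real data may call for a different method.

```python
from bisect import bisect_left


def interpolate_at(series, t):
    """Linearly interpolate a time-sorted [(time, value), ...] series at time t.

    Returns None outside the measured range rather than extrapolating.
    """
    times = [ts for ts, _ in series]
    i = bisect_left(times, t)
    if i < len(times) and times[i] == t:
        return series[i][1]            # exact measurement available
    if i == 0 or i == len(times):
        return None                    # before the first or after the last sample
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(series, target_times):
    """Resample one series onto another source's timestamps."""
    return [(t, interpolate_at(series, t)) for t in target_times]


# Hypothetical data: irregular sensor readings, map snapshots at other times.
sensor = [(0, 10.0), (10, 20.0), (25, 5.0)]
snapshots = [5, 10, 20, 30]
aligned = align(sensor, snapshots)
# aligned == [(5, 15.0), (10, 20.0), (20, 10.0), (30, None)]
```

Returning `None` instead of extrapolating keeps gaps visible to whoever consumes the aligned data.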
Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.
-
Services, software stacks, protocols, standards, benchmarks
-
What combinations of components are needed to deal with these problems?
Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are examples of DBMSs featuring work in this direction.
Interoperability and at least de facto standards in ways of doing this will emerge.
-
What data exchange and processing mechanisms will be needed to work across platforms and programming languages?
HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.
There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.
For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.
-
What data environments are today so wastefully messy that they would benefit from the development of standards?
RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.
Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.
Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS-* standards, for example.
-
What kind of performance is expected or required of these systems? Who will measure it reliably? How?
Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.
The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.
These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.
We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.
The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.
Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. But the TPC-H rules do not allow official reporting of such repurposed results.
Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.
If benchmarks exist and are neither too easy, nor inaccessibly difficult, nor too expensive to run (think of the high end TPC-C results), then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.
Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.
Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.
-
Usability and training
-
How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?
In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.
Beyond these, things get harder. Programming with MPI, for example, so as to get good utilization of a cluster platform for an arbitrary algorithm is quite difficult. The casual amateur is hereby warned.
There is no single solution. Since explicit, programmatic parallelization, with MPI for example, scales very poorly in terms of required skill, we should favor declarative and/or functional approaches that lend themselves to automatic parallelization.
Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.
For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.
For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.
This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.
One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.
This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.
-
How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?
For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.
RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, as far as possible, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.
A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.
For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.
The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.
As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.
If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.
The problem is creating the pull.
-
Challenges
-
What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?
The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.
Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.
The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.
The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.
-
What should one do to set up such a challenge, administer, and monitor it?
The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.
The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.
There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.
Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.
10/27/2009 13:29 GMT | Modified: 10/27/2009 14:57 GMT
Beyond these, things get harder. For example, programming an arbitrary algorithm with MPI so that it makes good use of a cluster platform is quite difficult. The casual amateur is hereby warned.
There is no single solution. Explicit, programmatic parallelization, with MPI for example, scales very poorly in terms of required skill, so we should favor declarative and/or functional approaches that can be parallelized automatically.
Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.
For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.
For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.
This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.
One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.
This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.
-
How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?
For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.
RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.
A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.
For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.
The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.
As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.
If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.
The problem is creating the pull.
-
Challenges
-
What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?
The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.
Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.
The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.
The incentives should be sufficient, and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work more closely together. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.
-
What should one do to set up such a challenge, administer, and monitor it?
The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.
The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.
There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.
Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.
|
10/27/2009 13:29 GMT
|
Modified:
10/27/2009 14:57 GMT
|
Provenance and Reification in Virtuoso
[
Orri Erling
]
These days, data provenance is a big topic across the board, ranging from the linked data web, to RDF in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front.
RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.
For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level.
If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph.
On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.
One could write queries like —
SELECT ?pub
WHERE
  {
    GRAPH ?g
      {
        ?person foaf:knows ?contact
      }
    ?contact foaf:name "Alice" .
    ?g xx:has_publisher ?pub
  }
This would return the publishers of graphs that assert that somebody knows Alice.
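As a rough illustration of the graph-level approach, here is a minimal Python sketch that keeps quads as plain tuples and evaluates the same three patterns by hand. All IRIs, the annotations graph name, and the function name are made up for the example; this is not how Virtuoso evaluates the query, only a way to see what the query asks for.

```python
# Quads are (graph, subject, predicate, object). All IRIs are hypothetical.
ANNOTATIONS = "urn:example:annotations"

quads = [
    ("urn:example:g1", "urn:example:bob", "foaf:knows", "urn:example:alice"),
    ("urn:example:g2", "urn:example:carol", "foaf:knows", "urn:example:dave"),
    ("urn:example:g1", "urn:example:alice", "foaf:name", "Alice"),
    ("urn:example:g2", "urn:example:dave", "foaf:name", "Dave"),
    # Provenance: the graph IRI is simply the subject of an ordinary triple,
    # here kept in a separate annotations graph.
    (ANNOTATIONS, "urn:example:g1", "xx:has_publisher", "urn:example:library"),
    (ANNOTATIONS, "urn:example:g2", "xx:has_publisher", "urn:example:social_site"),
]


def publishers_of_graphs_knowing(name, quads):
    """Publishers of graphs asserting that somebody knows `name`."""
    # ?contact foaf:name "Alice"   (no graph given, so any graph matches)
    contacts = {s for (_, s, p, o) in quads if p == "foaf:name" and o == name}
    # GRAPH ?g { ?person foaf:knows ?contact }
    graphs = {g for (g, _, p, o) in quads if p == "foaf:knows" and o in contacts}
    # ?g xx:has_publisher ?pub     (again matched in any graph)
    return sorted(o for (_, s, p, o) in quads
                  if p == "xx:has_publisher" and s in graphs)


print(publishers_of_graphs_knowing("Alice", quads))
```

Only graph g1 asserts that somebody knows Alice, so only its publisher comes back.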
Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.
If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.
Virtuoso's RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G,S,P,O.
Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.
In SPARQL, one could use an extension syntax like —
SELECT *
WHERE
  { ?person foaf:knows ?connection
      OPTION ( time ?ts ) .
    ?connection foaf:name "Alice" .
    FILTER ( ?ts > "2009-08-08"^^xsd:dateTime )
  }
This would return everybody recorded as knowing Alice, where the foaf:knows quad carries a timestamp more recent than 2009-08-08. This presupposes that the quad table has been extended with a datetime column.
The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause.
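To make the dependent-column variant more concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The table and column names (quad, inserted_by, inserted_at) are assumptions for illustration and are not Virtuoso's actual RDF_QUAD definition. The point is only that G,S,P,O stay the key, a loader unaware of the extra columns keeps working, and a plain SQL UPDATE can fill them in afterwards.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical stand-in for an extended quad table; not Virtuoso's real DDL.
con.execute("""
    CREATE TABLE quad (
        g TEXT NOT NULL, s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL,
        inserted_by TEXT,          -- dependent column, may stay NULL
        inserted_at TEXT,          -- dependent column, ISO 8601 string
        PRIMARY KEY (g, s, p, o)   -- G,S,P,O alone identify the quad
    )
""")

# A loader that knows nothing about the extra columns still works.
con.executemany(
    "INSERT INTO quad (g, s, p, o) VALUES (?, ?, ?, ?)",
    [
        ("g1", "bob", "foaf:knows", "alice"),
        ("g1", "carol", "foaf:knows", "alice"),
        ("g1", "alice", "foaf:name", "Alice"),
    ],
)

# Annotate afterwards with a plain UPDATE keyed on G,S,P,O.
annotate = (
    "UPDATE quad SET inserted_by = ?, inserted_at = ? "
    "WHERE g = ? AND s = ? AND p = ? AND o = ?"
)
con.execute(annotate, ("loader1", "2009-09-01T10:00:00",
                       "g1", "bob", "foaf:knows", "alice"))
con.execute(annotate, ("loader1", "2009-07-01T10:00:00",
                       "g1", "carol", "foaf:knows", "alice"))

# Relational counterpart of the OPTION ( time ?ts ) query.
recent = con.execute("""
    SELECT k.s, k.inserted_at
    FROM quad AS k
    JOIN quad AS n ON n.s = k.o
    WHERE k.p = 'foaf:knows'
      AND n.p = 'foaf:name' AND n.o = 'Alice'
      AND k.inserted_at > '2009-08-08'
    ORDER BY k.s
""").fetchall()
print(recent)
```

ISO 8601 strings compare correctly as text, which is why the sketch gets away without a real datetime type.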
If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like —
SELECT *
WHERE
  { ?person foaf:knows ?contact
      OPTION ( reify ?r ) .
    ?r xx:assertion_time ?ts .
    ?contact foaf:name "Alice" .
    FILTER ( ?ts > "2008-08-08"^^xsd:dateTime )
  }
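The side-table variant can be sketched in the same way. Again this uses sqlite3 as a stand-in, and the table names (quad, quad_reif) and the example IRIs are hypothetical. The quad table is left untouched, and only the quads that actually carry metadata get a row mapping G,S,P,O to a reification IRI R.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical schemas; the quad table itself is not modified.
    CREATE TABLE quad (g TEXT, s TEXT, p TEXT, o TEXT,
                       PRIMARY KEY (g, s, p, o));
    CREATE TABLE quad_reif (g TEXT, s TEXT, p TEXT, o TEXT,
                            r TEXT NOT NULL,   -- dependent part
                            PRIMARY KEY (g, s, p, o));
""")


def add_quad(g, s, p, o):
    con.execute("INSERT INTO quad VALUES (?, ?, ?, ?)", (g, s, p, o))


def reify(g, s, p, o, r):
    """Give an existing quad the reification IRI r."""
    con.execute("INSERT INTO quad_reif VALUES (?, ?, ?, ?, ?)", (g, s, p, o, r))


add_quad("g1", "bob", "foaf:knows", "alice")
add_quad("g1", "carol", "foaf:knows", "alice")   # never annotated
add_quad("g1", "alice", "foaf:name", "Alice")

reify("g1", "bob", "foaf:knows", "alice", "urn:example:r1")
# Statements about the quad are ordinary quads with R as their subject.
add_quad("meta", "urn:example:r1", "xx:assertion_time", "2009-09-01T10:00:00")

rows = con.execute("""
    SELECT k.s, t.o
    FROM quad AS k
    JOIN quad_reif AS x
      ON x.g = k.g AND x.s = k.s AND x.p = k.p AND x.o = k.o
    JOIN quad AS t ON t.s = x.r AND t.p = 'xx:assertion_time'
    JOIN quad AS n ON n.s = k.o AND n.p = 'foaf:name' AND n.o = 'Alice'
    WHERE k.p = 'foaf:knows' AND t.o > '2008-08-08'
""").fetchall()
print(rows)
```

The unannotated quad about carol costs nothing extra in storage and simply does not show up in the result.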
We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don't think there would be huge demand. Who knows? You tell us.
|
09/01/2009 10:44 GMT
|
Modified:
09/01/2009 11:20 GMT
|
More On Parallel RDF/Text Query Evaluation
[
Orri Erling
]
We have received some more questions about Virtuoso's parallel query evaluation model.
In answer, we will here explain how we do search engine style processing by writing SPARQL. There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce.
The point is that what used to require programming can often be done in a generic query language. The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit. But by combining these two things, we are a step closer to the web being the database.
I will here show how we do some joins combining full text, RDF conditions, and aggregates and ORDER BY. The sample task is finding the top 20 entities with New York in some attribute value. Then we specify the search further by only taking actors associated with New York. The results are returned in the order of a composite of entity rank and text match score.
The basic query is:
SELECT
  ( sql:s_sum_page
      ( <sql:vector_agg> ( <bif:vector> ( ?c1 , ?sm ) ),
        bif:vector ( 'new', 'york' )
      )
  ) AS ?res
WHERE
  {
    {
      SELECT
        ( <SHORT_OR_LONG::>(?s1) ) AS ?c1
        ( <sql:S_SUM>
            ( <SHORT_OR_LONG::IRI_RANK> ( ?s1 ) ,
              <SHORT_OR_LONG::> ( ?s1textp ) ,
              <SHORT_OR_LONG::> ( ?o1 ) ,
              ?sc
            )
        ) AS ?sm
      WHERE
        {
          ?s1 ?s1textp ?o1 .
          ?o1 bif:contains "new AND york"
          OPTION ( SCORE ?sc )
        }
      ORDER BY
        DESC
          ( <sql:sum_rank>
              (( <sql:S_SUM>
                   ( <SHORT_OR_LONG::IRI_RANK> ( ?s1 ) ,
                     <SHORT_OR_LONG::> ( ?s1textp ) ,
                     <SHORT_OR_LONG::> ( ?o1 ) ,
                     ?sc
                   )
              ))
          )
      LIMIT 20
    }
  }
This takes some explaining. The basic part is
  {
    ?s1 ?s1textp ?o1 .
    ?o1 bif:contains "new AND york"
    OPTION ( SCORE ?sc )
  }
This just makes tuples where ?s1 is the subject, ?s1textp the property, and ?o1 the literal which contains "New York". For a single ?s1, there can of course be many properties which all contain "New York".
The rest of the query gathers all the "New York" containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities.
After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between "New York" and the strings containing "New York". The text hit score is higher if the words repeat often and in close proximity.
The s_sum function is a user-defined aggregate which takes 4 arguments: The rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score.
These are grouped by the subject of the triple. After this, these are sorted by sum_score of the aggregate constructed with s_sum. The sum_score is a SQL function combining the entity rank with the text scores of the different literals.
This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple. The text index entries of an object are stored in the same partition as the object. But the entity rank is a property of the subject and is partitioned by the subject. Also the GROUP BY is by the subject. Thus the data is produced from all partitions, then streamed into the receiving partitions, determined by the subject. This partition can then get the score and group the matches by the subject. Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top k sort can be done for each partition separately. Finally, the top 20 of each partition are merged into the global top 20. This is then passed to a final function s_sum_page that turns this all into an XML fragment that can be processed with XSLT for inclusion on a web page.
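The flow described above can be mimicked in a few lines of Python. This is a toy sketch under stated assumptions, not the engine's implementation: the partitioning function, the sample hits and ranks, and the scoring rule (entity rank plus the sum of text scores) are all made up for illustration. What it does show is why per-partition top k followed by a merge gives the same answer as a global sort, as long as each group lives in exactly one partition.

```python
import heapq
from collections import defaultdict

N_PARTITIONS = 4
TOP_K = 3


def partition_of(key):
    # Toy partitioning function; a real system would hash the key.
    return sum(key.encode()) % N_PARTITIONS


def run(text_hits, rank_of, k=TOP_K):
    """text_hits: (subject, predicate, literal, text_score) tuples, as they
    would come out of the per-partition text index lookups."""
    # 1. Ship every hit to the partition that owns its subject.
    inbox = defaultdict(list)
    for hit in text_hits:
        inbox[partition_of(hit[0])].append(hit)

    local_tops = []
    for hits in inbox.values():
        # 2. Group by subject inside the partition. A subject never spans
        #    partitions, so partial aggregates need no merging.
        groups = defaultdict(list)
        for s, p, o, score in hits:
            groups[s].append((p, o, score))
        # 3. Assumed scoring rule: entity rank plus the sum of text scores.
        scored = [(rank_of[s] + sum(sc for _, _, sc in g), s)
                  for s, g in groups.items()]
        # 4. Local top k for this partition.
        local_tops.append(heapq.nlargest(k, scored))

    # 5. Merge the local top k lists into the global top k.
    return heapq.nlargest(k, (item for top in local_tops for item in top))


hits = [
    ("nyc", "label", "New York", 10),
    ("nyc", "abstract", "New York is a city", 5),
    ("nyt", "label", "The New York Times", 8),
    ("yankees", "label", "New York Yankees", 7),
    ("knicks", "label", "New York Knicks", 6),
    ("jfk", "comment", "airport in New York", 2),
]
rank_of = {"nyc": 9, "nyt": 6, "yankees": 4, "knicks": 3, "jfk": 1}

print(run(hits, rank_of))
```

Each partition only ever ships k candidates to the final merge, so the last step stays small no matter how many groups each partition holds.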
This differs from the text search engine in that the query pipeline can contain arbitrary cross-partition joins. Also, the string "New York" is a common label that occurs in many distinct entities. Thus one text match to one document, in this case a literal containing only the string "New York", will get many entities, likely all from different partitions.
So, if we only want actors with a mention of "New York", we need to get the inner part of the query as:
  {
    ?s1 ?s1textp ?o1 .
    ?o1 bif:contains "new AND york"
    OPTION ( SCORE ?sc ) .
    ?s1 a <http://umbel.org/umbel/sc/Actor>
  }
Whether an entity is an actor can be checked in the same partition as the rank of the entity. Thus the query plan gets this check right before getting the rank. This is natural since there is no point in getting the rank of something that is not an actor.
The <short_or_long::sql:func> notation means that we call func, which is a SQL stored procedure with the arguments in their internal form. Thus, if a variable bound to an IRI is passed, the short_or_long specifies that it is passed as its internal ID and is not converted into its text form. This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end.
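The benefit of passing internal IDs can be shown with a small Python sketch of late materialization. The ID-to-text dictionary, the example.org IRIs, and the synthetic scores are assumptions for illustration only. The ranking works purely on integer IDs, and the comparatively expensive ID-to-text conversion is done only for the rows that survive the top k.

```python
import heapq

# Hypothetical dictionary from internal IRI IDs to their text form.
IRI_TEXT = {i: f"http://example.org/entity/{i}" for i in range(1000)}
lookups = 0


def iri_to_text(iri_id):
    """Stand-in for the ID-to-text conversion; counts how often it runs."""
    global lookups
    lookups += 1
    return IRI_TEXT[iri_id]


def top_entities(scored_ids, k=20):
    """scored_ids: iterable of (score, iri_id) pairs using internal IDs."""
    best = heapq.nlargest(k, scored_ids)                    # IDs only
    return [(iri_to_text(i), score) for score, i in best]  # k conversions


# Synthetic scores: a fixed permutation of 0..999, so the run is repeatable.
scored = [((i * 37) % 1000, i) for i in range(1000)]
top = top_entities(scored)
print(len(top), lookups)
```

With 1000 candidates and k = 20, only 20 conversions take place; the other 980 IDs are never turned into text.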
Now, when we run this on a collection of 4.5 billion triples of linked data, once we have the working set, we can get the top 20 "New York" occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy. (The hardware is two boxes with two quad-core Xeon 5345 each.)
If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy. This gets about 200K "New York" strings, which becomes about 400K entities with New York somewhere, for which a rank then has to be retrieved. After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups. These are then sorted. This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort.
By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions. There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously. In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous.
Doing JOINs between partitions and partitioned GROUP BY/ORDER BY is pretty regular database stuff. Applying this to RDF is a most natural thing.
If we do not parallelize the user-defined aggregate for grouping all the "New York" occurrences, the query takes 8s instead of 1.1s. If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we'd have to either bring all the data to a central point before the top k, which would destroy performance, or we would have to write procedures with explicit parallel procedure calls, which are hard to write, surely too hard for ad hoc queries.
Results of live execution may not be complete on initial load, as this link includes a "Virtuoso Anytime" timeout of 10 seconds. Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above.
Engineering matters. If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints. If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.
|
08/19/2009 13:28 GMT
|
Modified:
08/19/2009 14:00 GMT
|
If we do not parallelize the user-defined aggregate for grouping all the "New York" occurrences, the query takes 8s instead of 1.1s. If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we'd have to either bring all the data to a central point before the top k, which would destroy performance, or use procedures with explicit parallel procedure calls, which are hard to write, surely too hard for ad hoc queries.
Results of live execution may not be complete on initial load, as this link includes a "Virtuoso Anytime" timeout of 10 seconds. Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above.
Engineering matters. If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints. If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.
|
08/19/2009 13:28 GMT
|
Modified:
08/19/2009 14:00 GMT
|
The URI, URL, and Linked Data Meme's Generic HTTP URI (Updated)
[
Kingsley Uyi Idehen
]
Situation Analysis
As the "Linked Data" meme has gained momentum you've more than likely been on the receiving end of dialog with Linked Open Data community members (myself included) that goes something like this:
"Do you have a URI", "Get yourself a URI", "Give me a de-referenceable URI" etc.
And each time, you respond with a URL -- which to the best of your Web knowledge is a bona fide URI. But to your utter confusion you are told: Nah! You gave me a Document URI instead of the URI of a real-world thing or object etc.
What's up with that?
Well, our everyday use of the Web is an unfortunate conflation of two distinct things that have Identity: Real World Objects (RWOs) & Address/Location of Documents (Information bearing Resources).
The "Linked Data" meme is about enhancing the Web by unobtrusively reintroducing its core essence: the generic HTTP URI, a vital piece of Web Architecture DNA. Basically, it's about realizing the full capabilities of the Web as a platform for Open Data Identification, Definition, Access, Storage, Representation, Presentation, and Integration.
What is a Real World Object?
People, Places, Music, Books, Cars, Ideas, Emotions etc.
What is a URI?
A Uniform Resource Identifier. A global identifier mechanism for network addressable data items. Its sole function is Name oriented Identification.
URI Generic Syntax
The constituent parts of a URI (from the URI Generic Syntax RFC) are the scheme, authority, path, query, and fragment.
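As a rough illustration of those parts, the snippet below splits the example URI used in RFC 3986 with Python's standard library. Note that Python labels the authority component `netloc`; the loop and formatting are just for display.

```python
from urllib.parse import urlsplit

# Example URI taken from RFC 3986, section 3.
parts = urlsplit("foo://example.com:8042/over/there?name=ferret#nose")

# Python calls the authority component "netloc".
for name in ("scheme", "netloc", "path", "query", "fragment"):
    print(f"{name:>8}: {getattr(parts, name)}")
```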
What is a URL?
A location oriented HTTP scheme based URI. The HTTP scheme introduces a powerful and inherent duality that delivers:
-
Resource Address/Location Identifier
-
Data Access mechanism for an Information bearing Resource (Document, File etc.)
So far so good!
What is an HTTP based URI?
The kind of URI Linked Data aficionados mean when they use the term: URI.
An HTTP URI is an HTTP scheme based URI. Unlike a URL, this kind of HTTP scheme URI is devoid of any Web Location orientation or specificity. Thus, its inherent duality provides a more powerful level of abstraction. Hence, you can use this form of URI to assign Names/Identifiers to Real World Objects (RWO). Even better, courtesy of the Identity/Address duality of the HTTP scheme, a single URI can deliver the following:
-
RWO Identifier/Name
-
RWO Metadata document Locator (courtesy of URL aspect)
-
Negotiable Representation of the Located Document (courtesy of HTTP's content negotiation feature).
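The third item, content negotiation, can be sketched as a small server-side function that picks a representation of the metadata document from the client's Accept header. This is a simplified illustration under stated assumptions: it only honors q-values and the `*/*` wildcard, and the function names and the two sample representations are hypothetical.

```python
def parse_accept(header):
    """Return the media types of an Accept header, most preferred first (simplified)."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        media_type, *params = [p.strip() for p in item.split(";")]
        q = 1.0  # default quality when no q parameter is given
        for param in params:
            if param.startswith("q="):
                q = float(param[2:])
        # Sort by descending q, then by the order in which the client listed the types.
        prefs.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(prefs)]


def negotiate(accept_header, available):
    """Pick the representation the client prefers, or None if nothing matches."""
    for media_type in parse_accept(accept_header):
        if media_type in available:
            return available[media_type]
        if media_type == "*/*" and available:
            return next(iter(available.values()))
    return None


# Hypothetical representations of one metadata document.
REPRESENTATIONS = {
    "text/html": "<html>...</html>",
    "text/turtle": "<#this> a <http://xmlns.com/foaf/0.1/Person> .",
}

print(negotiate("text/turtle, text/html;q=0.5", REPRESENTATIONS))
```

The same URI thus yields HTML for a browser and Turtle for a Linked Data client, depending only on what each one asks for.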
What is Metadata?
Data about Data. Put differently, data that describes other data in a structured manner.
How Do we Model Metadata?
The predominant model for metadata is the Entity-Attribute-Value + Classes & Relationships model (EAV/CR). A model that's been with us since the inception of modern computing (long before the Web).
What about RDF?
The Resource Description Framework (RDF) is a framework for describing Web addressable resources. In a nutshell, it's a framework for adding Metadata bearing Information Resources to the current Web. It comprises:
-
Entity-Attribute-Value (aka Subject-Predicate-Object) plus Classes & Relationships (Data Dictionaries e.g., OWL) metadata model
-
A plethora of instance data representation formats that include: RDFa (when doing so within (X)HTML docs), Turtle, N3, TriX, RDF/XML etc.
What's the Problem Today?
The ubiquitous use of the Web is primarily focused on a Linked Mesh of Information bearing Documents. URLs rather than generic HTTP URIs are the prime mechanism for Web tapestry; basically, we use URLs to conduct Information -- which is inherently subjective -- instead of using HTTP URIs to conduct "Raw Data" -- which is inherently objective.
Note: Information is "data in context", it isn't the same thing as "Raw Data". Thus, if we can link to Information via the Web, why shouldn't we be able to do the same for "Raw Data"?
How Does the Linked Data meme solve the problem?
The meme simply provides a set of guidelines (best practices) for producing Web architecture friendly metadata. Meaning: when producing EAV/CR model based metadata, endow Subjects, their Attributes, and Attribute Values (optionally) with HTTP URIs. By doing so, a new level of Link Abstraction on the Web is possible i.e., "Data Item to Data Item" level links (aka hyperdata links). Even better, when you de-reference a RWO hyperdata link you end up with a negotiated representation of its metadata.
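A minimal sketch of what this looks like as data: each statement is an Entity-Attribute-Value triple in which the Subject and Attribute are HTTP URIs, and the Value is either a literal or another HTTP URI. The `example.org` URIs and the helper names are hypothetical; only the shape of the data matters here.

```python
ALICE = "http://example.org/people/alice#this"  # hypothetical RWO URI
BOB = "http://example.org/people/bob#this"      # hypothetical RWO URI
KNOWS = "http://xmlns.com/foaf/0.1/knows"
NAME = "http://xmlns.com/foaf/0.1/name"

# Entity-Attribute-Value statements with HTTP URIs as names.
TRIPLES = [
    (ALICE, KNOWS, BOB),     # value is a URI: a data-item-to-data-item link
    (ALICE, NAME, "Alice"),  # value is a plain literal
    (BOB, NAME, "Bob"),
]


def is_http_uri(value):
    return value.startswith(("http://", "https://"))


def hyperdata_links(subject, triples=TRIPLES):
    """Return (attribute, target) pairs whose value is itself an HTTP URI."""
    return [(attr, value) for s, attr, value in triples
            if s == subject and is_http_uri(value)]


print(hyperdata_links(ALICE))
```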
Conclusion
Linked Data is ultimately about an HTTP URI for each item in the Data Organization Hierarchy :-)
Related
-
History of how "Resource" became part of URI - historic account by TimBL
-
Linked Data Design Issues Document - TimBL's initial Linked Data Guide
-
Linked Data Rules Simplified - My attempt at simplifying the Linked Data Meme without SPARQL & RDF distraction
-
Linked Data & Identity - another related post
-
The Linked Data Meme's Value Proposition
-
My Del.icio.us hosted Bookmark Data Space for Identity Schemes
-
TimBL's TED Talk re. "Raw Linked Data".
|
08/07/2009 14:34 GMT
|
Modified:
10/07/2009 08:02 GMT
|
Important Things to Note about the World Wide Web
[
Kingsley Uyi Idehen
]
Based on the prevalence of confusion re. the Linked Data meme, here are a few important points to remember about the World Wide Web.
- It's an HTTP based Network Cluster within the Internet (remember: Networks are about meshes of Nodes connected by Links)
- Its underlying data model is that of a Network (we've had Network Data models for eons. EAV/CR is an example)
- Links are facilitated via URIs
- Until recently the granularity of Networking on the Web was scoped to Data Containers (documents), due to the prevalence of URL style links
- The Linked Data meme adds Data Item (Datum) level granularity to World Wide Web networking via HTTP URIs
- Data Items become Web Reference-able when you Identify/Name them using HTTP based URIs
- An HTTP URI implicitly binds a Web Reference-able Data Item (Entity, Datum, Data Object, Resource) to its Web Accessible Metadata
- Web Accessible Metadata resides within Data Containers (documents or information resources)
- The representation of a Web Accessible Metadata container is negotiable
- I am able to write and dispatch this blog post courtesy of the Web features listed above
- You are able to explore the many dimensions to data exposed by this blog should you decide to explore the Linked Data mesh exposed by this post's HTTP URI (via its permalink)
The HTTP URI is the secret sauce of the Web that is powerfully and unobtrusively reintroduced via the Linked Data meme (classic back to the future act). This powerful sauce possesses a unique power courtesy of its inherent duality i.e., how it uniquely combines Data Item Identity (think keys in traditional DBMS parlance) with Data Access (e.g. access to negotiable representations of associated metadata).
As you can see, I've made no mention of RDF or SPARQL, and I can still articulate the inherent value of the "Linked Data" dimension that the "Linked Data" meme adds to the World Wide Web.
As per usual this post is a live demonstration of Linked Data (dog-food style) :-)
Related
|
07/23/2009 09:27 GMT
|
Modified:
07/23/2009 10:33 GMT
|
Library of Congress & Reasonable Linked Data
[
Kingsley Uyi Idehen
]
While exploring the Subject Headings Linked Data Space (LCSH) recently unveiled by the Library of Congress, I noticed that the URI for the subject heading: World Wide Web, exposes an "owl:sameAs" link to resource URI: "info:lc/authorities/sh95000541" -- in fact, a URI.URN that isn't HTTP protocol scheme based.
The observations above triggered a discussion thread on Twitter that involved: @edsu, @iand, and moi. Naturally, it morphed into a live demonstration of: human vs machine, interpretation of claims expressed in the RDF graph.
What makes this whole thing interesting?
It showcases (in Man vs Machine style) the issue of unambiguously discerning the meaning of the owl:sameAs claim expressed in the LCSH Linked Data Space.
Perspectives & Potential Confusion
From the Linked Data perspective, it may spook a few people to see owl:sameAs values such as: "info:lc/authorities/sh95000541", that cannot be de-referenced using HTTP.
It may confuse a few people or user agents that see URI de-referencing as not necessarily HTTP specific, thereby attempting to de-reference the URI.URN on the assumption that it's associated with a "handle system", for instance.
It may even confuse RDFizer / RDFization middleware that use owl:sameAs as a data provider attribution mechanism via hint/nudge URI values derived from original content / data URI.URLs that de-reference to nothing e.g., an original resource URI.URL plus "#this" which produces URI.URN-URL -- think of this pattern as "owl:shameAs" in a sense :-)
Unambiguously Discerning Meaning
Simply bring OWL reasoning (inference rules and reasoners) into the mix, thereby negating human dialogue about interpretation which ultimately unveils a mesh of orthogonal view points. Remember, OWL is all about infrastructure that ultimately enables you to express yourself clearly i.e., say what you mean, and mean what you say.
Path to Clarity (using Virtuoso, its in-built Sponger Middleware, and Inference Engine):
- GET the data into the Virtuoso Quad store -- what the sponger does via its URIBurner Service (while following designated predicates such as owl:sameAs in case they point to other mesh-able data sources)
- Query the data in Quad Store with "owl:sameAs" inference rules enabled
- Repeat the last step with the inference rules excluded.
Actual SPARQL Queries:
Observations:
The SPARQL queries against the Graph generated and automatically populated by the Sponger reveal -- without human intervention -- that: "info:lc/authorities/sh95000541", is just an alternative name for <http://id.loc.gov/authorities/sh95000541#concept>, and that the graph produced by LCSH is self-describing enough for an OWL reasoner to figure this all out courtesy of the owl:sameAs property :-).
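The difference between the two query runs can be illustrated with a small in-memory sketch. It is not how Virtuoso evaluates owl:sameAs, just a toy model of the effect: the graph contents, the `skos:prefLabel` triple, and the function names are assumptions for illustration, and the HTTP URI is assumed to be the LCSH concept URI discussed in this post.

```python
from collections import defaultdict

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

HTTP_URI = "http://id.loc.gov/authorities/sh95000541#concept"
INFO_URI = "info:lc/authorities/sh95000541"

# A tiny stand-in graph; the label triple is assumed for illustration.
GRAPH = [
    (HTTP_URI, SAME_AS, INFO_URI),
    (HTTP_URI, PREF_LABEL, "World Wide Web"),
]


def aliases(node, graph):
    """All names reachable from node through owl:sameAs, in either direction."""
    neighbours = defaultdict(set)
    for s, p, o in graph:
        if p == SAME_AS:
            neighbours[s].add(o)
            neighbours[o].add(s)
    seen, stack = {node}, [node]
    while stack:
        for other in neighbours[stack.pop()]:
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def describe(node, graph, same_as_inference):
    """Attribute/value pairs for node, optionally merged across its sameAs aliases."""
    names = aliases(node, graph) if same_as_inference else {node}
    return {(p, o) for s, p, o in graph if s in names and p != SAME_AS}


print(describe(INFO_URI, GRAPH, same_as_inference=True))
print(describe(INFO_URI, GRAPH, same_as_inference=False))
```

With inference enabled, asking about the info: name returns the statements made about the HTTP URI; with inference excluded, the same question returns nothing.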
Hopefully, this post also provides a simple example of how OWL facilitates "Reasonable Linked Data".
Related
|
05/05/2009 13:53 GMT
|
Modified:
05/06/2009 14:26 GMT
|
|
|