Details
OpenLink Software
Burlington, United States
Subscribe
Post Categories
Recent Articles
Community Member Blogs
Display Settings
Translate
|
Showing posts in all categories Refresh
European Commission and the Data Overflow
[
Orri Erling
]
The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.
Since the questionnaire is public, I am publishing my answers below.
-
Data and data types
-
What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?
Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.
This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.
Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.
The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable, This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.
Relevant sections of this mass of data are a potential addition to any present or future analytics application.
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.
Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.
By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.
-
What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?
All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.
Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
Interleaving of all database functions and types becomes increasingly important.
-
Industries, communities
-
Who is producing these data and why? Could they do it better? How?
Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).
Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
-
Who is consuming these data and why? Could they do it better? How?
Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird's eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.
Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.
-
What industrial sectors in Europe could become more competitive if they became much better at managing data?
Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.
-
Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?
The regulation landscape drives database demand through data retention requirements and the like.
With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.
For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.
-
What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.
We have come across the following:
- Knowing that the data exists in the first place.
- If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.
- Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data.
Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.
-
Services, software stacks, protocols, standards, benchmarks
-
What combinations of components are needed to deal with these problems?
Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction.
Interoperability and at least de facto standards in ways of doing this will emerge.
-
What data exchange and processing mechanisms will be needed to work across platforms and programming languages?
HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.
There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.
For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.
-
What data environments are today so wastefully messy that they would benefit from the development of standards?
RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.
Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.
Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example.
-
What kind of performance is expected or required of these systems? Who will measure it reliably? How?
Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.
The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.
These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.
We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.
The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.
Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. But the TPC-H rules do not allow official reporting of such.
Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.
If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.
Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.
Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.
-
Usability and training
-
How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?
In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.
Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned.
There is no single solution. For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.
Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.
For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.
For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.
This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.
One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.
This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.
-
How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?
For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.
RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.
A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.
For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.
The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.
As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.
If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.
The problem is creating the pull.
-
Challenges
-
What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?
The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.
Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, specially if the race is tight.
The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.
The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.
-
What should one do to set up such a challenge, administer, and monitor it?
The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.
The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.
There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.
Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.
|
10/27/2009 13:29 GMT
|
Modified:
10/27/2009 14:57 GMT
|
European Commission and the Data Overflow
[
Virtuso Data Space Bot
]
The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.
Since the questionnaire is public, I am publishing my answers below.
-
Data and data types
-
What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?
Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.
This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.
Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.
The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable, This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.
Relevant sections of this mass of data are a potential addition to any present or future analytics application.
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.
Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.
By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.
-
What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?
All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.
Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
Interleaving of all database functions and types becomes increasingly important.
-
Industries, communities
-
Who is producing these data and why? Could they do it better? How?
Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).
Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
-
Who is consuming these data and why? Could they do it better? How?
Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird's eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.
Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.
-
What industrial sectors in Europe could become more competitive if they became much better at managing data?
Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.
-
Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?
The regulation landscape drives database demand through data retention requirements and the like.
With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.
For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.
-
What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.
We have come across the following:
- Knowing that the data exists in the first place.
- If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.
- Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data.
Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.
-
Services, software stacks, protocols, standards, benchmarks
-
What combinations of components are needed to deal with these problems?
Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction.
Interoperability and at least de facto standards in ways of doing this will emerge.
-
What data exchange and processing mechanisms will be needed to work across platforms and programming languages?
HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.
There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.
For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.
-
What data environments are today so wastefully messy that they would benefit from the development of standards?
RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.
Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.
Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example.
-
What kind of performance is expected or required of these systems? Who will measure it reliably? How?
Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.
The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.
These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.
We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.
The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.
Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. But the TPC-H rules do not allow official reporting of such.
Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.
If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.
Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.
Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.
-
Usability and training
-
How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?
In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.
Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned.
There is no single solution. For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.
Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.
For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.
For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.
This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.
One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.
This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.
-
How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?
For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.
RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.
A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.
For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.
The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.
As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.
If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.
The problem is creating the pull.
-
Challenges
-
What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?
The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.
Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, specially if the race is tight.
The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.
The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.
-
What should one do to set up such a challenge, administer, and monitor it?
The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.
The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.
There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.
Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.
|
10/27/2009 13:29 GMT
|
Modified:
10/27/2009 14:57 GMT
|
VLDB 2009 TPC Workshop (3 of 5)
[
Orri Erling
]
Michael Stonebraker gave the keynote at the TPC workshop. His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself. From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention.
Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding optimization, as has pretty much everybody else.
It is true that the rules encourage unrealistic configurations. The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of data, just so there are enough disk arms in parallel. Stonebraker also pointed out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long. Benchmarks should therefore include replication.
Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites. Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases. They avoid them when can. They want arrays for physics, and graphs for biology and chemistry. MapReduce is eating database's lunch; what will you do about this?
I later suggested incorporating an RDF metadata benchmark into the TPC suite. We'll see about this; we'll first have to come up with a suitable one. There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover.
TPC's own talk was about the life cycle of benchmarks. A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon. When the solution to this problem becomes commonplace, the benchmark's relevance gradually drops.
There was a talk on robustness of query plans which was well to the point. Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins. Quite so. The analysis I really would have liked to see would have been one of what happens when passing from single server to a cluster, and from local joins to cross-partition ones. Also contrasting of cache fusion and partitioning. We have our own data and experience but we find we don't have time to measure all the other systems.
Anyway it is good to raise the question of smooth and predictable performance.
|
09/01/2009 11:51 GMT
|
Modified:
09/01/2009 17:32 GMT
|
VLDB 2009 TPC Workshop (3 of 5)
[
Virtuso Data Space Bot
]
Michael Stonebraker gave the keynote at the TPC workshop. His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself. From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention.
Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding optimization, as has pretty much everybody else.
It is true that the rules encourage unrealistic configurations. The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of data, just so there are enough disk arms in parallel. Stonebraker also pointed out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long. Benchmarks should therefore include replication.
Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites. Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases. They avoid them when can. They want arrays for physics, and graphs for biology and chemistry. MapReduce is eating database's lunch; what will you do about this?
I later suggested incorporating an RDF metadata benchmark into the TPC suite. We'll see about this; we'll first have to come up with a suitable one. There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover.
TPC's own talk was about the life cycle of benchmarks. A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon. When the solution to this problem becomes commonplace, the benchmark's relevance gradually drops.
There was a talk on robustness of query plans which was well to the point. Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins. Quite so. The analysis I really would have liked to see would have been one of what happens when passing from single server to a cluster, and from local joins to cross-partition ones. Also contrasting of cache fusion and partitioning. We have our own data and experience but we find we don't have time to measure all the other systems.
Anyway it is good to raise the question of smooth and predictable performance.
|
09/01/2009 11:51 GMT
|
Modified:
09/01/2009 17:32 GMT
|
Updated hardware improves LUBM 8000 load rate in Virtuoso 6
[
Orri Erling
]
We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.
The load rate is now 160,739 triples-per-second.
|
|
Virtuoso 6 (previous run) |
|
Virtuoso 6 (new run) |
|
Virtuoso 6 (newest run) |
| blades |
|
1 |
|
1 |
|
2 |
| processors |
|
2 x Xeon 5410 |
|
2 x Xeon 5520 |
|
2 x Xeon 5520 + 2 x Xeon 5410 with 1x1GigE interconnect |
| memory |
|
16G 667 MHz |
|
72G 1333 MHz |
|
72G 1333 MHz + 16G 667 MHz respectively |
reported load rate triples-per-second |
|
110,532 |
|
160,739 |
|
214,188 |
Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.
|
08/14/2009 19:01 GMT
|
Modified:
08/15/2009 15:27 GMT
|
Updated hardware improves LUBM 8000 load rate in Virtuoso 6
[
Virtuso Data Space Bot
]
We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.
The load rate is now 160,739 triples-per-second.
|
|
Virtuoso 6 (previous run) |
|
Virtuoso 6 (new run) |
|
Virtuoso 6 (newest run) |
| blades |
|
1 |
|
1 |
|
2 |
| processors |
|
2 x Xeon 5410 |
|
2 x Xeon 5520 |
|
2 x Xeon 5520 + 2 x Xeon 5410 with 1x1GigE interconnect |
| memory |
|
16G 667 MHz |
|
72G 1333 MHz |
|
72G 1333 MHz + 16G 667 MHz respectively |
reported load rate triples-per-second |
|
110,532 |
|
160,739 |
|
214,188 |
Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.
|
08/14/2009 19:01 GMT
|
Modified:
08/15/2009 15:27 GMT
|
Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000
[
Orri Erling
]
LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest Virtuoso.
The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.
The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.
The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.
We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.
|
06/29/2009 12:12 GMT
|
Modified:
08/15/2009 16:06 GMT
|
Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000
[
Virtuso Data Space Bot
]
LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest Virtuoso.
The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.
The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.
The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.
We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.
|
06/29/2009 12:12 GMT
|
Modified:
08/15/2009 16:06 GMT
|
Virtuoso RDF: A Getting Started Guide for the Developer
[
Orri Erling
]
It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.
The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.
You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best. But the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files.
To get started, cd into binsrc/tests/lubm.
To verify that this works, you can do
./test_server.sh virtuoso-t
This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.
Loading the Data
The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.
The data files themselves are in lubm_8000, 15 files in RDFXML.
There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.
So now let's go through this procedure.
Start the server:
$ virtuoso-t -f &
This starts the server in foreground mode, and puts it in the background of the shell.
Now we connect to it with the isql utility.
$ isql 1111 dba dba
This gives a SQL> prompt. The default username and password are both dba.
When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.
Next it calls this function for the lubm_8000 directory under the server's working directory.
sparql
CLEAR GRAPH <lubm>;
sparql
CLEAR GRAPH <inf>;
load_lubm ( server_root() || '/lubm_8000/' );
Then it verifies that the right number of triples is found in the <lubm> graph.
sparql
SELECT COUNT(*)
FROM <lubm>
WHERE { ?x ?y ?z } ;
The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.
Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.
sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
INSERT
INTO GRAPH <lubm>
{ ?x ub:subOrganizationOf ?z }
FROM <lubm>
WHERE { ?x ub:subOrganizationOf ?y .
?y ub:subOrganizationOf ?z .
};
Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.
DB.DBA.TTLP ( file_to_string ( 'inf.nt' ),
'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl',
'inf'
) ;
sparql
SELECT COUNT(*)
FROM <inf>
WHERE { ?x ?y ?z } ;
Then we declare that the triples in the <inf> graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the 'inft' rule set. Otherwise this has no effect.
rdfs_rule_set ('inft', 'inf');
This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.
checkpoint;
Now we are ready for querying.
Querying the Data
The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.
If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.
For example:
SQL> sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT *
FROM <lubm>
WHERE { ?x a ub:Publication .
?x ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>
};
VARCHAR
_______________________________________________________________________
http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5
6 Rows. -- 4 msec.
To stop the server, simply type shutdown; at the SQL> prompt.
If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like —
[HTTPServer]
ServerPort = 8421
ServerRoot = .
ServerThreads = 2
— to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).
Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.
To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF
To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.
|
12/17/2008 12:31 GMT
|
Modified:
12/17/2008 12:41 GMT
|
Virtuoso RDF: A Getting Started Guide for the Developer
[
Virtuso Data Space Bot
]
It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.
The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.
You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best. But the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files.
To get started, cd into binsrc/tests/lubm.
To verify that this works, you can do
./test_server.sh virtuoso-t
This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.
Loading the Data
The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.
The data files themselves are in lubm_8000, 15 files in RDFXML.
There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.
So now let's go through this procedure.
Start the server:
$ virtuoso-t -f &
This starts the server in foreground mode, and puts it in the background of the shell.
Now we connect to it with the isql utility.
$ isql 1111 dba dba
This gives a SQL> prompt. The default username and password are both dba.
When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.
Next it calls this function for the lubm_8000 directory under the server's working directory.
sparql
CLEAR GRAPH <lubm>;
sparql
CLEAR GRAPH <inf>;
load_lubm ( server_root() || '/lubm_8000/' );
Then it verifies that the right number of triples is found in the <lubm> graph.
sparql
SELECT COUNT(*)
FROM <lubm>
WHERE { ?x ?y ?z } ;
The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.
Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.
sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
INSERT
INTO GRAPH <lubm>
{ ?x ub:subOrganizationOf ?z }
FROM <lubm>
WHERE { ?x ub:subOrganizationOf ?y .
?y ub:subOrganizationOf ?z .
};
Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.
DB.DBA.TTLP ( file_to_string ( 'inf.nt' ),
'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl',
'inf'
) ;
sparql
SELECT COUNT(*)
FROM <inf>
WHERE { ?x ?y ?z } ;
Then we declare that the triples in the <inf> graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the 'inft' rule set. Otherwise this has no effect.
rdfs_rule_set ('inft', 'inf');
This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.
checkpoint;
Now we are ready for querying.
Querying the Data
The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.
If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.
For example:
SQL> sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT *
FROM <lubm>
WHERE { ?x a ub:Publication .
?x ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>
};
VARCHAR
_______________________________________________________________________
http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5
6 Rows. -- 4 msec.
To stop the server, simply type shutdown; at the SQL> prompt.
If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like —
[HTTPServer]
ServerPort = 8421
ServerRoot = .
ServerThreads = 2
— to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).
Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.
To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF
To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.
|
12/17/2008 12:31 GMT
|
Modified:
12/17/2008 12:41 GMT
|
|
|