Rethink Big and Europe's Position in Big Data [ Virtuoso Data Space Bot ]

Here I will take a break from core database topics and talk a bit about EU policies for research funding.

I had lunch with Stefan Manegold of CWI last week, where we talked about where European research should go. Stefan is involved in RETHINK big, a European research project for compiling policy advice regarding big data for EC funding agencies. As part of this, he is interviewing various stakeholders such as end user organizations and developers of technology.

RETHINK big wants to come up with a research agenda primarily for hardware, anything from faster networks to greener data centers. CWI represents software expertise in the consortium.

So, we went through a regular questionnaire about how we see the landscape. I will summarize it below, as it is informative in any case.

Core competence

My own core competence is in core database functionality, specifically in high performance query processing, scale-out, and managing schema-less data. Most of the Virtuoso installed base is in the RDF space, but most potential applications are in fact outside of this niche.

User challenges

The life sciences vertical is the one in which I have the most application insight, from going to Open PHACTS meetings and holding extensive conversations with domain specialists. We have users in many other verticals, from manufacturing to financial services, but there I do not have as much exposure to the actual applications.

Having said this, the challenges throughout tend to be in diversity of data. Every researcher has their MySQL database or spreadsheet, and there may not even be a top level catalogue of everything. Data formats are diverse. Some people use linked data (most commonly RDF) as a top level metadata format. The application data, such as gene sequences or microarray assays, reside in their native file formats and there is little point in RDF-izing these.

There are also public data resources that are published in RDF serializations as a vendor-neutral, self-describing format. Having everything as triples, without a priori schema, makes things easier to integrate and in some cases easier to describe and query.
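
To make that concrete, here is a minimal sketch of a query that joins two independently published datasets once both are exposed as triples. It is written as SPARQL submitted through Virtuoso's SQL interface (the SPARQL keyword in isql); the graph IRIs and predicates are made up for illustration and not taken from any real dataset.

      SPARQL
      SELECT ?gene ?assay
      WHERE
        {
          # triples published by one lab
          GRAPH <http://example.org/lab-a> { ?gene  <http://example.org/ns#symbol>   "TP53" } .
          # triples published by another, with no shared schema beyond the gene IRI
          GRAPH <http://example.org/lab-b> { ?assay <http://example.org/ns#measures> ?gene }
        };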

So, the challenge is in the labor-intensive nature of data integration. Data comes with different levels of quantity and quality, from hand-curated to NLP extractions. Querying in the single- or double-digit terabyte range with RDF is quite possible, as we have shown many times on this blog, but most use cases do not even go that far. Anyway, what we see in the field is primarily a data diversity game. The scenario is data integration; the technology we provide is the database. The data transformation proper, data cleansing, units of measure, entity de-duplication, and such core data-integration functions are performed using diverse, user-specific means.

Jerven Bolleman of the Swiss Institute of Bioinformatics is a user of ours with whom we have long standing discussions on the virtues of federated data and querying. I advised Stefan to go talk to him; he has fresh views about the volume challenges with unexpected usage patterns. Designing for performance is tough if the usage pattern is out of the blue, like correlating air humidity on the day of measurement with the presence of some genomic patterns. Building a warehouse just for that might not be the preferred choice, so the problem field is not exhausted. Generally, I’d go for warehousing though.

What technology would you like to have? Network or power efficiency?

OK. Even a fast network is a network. A set of processes on a single shared-memory box is also a kind of network. InfiniBand is maybe half the throughput and 3x the latency of single-threaded interprocess communication within one box. The operative word is latency. Building large systems always involves a network, or something very much like one in large scale-up scenarios.

On the software side, next to nobody understands latency and contention; yet these are the core factors in any pursuit of scalability. Because of this situation, paradigms like MapReduce and bulk synchronous parallel (BSP) processing have become popular because these take the communication out of the program flow, so the programmer cannot muck this up, as otherwise would happen with the inevitability of destiny. Of course, our beloved SQL or declarative query in general does give scalability in many tasks without programmer participation. Datalog has also been used as a means of shipping computation around, as in the work of Hellerstein.
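
As a trivial illustration of that last point (a generic sketch using the familiar TPC-H lineitem table), a plain declarative aggregation says nothing about partitioning, message passing, or thread counts; the engine is free to parallelize and distribute it however it sees fit:

      -- The query states only the result wanted; data placement, parallelism,
      -- and inter-node communication are left entirely to the database engine.
      SELECT   l_returnflag, COUNT(*), SUM(l_extendedprice)
      FROM     lineitem
      GROUP BY l_returnflag;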

There are no easy solutions. We have built scale-out conscious, vectorized extensions to SQL procedures where one can express complex parallel, distributed flows, but people do not use or understand these. These are very useful, even indispensable, but only on the inside, not as a programmer-facing construct. MapReduce and BSP are the limit of what a development culture will absorb. MapReduce and BSP do not hide the fact of distributed processing. What about things that do? Parallel, partitioned extensions to Fortran arrays? Functional languages? I think that all the obvious aids to parallel/distributed programming have been conceived of. No silver bullet; just hard work. And above all the discernment of what paradigm fits what problem. Since these are always changing, there is no finite set of rules, and no substitute for understanding and insight, and the latter are vanishingly scarce. "Paradigmatism," i.e., the belief that one particular programming model is a panacea outside of its original niche, is a common source of complexity and inefficiency. This is a common form of enthusiastic naïveté.

If you look at power efficiency, the clusters that are the easiest to program consist of relatively few high power machines and a fast network. A typical node size is 16+ cores and 256G or more RAM. Amazon has these in entirely workable configurations, as documented earlier on this blog. The leading edge in power efficiency is in a larger number of smaller units, which makes life again harder. This exacerbates latency and forces one to partition the data more often, whereas one can play with replication of key parts of the data more freely if the node size is larger.

One very specific item where research might help without having to rebuild the hardware stack would be better, lower-latency exposure of networks to software. Lightweight threads and user-space access, bypassing slow protocol stacks, etc. MPI has some of this, but maybe more could be done.

So, I will take a cluster of such 16-core, 256GB machines on a faster network, over a cluster of 1024 x 4G mobile phones connected via USB. Very selfish and unecological, but one has to stay alive and life is tough enough as is.

Are there pressures to adapt business models based on big data?

The transition from capex to opex may be approaching maturity, as there have been workable cloud configurations for the past couple of years. The EC2 from way back, with at best a 4 core 16G VM and a horrible network for $2/hr, is long gone. It remains the case that 4 months of 24x7 rent in the cloud equals the purchase price of physical hardware. So, for this to be economical long-term at scale, the average utilization should be about 10% of the peak, and peaks should not be on for more than 10% of the time.

So, database software should be rented by the hour. A 100-150% markup on the $2.80/hour that a large EC2 instance costs would be reasonable. Consider that 70% of the cost in TPC benchmarks is database software.
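
To make the arithmetic concrete: a 100-150% markup on a $2.80/hour instance would put the database software rent at roughly $2.80-$4.20/hour, for a combined hardware-plus-software cost in the neighborhood of $5.60-$7.00 per instance-hour.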

There will be different pricing models combining different up-front and per-usage costs, just as there are for clouds now. If the platform business goes that way and the market accepts this, then systems software will follow. Price/performance quotes should probably be expressed as speed/price/hour instead of speed/price.

The above is rather uncontroversial but there is no harm restating these facts. Reinforce often.

Well, the question is raised, what should Europe do that would have tangible impact in the next 5 years?

This is a harder question. There is some European business in wide area and mobile infrastructures. Competing against Huawei will keep them busy. Intel and Mellanox will continue making faster networks regardless of European policies. Intel will continue building denser compute nodes, e.g., integrated Knights Corner with dual IB network and 16G fast RAM on chip. Clouds will continue making these available on demand once the technology is in mass production.

What’s the next big innovation? Neuromorphic computing? Quantum computing? Maybe. For now, I’d just do more engineering along the core competence discussed above, with emphasis on good marketing and scalable execution. By this I mean trained people who know something about deployment. There is a huge training gap. In the would-be "Age of Data," knowledge of how things actually work and scale is near-absent. I have offered to do some courses on this to partners and public alike, but I need somebody to drive this show; I have other things to do.

I have been to many, many project review meetings, mostly as a project partner but also as reviewer. For the past year, the EC has used an innovation questionnaire at the end of the meetings. It is quite vague, and I don’t think it delivers much actionable intelligence.

What would deliver this would be a venture capital type activity, with well-developed networks and active participation in developing a business. The EC is not now set up to perform this role, though. But the EC is a fairly large and wealthy entity, so it could invest some money via this type of channel. Also there should be higher individual incentives and rewards for speed and excellence. Getting the next Horizon 2020 research grant may be good, but better exists. The grants are competitive enough and the calls are not bad; they follow the times.

In the projects I have seen, productization does get some attention, e.g., the LOD2 stack, but it is not something that is really ongoing or with dedicated commercial backing. It may also be that there is no market to justify such dedicated backing. Much of the RDF work has been "me, too" — let’s do what the real database and data integration people do, but let’s just do this with triples. Innovation? Well, I took the best of the real DB world and adapted this to RDF, which did produce a competent piece of work with broad applicability, extending outside RDF. Is there better than this? Well, some of the data integration work (e.g., LIMES) is not bad, and it might be picked up by some of the players that do this sort of thing in the broader world, e.g., Informatica, the DI suites of big DB vendors, Tamr, etc. I would not know if this in fact adds value to the non-RDF equivalents; I do not know the field well enough, but there could be a possibility.

The recent emphasis on benchmarking, spearheaded by Stefano Bertolo, is good, as exemplified by the LDBC FP7 project. There should probably be one or two projects of this sort going at all times. These make challenges known and are an effective means of guiding research, with a large multiplier: once a benchmark gets adopted, infinitely more work goes into solving the problem than went into stating it in the first place.

The aims and calls are good. The execution by projects is variable. For 1% of excellence, there apparently must be 99% of so-and-so, but this is just a fact of life and not specific to this context. The projects are rather diffuse. There is not a single outcome that gets all the effort. In this, the level of engagement of participants is lower, and focus much more scattered, than in startups. A really hungry, go-getter mood is mostly absent. I am a believer in core competence. Well, most people will agree that core competence is nice. But the projects I have seen do not drive for it hard enough.

It is hard to say exactly what kinds of incentives could be offered to encourage truly exceptional work. The American startup scene does offer high rewards and something of this could be transplanted into the EC project world. I would not know exactly what form this could take, though.

# PermaLink Comments [0]
06/29/2015 15:36 GMT
Virtuoso updated to version 7.2.1 [ Virtuoso Data Space Bot ]

We're pleased to announce that Virtuoso 7.2.1 is now available and includes various enhancements and bug fixes. Important additions include new support for xsd:boolean and TIMEZONE-less DATETIME and xsd:dateTime, and significantly improved compatibility with the Jena and Sesame frameworks; a brief sketch of the new datatype support follows the feature list below.

New product features as of June 24, 2015, v7.2.1, include:

  • Virtuoso Engine

    • Added support for TIMEZONE-less xsd:dateTime and DATETIME
    • Added support for xsd:boolean
    • Added new text index functions
    • Added better handling of HTTP status codes on SPARQL graph protocol endpoint
    • Added new cache for compiled regular expressions
    • Added support for expression in TOP/SKIP
  • SPARQL

    • Added support for SPARQL GROUPING SETS
    • Added support for SPARQL 1.1 EBV (Effective Boolean Value)
    • Added support for define input:with-fallback-graph_uri
    • Added support for define input:target-fallback-graph-uri
  • Jena & Sesame Compatibility

    • Added support for using rdf_insert_triple_c() to insert BNode data
    • Added support for returning xsd:boolean as true/false rather than 1/0
    • Added support for maxQueryTimeout in Sesame2 provider
  • JDBC Driver

    • Added new methods setLogFileName and getLogFileName
    • Added new attribute "logFileName" to VirtuosoDataSources for logging support
  • Faceted Browser

    • Added support for emitting HTML5+Microdata instead of RDFa as default HTML page
    • Added query optimizations
    • Added new footer icons to /describe page
  • Conductor and DAV

    • Added support for VAD dependency tree
    • Added support for default vdirs when creating new listeners
    • Added support for private RDF graphs
    • Added support for LDP in DAV API
    • Added option to create shared folder if not present
    • Added option to enable/disable DET graphs binding
    • Added option to set content length threshold for asynchronous sponging
    • Added folder option related to .TTL redirection
    • Added functions to edit turtle files
    • Added popup dialog to search for unknown prefixes
    • Added registry option to add missing prefixes for .TTL files
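
As a quick, hedged sketch of the new datatype support (the graph and predicate IRIs below are invented for illustration), a SPARQL filter can now compare against xsd:boolean and timezone-less xsd:dateTime literals directly. The query is written as it would be submitted through isql with the SPARQL keyword:

      SPARQL
      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
      SELECT ?doc
      WHERE
        {
          GRAPH <http://example.org/demo>
            {
              # ?flag is an xsd:boolean; ?ts is an xsd:dateTime stored without a timezone
              ?doc  <http://example.org/ns#published>  ?flag ;
                    <http://example.org/ns#modified>   ?ts .
              FILTER ( ?flag = "true"^^xsd:boolean
                       && ?ts > "2015-06-01T00:00:00"^^xsd:dateTime )
            }
        };
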
More details of the additions, fixes, and other changes in this update of both the Open Source and Commercial Editions may be found on the Virtuoso News page.
# PermaLink Comments [0]
06/29/2015 15:21 GMT
Announcement: UDA Release 7.0 Lite Edition ODBC Driver for Oracle [ UDA Data Space Bot ]

Today, we've updated the Lite Edition ODBC Driver for Oracle.

Installation and configuration take only minutes by following the documentation, which remains available anytime, specifically for this driver on Windows.

Release 7.0 licenses are also available for immediate purchase.

Client Platform Support

Release 7.0 installers are available for immediate download for Windows. Builds for Mac, Linux, and other Unix-like OS will be available soon; please contact us if you have urgent need.

Release 7.0 supports all 32-bit and 64-bit ODBC client tools and applications, both GUI and command-line, on —

Windows and Windows Server, on x86 and x86_64
  • Windows 8.x (x86, x86_64)

  • Windows 7.x (x86, x86_64)

  • Windows Vista (x86, x86_64)

  • Windows XP (x86, x86_64)

  • Windows Server 2012 R2 (x86_64)

  • Windows Server 2012 (x86_64)

  • Windows Server 2008 R2 (x86_64)

  • Windows Server 2008 (x86, x86_64)

  • Windows Server 2003 R2 (x86, x86_64)

  • Windows Server 2003 (x86, x86_64)

DBMS Version Support

The Release 7.0 Lite Edition ODBC Driver supports virtually every version of Oracle in current use, including —

  • Oracle 12c Release 1 (12.1.x)

  • Oracle 11g Release 2 (11.2.x)

  • Oracle 11g Release 1 (11.1.x)

  • Oracle 10g Release 2 (10.2.x)

  • Oracle 10g Release 1 (10.1.x)

  • Oracle 9i Release 2 (9.2.x)

Changes since Release 6.x

Additions

  • Support for Oracle 12c

  • Support for Windows 8 and Windows Server 2012

Fixes

  • Enhanced support for Oracle 11g

# PermaLink Comments [0]
06/23/2015 16:14 GMT Modified: 06/23/2015 16:56 GMT
Announcement: UDA Release 7.0 Express Edition ODBC Driver for Oracle [ UDA Data Space Bot ]

Today, we've updated the Express Edition ODBC Driver for Oracle.

Installation and configuration take only minutes by following the documentation, which remains available anytime, specifically for this driver on OS X and Windows.

Release 7.0 licenses are also available for immediate purchase.

Client Platform Support

Release 7.0 installers are available for immediate download for Mac and Windows. (The Express Edition is not typically produced for Linux and other Unix-like OS; please contact us if you have a specific need.)

Release 7.0 supports all 32-bit and 64-bit ODBC client tools and applications, both GUI and command-line, on —

OS X and OS X Server, on x86 and x86_64

  • Yosemite (10.10.x) (x86_64)

  • Mavericks (10.9.x) (x86_64)

  • Mountain Lion (10.8.x) (x86_64)

  • Lion (10.7.x) (x86_64)

Windows and Windows Server, on x86 and x86_64

  • Windows 8.x (x86, x86_64)

  • Windows 7.x (x86, x86_64)

  • Windows Vista (x86, x86_64)

  • Windows XP (x86, x86_64)

  • Windows Server 2012 R2 (x86_64)

  • Windows Server 2012 (x86_64)

  • Windows Server 2008 R2 (x86_64)

  • Windows Server 2008 (x86, x86_64)

  • Windows Server 2003 R2 (x86, x86_64)

  • Windows Server 2003 (x86, x86_64)

DBMS Version Support

The Release 7.0 Express Edition ODBC Driver supports virtually every version of Oracle in current use, including —

  • Oracle 12c Release 1 (12.1.x)

  • Oracle 11g Release 2 (11.2.x)

  • Oracle 11g Release 1 (11.1.x)

  • Oracle 10g Release 2 (10.2.x)

  • Oracle 10g Release 1 (10.1.x)

  • Oracle 9i Release 2 (9.2.x)

Changes since Release 6.x

Additions

  • Support for Oracle 12c

  • Support for OS X Yosemite, Windows 8, and Windows Server 2012

Fixes

  • Enhanced support for Oracle 11g

  • Enhanced support for OS X Mavericks

# PermaLink Comments [0]
06/23/2015 16:14 GMT Modified: 06/26/2015 10:41 GMT
Virtuoso Elastic Cluster Benchmarks AMI on Amazon EC2 [ Virtuoso Data Space Bot ]

We have another new Amazon machine image, this time for deploying your own Virtuoso Elastic Cluster on the cloud. The previous post gave a summary of running TPC-H on this image. This post is about what the AMI consists of and how to set it up.

Note: This AMI is running a pre-release build of Virtuoso 7.5, Commercial Edition. Features are subject to change, and this build is not licensed for any use other than the AMI-based benchmarking described herein.

There are two preconfigured cluster setups; one is for two (2) machines/instances and one is for four (4). Generation and loading of TPC-H data, as well as the benchmark run itself, is preconfigured, so you can do it by entering just a few commands. The whole sequence of doing a terabyte (1000G) scale TPC-H takes under two hours, with 30 minutes to generate the data, 35 minutes to load, and 35 minutes to do three benchmark runs. The 100G scale is several times faster still.

To experiment with this AMI, you will need a set of license files, one per machine/instance, which our Sales Team can provide.

Detailed instructions are on the AMI, in /home/ec2-user/cluster_instructions.txt, but the basic steps to get up and running are as follows:

  1. Instantiate machine image ami-811becea (AMI ID is subject to change; you should be able to find the latest by searching for "OpenLink Virtuoso Benchmarks" in "Community AMIs"; this one is short-named virtuoso-bench-cl) with two or four (2 or 4) R3.8xlarge instances within one virtual private cloud (VPC) and placement group. Make sure the VPC security is set to allow all connections.

  2. Log in to the first instance, and fill in the configuration file with the internal IP addresses of all machines instantiated in step 1.

  3. Distribute the license files to the instances, and start the OpenLink License Manager on each machine.

  4. Run 3 shell commands to set up the file systems and the Virtuoso configuration files.

  5. If you do not plan to run one of these benchmarks, you can simply start and work with the Virtuoso cluster now. It is ready for use with an empty database.

  6. Before running one of these benchmarks, generate the appropriate dataset with the dbgen.sh command.

  7. Bulk load the data with load.sh.

  8. Run the benchmark with run.sh.

Right now the cluster benchmarks are limited to TPC-H but cluster versions of the LDBC Social Network and Semantic Publishing benchmarks will follow soon.

# PermaLink Comments [0]
06/16/2015 17:53 GMT Modified: 06/17/2015 10:13 GMT
Announcement: UDA Release 7.0 Lite Edition ODBC Driver for Sybase and Microsoft SQL Server [ UDA Data Space Bot ]

In coming months, we'll be gradually shipping Release 7.0 of all our UDA drivers. This post will be the first of many, describing some of the fixes, changes, and improvements in each driver as they are made available.

Today, we have the Lite Edition ODBC Drivers for Sybase and Microsoft SQL Server.

Installation and configuration take only minutes by following the documentation, which remains available anytime, specifically for this driver on OS X and Windows.

Release 7.0 licenses are also available for immediate purchase.

Client Platform Support

Release 7.0 installers are available for immediate download for Mac and Windows. Builds for Linux and other Unix-like OS will be available soon; please contact us if you have urgent need.

Release 7.0 supports all 32-bit and 64-bit ODBC client tools and applications, both GUI and command-line, on —

OS X and OS X Server, on x86 and x86_64

  • Yosemite (10.10.x) (x86_64)

  • Mavericks (10.9.x) (x86_64)

  • Mountain Lion (10.8.x) (x86_64)

  • Lion (10.7.x) (x86_64)

Windows and Windows Server, on x86 and x86_64

  • Windows 8.x (x86, x86_64)

  • Windows 7.x (x86, x86_64)

  • Windows Vista (x86, x86_64)

  • Windows XP (x86, x86_64)

  • Windows Server 2012 R2 (x86_64)

  • Windows Server 2012 (x86_64)

  • Windows Server 2008 R2 (x86_64)

  • Windows Server 2008 (x86, x86_64)

  • Windows Server 2003 R2 (x86, x86_64)

  • Windows Server 2003 (x86, x86_64)

DBMS Version Support

The Release 7.0 Lite Edition ODBC Driver supports virtually every version of Microsoft SQL Server and Sybase Adaptive Server in current use, including —

  • Microsoft SQL Server 6.5

  • Microsoft SQL Server 7.0

  • Microsoft SQL Server 2000

  • Microsoft SQL Server 2005

  • Microsoft SQL Server 2008

  • Microsoft SQL Server 2012

  • Microsoft SQL Server 2014

  • Microsoft SQL Azure

  • Sybase SQL Server 4.x

  • Sybase SQL Server 10.x

  • Sybase SQL Server 11.x

  • Sybase Adaptive Server Enterprise (ASE) 11.x

  • Sybase Adaptive Server Enterprise (ASE) 12.x

  • Sybase Adaptive Server Enterprise (ASE) 15.x

  • Sybase SQL Anywhere 6.x

  • Sybase Adaptive Server Anywhere (ASA) 7.x

  • Sybase Adaptive Server Anywhere (ASA) 8.x

  • Sybase Adaptive Server Anywhere (ASA) 9.x

  • Sybase SQL Anywhere 10.x

  • Sybase SQL Anywhere 11.x

Changes since Release 6.x

Additions

  • added support for SPARSE columns in SQLColumns() call

    • added DSN options SHOWSPARSECOLS / ShowSparseCols and Multi-Tier connect option -X

    • details, based on this test table:

      CREATE TABLE tbl_sparse_test
        ( col1  INT SPARSE
        , col2  INT
        , col3  XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
        )

      • A wildcard query will return only col2 and col3; it will not include the SPARSE columns. This is standard SQL Server behavior, and it cannot be changed.

        SELECT *
          FROM tbl_sparse_test
          ;

        To include SPARSE columns in the results, they must be explicitly SELECTed:

        SELECT col1, col2, col3
          FROM tbl_sparse_test
          ;

      • By default, calls to SQLColumns() don't return SPARSE columns. To receive the full column list:

        • via our Lite Edition ODBC driver —

          1. Open the connection with SHOWSPARSECOLS in the DSN connection string, e.g., "DSN=TdsSQL;UID=sa;PWD=sa;SHOWSPARSECOLS=Y;"

          2. SQLColumns (hstmt, NULL, 0, NULL, 0, L"tbl_sparse_test", SQL_NTS, NULL, 0 );

        • via the Microsoft ODBC driver —

          1. SQLSetStmtAttr (hstmt, SQL_SOPT_SS_NAME_SCOPE, (SQLPOINTER)SQL_SS_NAME_SCOPE_EXTENDED, SQL_IS_SMALLINT);

          2. SQLColumns (hstmt, L"tempdb", SQL_NTS, L"dbo", SQL_NTS, L"tbl_sparse_test", SQL_NTS, NULL, 0 );

  • added support for new SQL Server datatypes such as datetime2

  • added support for NBCROW token

  • added support for Sybase 15

Fixes

  • fixed issue with SQL Server BIT datatype

  • fixed memory overwrite error, when DB procedure is called with SQL_PARAM_OUTPUT parameter of CHAR/VARCHAR/LONGVARCHAR

  • fixed issue with VARBINARY datatype and DB procedures

  • fixed issue with converting TIMESTAMP to CHAR/WCHAR

  • fixed datatype info in SQLGetTypeInfo -- new Sybase and MSSQL datatypes were added

  • fixed database catalog and query metadata info for Sybase 15's UNSIGNED INT, UNSIGNED SMALLINT, BIGINT, SYSNAME, LONGSYSNAME

# PermaLink Comments [0]
06/16/2015 17:43 GMT Modified: 06/23/2015 16:18 GMT
Why Do I Need To Pay For ODBC , JDBC, ADO.NET, OLE-DB Drivers? [ UDA Data Space Bot ]

Payment is a function of monetizing pain alleviation (i.e., opportunity cost).

This post is about highlighting the real pains behind the $0.00 misconception associated with Data Access Drivers: ODBC, JDBC, ADO.NET, OLE-DB, etc.

In the most basic sense, there are some fundamental aspects of data access that are complex to implement and rarely (if ever) implemented by free drivers. The list includes:

  1. Escape Syntaxes for Dates and Functions (see the sketch after this list)
  2. Metadata Calls, which enable smarter ODBC-compliant applications (this feature is typically missing on the driver side and abused on the client side, i.e., making clients DBMS-specific by testing for specific DBMS names)
  3. Scrollable Cursors: this is how you deal with change sensitivity; most drivers actually fake support and get away with it, due to the shortage of applications that test the proper cursor types (Static, Forward-Only, Keyset, Dynamic, and Mixed models).
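
A minimal sketch of what item 1 refers to: with ODBC escape syntax, the application writes portable date literals and scalar functions, and the driver translates them into each back-end's native SQL. The table and column names below are hypothetical.

      -- {d ...} is the ODBC date-literal escape; {fn ...} invokes an ODBC scalar function.
      -- A capable driver rewrites both into the target DBMS's own syntax.
      SELECT {fn UCASE(customer_name)}
        FROM customers
       WHERE order_date = {d '2015-06-11'};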

Okay, so we're done with actual driver sophistication re. implementation of critical features. Let's up the ante by veering into the area of security. At the most basic level, it's extremely important to understand that all data access driver types provide read-write access to your databases; thus, it's imperative that data access drivers address the following:

  1. Read-Only or Read-Write Access scoped to specific Users
  2. Ditto applied to specific User Groups
  3. Ditto applied to Database Names
  4. Ditto applied to specific ODBC compliant applications
  5. Ditto applied to specific ODBC host operating systems
  6. Ditto applied to specific IP addresses or Ranges on your Network
  7. Any combination of items 1-6 as part of a configurable data access rules/policy system.

Once you're done with security, you then have the thorny issue of data access and data flow management. In a nutshell, your driver needs to be able to handle:

  1. Protection against Cartesian product network flooding (e.g., a user clicks on the Customer table via an ODBC compliant application without comprehension of the back-end implications)
  2. Enabling or disabling of key DBMS engine data access optimization features (e.g., DBMS-specific extensions exposed via environment variables or SQL-command-based settings)
  3. Conditional connection pooling across the User, User Group, Application, Host Operating System, and IP Address dimensions.

Once you've dealt with security and data flow, you then have to address the enforcement of these settings across a myriad of ODBC compliant hosts, which is where ZeroConfig and centralized data access administration come into play, i.e., configure once (locally) and enforce globally.

When OpenLink Software entered the ODBC driver market segment in 1992, the issues above were the fundamental basis of our Multi-Tier Drivers. Thus, although we distinguished ourselves via performance, stability, and specification adherence, our fundamental engineering focus has always been skewed towards security and configurability, alongside high performance and scalability.

As we close 2009, the security issues that pervade native DBMS drivers (ODBC, JDBC, ADO.NET, OLE-DB, etc.) have only increased, courtesy of ubiquitous computing. Sadly, though, there remains a fundamental illusion that Data Access Drivers simply connect you to DBMS back-ends, and that since you can get these drivers at $0.00 from most DBMS vendors, they can't be that important.

I hope that this post brings some clarity to the very serious security and configuration management issues associated with Data Access Drivers. Free ODBC drivers offer nothing when it comes to the real issues of open data access. If they did, they wouldn't be worth $0.00!

Note: wondering if this has anything to do with Linked Data (my current data access focal point)? Well, remember, the Linked Data meme is fundamentally about REST-based open data access and integration via HTTP; thus, what applies to relational-table-oriented databases naturally applies to their more granular property-graph-oriented relatives. Basically, data access security never goes away; it just gets more granular, complex, and ultimately, mercurial.

# PermaLink Comments [0]
06/11/2015 17:18 GMT
In Hoc Signo Vinces (part 21 of n): Running TPC-H on Virtuoso Elastic Cluster on Amazon EC2 [ Virtuoso Data Space Bot ]

We have made an Amazon EC2 deployment of Virtuoso 7 Commercial Edition, configured to use the Elastic Cluster Module with TPC-H preconfigured, similar to the recently published OpenLink Virtuoso Benchmark AMI running the Open Source Edition. The details of the new Elastic Cluster AMI and steps to use it will be published in a forthcoming post. Here we will simply look at results of running TPC-H 100G scale on two machines, and 1000G scale on four machines. This shows how Virtuoso provides great performance on a cloud platform. The extremely fast bulk load — 33 minutes for a terabyte! — means that you can get straight to work even with on-demand infrastructure.

In the following, the Amazon instance type is R3.8xlarge, each with dual Xeon E5-2670 v2, 244 GB RAM, and 2 x 300 GB SSD. The image is based on Amazon Linux with built-in network optimization. We first tried a RedHat image without network optimization and had considerable trouble with the interconnect. Using network-optimized Amazon Linux images inside a virtual private cloud has resolved all these problems.

The network-optimized 10GE interconnect at Amazon offers throughput close to that of QDR InfiniBand running TCP/IP; thus the Amazon platform is suitable for running cluster databases. The execution that we have seen is not seriously network bound.

100G on 2 machines, with a total of 32 cores, 64 threads, 488 GB RAM, 4 x 300 GB SSD

Load time: 3m 52s
Run    Power        Throughput    Composite
 1     523,554.3    590,692.6     556,111.2
 2     565,353.3    642,503.0     602,694.9

1000G on 4 machines, with a total of 64 cores, 128 threads, 976 GB RAM, 8 x 300 GB SSD

Load time: 32m 47s
Run    Power        Throughput    Composite
 1     592,013.9    754,107.6     668,163.3
 2     896,564.1    828,265.4     861,738.4
 3     883,736.9    829,609.0     856,245.3

For the larger scale we did 3 sets of power + throughput tests to measure consistency of performance. By the TPC-H rules, the worst (first) score should be reported. Even after bulk load, this is markedly less than the next power score due to working set effects. This is seen to a lesser degree with the first throughput score also.
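
For reference, the TPC-H composite score (QphH@Size) is defined as the geometric mean of the Power and Throughput scores, which is consistent with the figures above:

      QphH@Size = sqrt( Power@Size * Throughput@Size )
                = sqrt( 523,554.3 * 590,692.6 )
                ≈ 556,111.2          (first 100G run)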

The numerical quantity summaries are available in a report.zip file, or individually.

Subsequent posts will explain how to deploy Virtuoso Elastic Clusters on AWS.

In Hoc Signo Vinces (TPC-H) Series

# PermaLink Comments [0]
06/10/2015 12:04 GMT Modified: 06/10/2015 12:49 GMT