Open source business intelligence player Pentaho has announced the availability of Pentaho Data Integration for Hadoop and the Pentaho BI Suite for Hadoop. Hadoop is the Apache software framework that supports data-intensive distributed applications under a free license.
Pentaho said its new products should help enterprises overcome the steep technical learning curve and the lack of skills and deployment options when it comes to data integration and BI for Hadoop.
"We've had this in beta for three months and the goal is to help make Hadoop easier for enterprise customers, more and more of which are turning to Hadoop to help with their 'big data' challenges," Richard Daley, founder and CEO of Pentaho, told CBR. "Whether companies are using Hadoop in the cloud such as with Amazon's Elastic MapReduce, or using Hadoop on their own premises, this saves developers from manual coding in Java and gives them easy to use BI and data integration out of the box."
Daley claimed that Hadoop has really picked up speed in the last year. The Apache project was created by Doug Cutting, who named it after his son's stuffed elephant, and was originally developed to support distribution for a search engine project. "We don't need to add any hype to the Hadoop project, it has a momentum all of its own and was actually brought to our attention by community members," Daley said.
Pentaho claims its new technology enables companies to more easily move data in and out of Hadoop; co-ordinate, execute and schedule Hadoop tasks in the context of existing ETL and BI workflows; design and execute scalable ETL jobs in Hadoop using the 200+ out-of-the-box ETL steps; and integrate with cloud-based distributions including Amazon Elastic MapReduce, Cloudera Distribution for Hadoop (CDH) and Apache Hadoop.
On the BI side, it is claimed to help firms perform production, operational and batch reporting against the full set of data in Hadoop using Hive; provide ad hoc reporting against data in Hadoop without knowledge of Hadoop or SQL; and spin off data marts for interactive analysis and dashboarding using Pentaho Agile BI.
One of the risks with Hadoop, of course, is that the various distributions from the likes of Amazon and Cloudera could lead to a 'forking' of the core platform. Yahoo, for instance, uses Hadoop for the Yahoo Search Webmap, which runs on a 10,000-core Linux cluster and produces data that is now used in every Yahoo web search. It runs a modified version of Hadoop, although it does contribute all the work it does on Hadoop back to the open-source community.
"The biggest risk is making sure that there are not too many versions of this," Daley said. "You have Amazon, Cloudera, Apache. We've got to make sure that we don't end up with a bunch of Hadoop versions." The concern for vendors is that there will be too many versions for their own offerings to support, while enterprises could become concerned that there is not a single, stable platform.
Pentaho also announced partnerships with Cloudera and Amazon to ensure compatibility with their Hadoop distributions, as well as with Impetus, an R&D and services outfit that has incorporated Pentaho Agile BI and the Pentaho BI Suite for Hadoop into its Large Data Analytics practice.