Taking the hard work out of Apache Hadoop



Why has IBM created its own distribution of Apache Hadoop and Apache Spark, and what makes it stand out from the competition? We asked Prasad Pandit, Program Director, Product Management, Hadoop and Open Analytics Systems, at IBM to give us a tour of the reference architecture for IBM Open Platform with Apache Hadoop.

 

Why did IBM decide to create its own Hadoop and Spark distribution, and why does it need a reference architecture?

The ability to collect, manage and analyze big data is one of the key tenets of the IBM cognitive business strategy, as well as being central to the Internet of Things. We see a lot of clients who are eager to adopt big data technologies such as Apache Hadoop and Apache Spark, but the proliferation of components within the Hadoop ecosystem makes it difficult to know how to take the first step. 

For example, a typical Hadoop deployment might involve as many as 20 interdependent open source components, each of which is developed separately by a different community. As a result, there are often compatibility issues among different versions of components. By creating our own distribution, we can test and certify each component and assure our clients that everything will work correctly, which provides peace of mind and lowers the barrier to entry.

Our distribution—the IBM Open Platform with Apache Hadoop—is a carefully curated selection of Hadoop ecosystem components that we have handpicked to support the widest possible range of big data use cases, and that we can support to ensure that they work together seamlessly. Our reference architecture helps us explain to clients which capabilities these components provide, how they fit together and what the basic configuration would be to get them up and running. It provides a reliable, low-risk starting point that clients can use to take their first steps with Hadoop and Spark.

If you compare the IBM Open Platform with some of the other leading vendors’ distributions, what sets it apart?

There are subtle differences between the distributions provided by each of the major vendors in this space. For example, some vendors replace some of the core open source components with their own proprietary technology, which means that once you have built your Hadoop environment on top of their products, you are locked in and it's difficult to change to a different distribution.

Other vendors go to the opposite extreme and offer nothing but the vanilla open source components. Their distribution is more generic, but they aren't adding much unique value beyond their technical support and services.

The IBM Open Platform with Apache Hadoop offers the best of both worlds. Every component within the reference architecture is 100 percent open source, so you can build a full-featured Hadoop environment with no concerns about lock-in. But with the IBM BigInsights portfolio, we also offer a range of sophisticated proprietary tools that can be used to augment the standard Hadoop ecosystem and provide advanced functionality on top. In addition, BigInsights and the IBM Open Platform with Apache Hadoop, together with other IBM solutions, form a solid foundation for each stage of the journey that customers take to transform themselves into cognitive businesses.

For example, Open Platform with Apache Hadoop contains open source tools such as Apache Phoenix, which enables high-performance SQL queries on top of Apache HBase. But if you want to do more than just queries, we also offer IBM Big SQL, which makes it easy to migrate your Oracle or IBM DB2 database custom development projects into a Hadoop context—an enterprise-class capability that most other vendors simply don’t offer.

A common perception exists that IBM is first and foremost a proprietary software vendor. And yet, here we see IBM making a major commitment to open source. Is this a change of direction for IBM?

Not really, because the issue is not black and white: both open source and proprietary software have their place. The important thing is to combine the two without reducing the client's freedom and flexibility to choose what is right for their use case, and that's what we're trying to provide with the IBM Open Platform with Apache Hadoop and BigInsights.

But it’s certainly true that IBM is making a significant commitment to the development of open source components within the Hadoop ecosystem. For example, we are one of the biggest contributors to the Spark codebase, and we’ve released IBM SystemML, our machine learning library for Spark, under a fully open source license. We’ve opened more than a dozen Spark labs worldwide, as well as our Spark Technology Center in San Francisco to foster innovation, and we’re committing more than 3,500 researchers and developers to work on Spark-related projects. 

That IBM sees Spark as a key technology for the future is well known in the industry, but is that the only example of IBM supporting the Hadoop ecosystem?

Not at all. Look at Apache Ambari, for example, a project that originally came out of Hortonworks. IBM has made important contributions to its development, too. It's a great example of a project that started out as a very basic management component and has blossomed into a much richer and more powerful solution. Ambari has already come a long way, and we're excited to see where the project is heading with features such as Ambari metrics and visualization, simplified security configurations and frameworks such as Ambari Views that allow developers to plug in UI components.

Learn more about IBM Open Platform.