Experimenting with Hadoop is now easier than ever before


Ayesha Zaka is a technical sales specialist at IBM who focuses on its big data portfolio in general and Apache Hadoop in particular. She works closely with organizations in all industries to provide technical enablement and help develop proofs of concept that reveal new ways for companies to profit from big data technologies. In this Q&A with Andrea Braida, she discusses her experience helping organizations with their big data analytics journeys.

Apache Hadoop can pose a notoriously complex technical challenge to IT teams who don’t already have the right expertise in house, which makes many companies wary of building and maintaining their own cluster. Moreover, many initial use cases for Hadoop revolve around building a data refinery or exploring and experimenting with big data to identify possible future areas of business value.

As a result, building a strong business case on the back of this type of experimental initiative or prototyping exercise can be challenging. What’s needed is a way to remove the effort of building a Hadoop cluster, eliminate the need for up-front investment and free data scientists and data engineers to play around with their data. And they need to be able to do so without fear of wasting time and money on unproductive research.

We spoke to Zaka about her experience helping organizations get over these initial hurdles, and about how IBM cloud data services offerings are making Hadoop adoption easier than ever before.

What are some real-world challenges customers face, for example, around accessing and integrating data? Why is expanding into Hadoop important, and what are some of the use cases available for Hadoop?

We’ve been working with a number of customers who are uncertain about exactly what needs to be analyzed. They need to explore the volumes of data they have gathered so that they can discover the parts that might be valuable.

Many of these companies have large volumes of dark data—data sets that are considered too large, time-sensitive, varied or unreliable to analyze using current methods. Even though such data sets may contain many useful tidbits of information, ingesting them into traditional data warehouse platforms, where storage capacity comes at a high price, often isn’t economically viable. As a result, this data tends to lie dormant, or may even get discarded without ever being explored or analyzed.

To solve this problem, we’re finding that our customers are looking to integrate their dark data into a big data platform that can accelerate their decision-making processes without overburdening the operations of their existing businesses. And that’s where Hadoop comes in.

For example, imagine you are a consumer products manufacturer or a retailer. You want to be able to analyze all the tweets and social media posts that are created each day, so that you can figure out what people are saying about your products and identify the key influencers within your target demographics.

If you can get this social media data into Hadoop and mine it, you might be the first to spot a gap in the market or identify a new trend. This positioning enables you to develop new products or launch new campaigns before your competitors even realize the opportunity exists.
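To make this concrete, here is a minimal PySpark sketch of what such a first experiment might look like, assuming the day's tweets have already been landed in Hadoop as JSON. The HDFS path, the product keyword and the field names (text, user, followers) are hypothetical placeholders rather than a real schema.

```python
# A minimal sketch of mining landed social media data with PySpark.
# The HDFS path, the product keyword, and the field names ("text",
# "user", "followers") are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("product-mentions").getOrCreate()

# Tweets assumed to be landed in Hadoop as one JSON object per line
tweets = spark.read.json("hdfs:///data/social/tweets/2016-06-01/")

mentions = (
    tweets
    .filter(F.lower(F.col("text")).contains("acme widget"))  # hypothetical product
    .groupBy("user")
    .agg(F.count("*").alias("mentions"),
         F.max("followers").alias("reach"))
    .orderBy(F.desc("reach"))
)

# High-reach accounts that mention the product often are candidate influencers
mentions.show(20)
```

Counting mentions per account and sorting by follower reach is a crude proxy for influence, but it is often enough to surface candidates worth a closer look in an exploratory project.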

Or perhaps you’re a healthcare organization. Monitoring systems in your hospitals’ intensive care units (ICUs) can capture millions of sensor readings every hour and alert physicians if a patient’s vital signs suddenly change. However, the data from these systems is typically stored for only a few hours and then discarded. So it’s only useful as an immediate snapshot of how the patient is doing at any given moment—the data is not used as a full record of the patient’s condition over time.

By ingesting this data into Hadoop and analyzing patterns within it, you might be able to identify subtle indications that a patient’s condition was beginning to deteriorate. You could then match these patterns to the live feed of data coming from your sensors, and detect when a patient is at risk of deterioration, days earlier than is possible with traditional techniques. Hadoop can literally help to save lives in the ICU.
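As an illustration of the idea, here is a hedged PySpark sketch that flags readings deviating sharply from a patient's recent baseline. The table layout (patient_id, ts, heart_rate), the path, and the simple deviation rule are illustrative assumptions only, not a clinical algorithm.

```python
# A hedged sketch of scanning historical ICU readings for values that
# deviate sharply from a patient's recent baseline. The layout
# (patient_id, ts, heart_rate), the path, and the threshold are
# illustrative assumptions, not a clinical algorithm.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("icu-trends").getOrCreate()

readings = spark.read.parquet("hdfs:///data/icu/vitals/")  # hypothetical path

# Baseline = the patient's average heart rate over the previous hour
hour = (Window.partitionBy("patient_id")
              .orderBy(F.col("ts").cast("long"))
              .rangeBetween(-3600, -1))

flagged = (
    readings
    .withColumn("baseline", F.avg("heart_rate").over(hour))
    .withColumn("deviation", F.col("heart_rate") - F.col("baseline"))
    .filter(F.abs(F.col("deviation")) > 25)  # arbitrary threshold for the sketch
)

flagged.select("patient_id", "ts", "heart_rate", "baseline", "deviation").show()
```

In practice the patterns mined offline like this would be turned into rules or models applied to the live sensor feed, which is where the early warning comes from.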

We know digital transformation is vital for business growth, but Gartner research this year tells us that the big data analytics community—whether comprising architects, data product managers or data scientists—is still working out what to do and how to do it. In your view, how does Hadoop as a Service (HaaS) support tackling these challenges?

A large percentage of consumers now want the companies and services they interact with to provide a more holistic digital experience. It doesn’t matter whether that experience comes through an app such as Uber, online customer service from a bank or an airline, or a more personalized shopping experience both online and in the store.

From a data perspective, companies that are in the process of embracing digital transformation generally sit at different stages of a data-value model.

We’re seeing a shift in the use of data toward self-service analytics and monetization through new business models. The traditional approach of leveraging data in operational systems and data warehouses is giving way to an accelerated focus on web and mobile applications that deepen customer engagement in the digital world.

Many companies today lie in the center of this value model. They include self-service analytics or insight-driven business models in some parts of their business, but these projects tend to be in silos and not ingrained in the business culture across the entire organization.

Hadoop can be a key tool in taking the next step by creating a platform for the aggregation, management, exploration and analysis of all the types of data that don’t fit into a traditional analytics landscape. This approach helps to complement existing data and analytics tools, and provides a comprehensive analytics hub that supports digital transformation by adding a new ability to extract value from dark data.

In particular, managed HaaS offerings can be an enabler and an accelerator of this kind of initiative. A major skills shortage exists around the Hadoop ecosystem, and many companies don’t have Hadoop experts on staff, which can make getting started, or managing the ecosystem as needs grow, very challenging.

Even among the largest enterprises, building up this kind of in-house capability can be a daunting task. We spoke to one bank that took more than a year to get up and running with its Hadoop cluster. In this bank’s case, the investment is worthwhile because the bank knows it’s going to be a heavy user of Hadoop across the organization.

Generally, however, an easier way exists to prove the investment is worthwhile without expending so much time and effort. When you adopt a managed service approach, your service provider simply spins up a Hadoop cluster of whatever size you need, and handles all the complexities around maintenance and upgrades. This approach means your data scientists and engineers can focus on what they’re good at—ingesting and analyzing data—and identifying new opportunities for business value.

“We’re seeing a shift in the use of data toward self-service analytics and monetization through new business models. The traditional approach of leveraging data in operational systems and data warehouses is giving way to an accelerated focus on web and mobile applications that deepen customer engagement in the digital world.” —Ayesha Zaka

One of the trends we’re seeing in the market today in data and analytics strategy is that IT and the line of business both participate in buying decisions and deployments. What have been your biggest learnings about this trend so far?

The increasing involvement of line-of-business teams in data and analytics solution design and procurement goes hand in hand with the trend toward business users becoming the owners of their data, and with their desire for self-service analytics capabilities.

We’ve seen this trend for years in the traditional analytics space, where users are increasingly empowered to generate their own reports and dashboards or run their own queries against structured data. Now, business users want the same capabilities in the big data space, and the trend is even more pronounced because big data and dark data haven’t traditionally been owned by IT—no legacy of IT control is in place to overcome.

Data scientists and developers want the latest big data tools for iterative prototyping and development and test environments. This desire means your IT teams need to keep up with the constant evolution of new tools including Hadoop, Apache Spark, Apache Kafka and other frameworks.

A place to sandbox and to quickly scale up or down in a few easy clicks is required. You need to avoid the hassle, delay and expense of setting up complicated clusters to answer questions, particularly when those questions are exploratory in nature, and no guarantee exists that the answers will be valuable for the business.

For that reason, we’re seeing increasing numbers of organizations that are ready to embrace cloud platforms for big data analytics. Their IT teams are able to get started quickly without the effort required to install and manage a Hadoop environment, and they can focus on serving the data to their stakeholders. 

“Data scientists and developers want the latest big data tools for iterative prototyping and development and test environments. A place to sandbox and quickly scale up or down in a few easy clicks is required. For that reason, we’re seeing increasing numbers of organizations that are ready to embrace cloud platforms for big data analytics.” —Ayesha Zaka 

Yes, and from a line-of-business perspective, the ability to rapidly prototype analytics tools and applications is a huge benefit too. That brings me to my next question: can you give some examples of how the big data community is using HaaS to fail fast and accelerate business?

Rapid prototyping is fast becoming a necessity for all successful companies that are data driven. You want the ability to quickly test out a hypothesis or explore new business models without too much overhead.

Taking a cloud-first, managed service approach helps with that need because even if a particular project doesn’t produce any useful results, you don’t have a lot of sunk costs. You didn’t spend months building a cluster, and you don’t have to continue paying for it after the project has finished. So the cost and risk of failure is minimized. This outcome makes the approval and execution of such experimental projects a lot easier.

As an example, we’ve been working with a global hotel chain to build a cloud platform for rapid prototyping, based on our managed service for Hadoop and IBM BigInsights on Cloud. The platform gives the chain’s data science community the ability to spin up new Hadoop environments with a few clicks, so teams can test hypotheses and prototype ideas before productizing. Each project can have its own environment within the cluster, and when a project is completed, the chain can scale the cluster back down again so it only pays for what it needs.

We have also been helping one of the leading food retailers in the US that wanted to start experimenting with big data without building a complex environment in house. By combining BigInsights on Cloud with other IBM cloud data services such as dashDB and Cloudant, the retailer now has a comprehensive platform that lets it focus quickly on the business problem at hand instead of investing time and effort in the underlying technology.

What are the scenarios that you expect to see going forward with integrated cloud data services?

The next wave of Internet of Things business models is being fueled by data from sensors, smart devices, cell phones and so on. All this data needs to be captured, monitored and processed in real time. Companies starting on their journey to capture and gain value from this data need to modernize their infrastructure. Instead of just running their systems in isolation, they need to build out a platform that services the entire organization, capturing and leveraging all this information at the right pace to meet ever-changing business requirements.

Hadoop as a managed service is a starting point to provide a cost-effective and simple way to get started with these types of use cases, and to grow as the needs of the organization evolve. Once you have grown comfortable with Hadoop, a world of opportunities opens up to combine it with other cloud data services to build truly innovative analytics-driven applications.

For example, we’ve been working on a scenario in which a retailer captures data from customers’ smartphones and stores it in a NoSQL database. The retailer mixes that data with structured sales data stored in an online data warehouse to quickly analyze which departments in its stores are not getting enough foot traffic.

This insight can then be sent to a data scientist, who can combine it with social media and other data in Hadoop to investigate the root causes of the problem. The data scientist can then determine actionable outcomes such as targeted promotions or even restructuring the layout of the store to improve sales.
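Here is a rough PySpark sketch of how the two sources might be brought together, assuming the foot-traffic events have been exported from the NoSQL store as JSON and the sales figures extracted from the warehouse as CSV. All paths, column names, and cutoffs are hypothetical.

```python
# A rough sketch of the retail scenario: foot-traffic events assumed to be
# exported from the NoSQL store as JSON, weekly sales extracted from the
# warehouse as CSV. All paths, column names, and cutoffs are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("foot-traffic").getOrCreate()

visits = spark.read.json("hdfs:///data/mobile/visits/")
sales = spark.read.csv("hdfs:///data/dw/sales_by_dept.csv",
                       header=True, inferSchema=True)

# Distinct devices seen per store department = a proxy for foot traffic
traffic = (visits
           .groupBy("store_id", "department")
           .agg(F.countDistinct("device_id").alias("visitors")))

# Departments that sell well overall but attract few in-store visitors
report = (
    traffic.join(sales, ["store_id", "department"])
    .withColumn("sales_per_visitor", F.col("weekly_sales") / F.col("visitors"))
    .filter(F.col("visitors") < 100)  # arbitrary low-traffic cutoff
    .orderBy(F.desc("sales_per_visitor"))
)
report.show()
```

The point of the cloud-first approach is that each source stays in the service best suited to it; the join happens wherever the analysis is running, rather than requiring everything to be migrated into one system first.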

These kinds of complex big data analytics use cases become easy to deliver when you take a cloud-first approach because these services can integrate with each other easily. And the data can be moved seamlessly into whichever environment is most appropriate for analyzing it. 

“Once you have grown comfortable with Hadoop, a world of opportunities opens up to combine it with other cloud data services to build truly innovative analytics-driven applications. Complex big data analytics use cases become easy to deliver when you take a cloud-first approach.” —Ayesha Zaka 

Companies that are taking a cloud-first approach certainly look to have a bright future. What would you recommend as next steps, if our readers want to find out more about HaaS?

You can visit our IBM BigInsights website to learn more about Hadoop offerings from IBM. In addition, BigInsights on Cloud Basic Plan is a great way to get started, whether you need an environment in which to rapidly prototype or you are just taking your first steps with Hadoop. It gives you instant access to an industry-standard, open source Hadoop cluster, and helps you start experimenting.

Once you’ve proven the value, you can then easily upgrade to one of our more advanced services, which opens up new possibilities with some of the unique proprietary tools from IBM, such as Big SQL. And by using BigInsights on Cloud through the IBM Bluemix platform, you can easily integrate Hadoop with our other cloud data services to help develop the kind of advanced analytics applications we’ve just discussed.

