Stream processing and the IBM Open Platform


Competitive advantage can depend on the ability to analyze streams of big data in real time, but a bewildering array of stream processing engines is available to choose from. We recently explored streaming alternatives with the assistance of Roger Rea and Jacques Roy at IBM.

Roger Rea is responsible for driving market share growth for IBM Streams and leading development, finance, marketing, sales, service and support. Previously, Rea has held a variety of educational, management, marketing, sales and technical positions at IBM, Skill Dynamics and Tivoli Systems.


Jacques Roy is a member of the IBM analytics competitive and product strategy team, specializing in big data streaming analytics and Apache Spark. During his more than 30 years in the industry, Roy has worked in many technology areas including application development, databases and operating systems. He is the author of multiple books including Streaming Analytics with IBM Streams: Analyze More, Act Faster, and Get Continuous Insights (Wiley, November 2015). Roy presents at many conferences including the IBM Insight at World of Watson event.

Both experts explained why the right choice of stream processing engine depends on the specific use case.

Analyzing streaming data for specific use cases 

Many of today’s business opportunities depend on the ability to analyze or process big data—not only large volumes of data, but also data that is produced at high velocity. For example, a consumer products company might want to analyze trending topics on social media the second they emerge, so that it can place appropriate ads to appeal to consumers’ current moods. Or a healthcare organization might want to capture and analyze incoming data from medical devices, such as electrocardiograms or insulin pumps, and process it to recognize patterns indicating when a patient is likely to suffer a heart attack or hypoglycemic episode. 

What these use cases have in common is the need to process large streams of incoming data quickly, in as close to real time as possible. In many cases, the data is also likely to be received in an unstructured or semistructured format, which may require transformation on the fly before it can be analyzed successfully. In computer science terms, the common way to handle such problems is known as stream processing, a paradigm that uses multiple parallel computation units to apply operations to each element of the data stream individually. The parallelization enables very rapid throughput of even large amounts of data, allowing high-velocity streams to be processed in real time or near-real time.
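To make the paradigm concrete, here is a toy sketch in Scala. It is purely illustrative, not drawn from any real engine: a few worker threads pull records from a shared queue and apply the same operation to each element independently, which is the per-element parallelism that production engines implement at far larger scale.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

object ToyStream {
  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[String]()
    val pool  = Executors.newFixedThreadPool(4)   // four parallel computation units

    for (_ <- 1 to 4) pool.execute(new Runnable {
      def run(): Unit = {
        var record = queue.poll(1, TimeUnit.SECONDS)
        while (record != null) {                  // drain until the stream goes quiet
          println(record.toUpperCase)             // the per-element operation
          record = queue.poll(1, TimeUnit.SECONDS)
        }
      }
    })

    // A stand-in for a high-velocity source such as a sensor feed
    Seq("sensor-a,21.5", "sensor-b,19.8", "sensor-a,22.1").foreach(r => queue.put(r))
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```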

Finding the right engine

As a result of the volume, velocity and variety of the data, a big data platform such as Apache Hadoop is typically required to ingest and store the data. However, MapReduce—Hadoop’s core processing engine—is not at all suitable for stream processing: it is too slow, and it is fundamentally designed to process data in large batches over long periods, not to handle individual transactions. 

In consequence, a number of other technologies have sprung up to solve the problem of stream processing in a Hadoop environment.

Currently, at least a dozen open source and commercial stream processing languages, services and technologies exist, and at least five top-level Apache projects provide stream processing capabilities.

| Project | Open source availability | Top-level Apache status granted | Contributors on GitHub (12/2016) | Defining features |
| --- | --- | --- | --- | --- |
| Spark Streaming | 2010* | February 2014* | 1,006* | Microbatching engine, large community |
| Flink | 2010 | December 2014 | 256 | Exactly-once guarantees, flexible windowing |
| Storm | 2011 | September 2014 | 234 | Real-time computation, use with any language |
| Apex | 2012 | April 2016 | 37 | Enterprise-grade batch and streaming engine |
| Samza | 2013 | January 2015 | 63 | Kafka between jobs, good state handling |

*Data for Spark; Spark Streaming is an extension of Spark, first released as an alpha distribution in 2013.

Many of these technologies take one of two approaches: microbatching, in which incoming records are grouped into small batches and then processed, or true stream processing, in which each record can be processed individually as soon as it arrives.

Microbatching

A well-known technology that uses the microbatching approach is Spark Streaming, an extension of the high-performance Apache Spark in-memory batch-processing engine. Spark is included in all Open Data Platform initiative (ODPi)-compliant Hadoop distributions, including the IBM Open Platform, which means that Spark Streaming is readily available to many mainstream Hadoop users.

Spark Streaming is easy to use and is well supported by the large and well-funded Spark community. Moreover, it requires little additional expertise for a programmer to learn how to convert a standard Spark batch processing job into a Spark Streaming microbatching job. As a result, Spark Streaming is an obvious choice for data practitioners who want to take their first steps into stream processing. 
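As an illustration of how little ceremony is involved, here is a minimal Spark Streaming sketch in Scala; the socket source, host, port and one-second batch interval are all placeholder choices, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountStream {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setAppName("WordCountStream").setMaster("local[2]")
    // Incoming records are grouped into one-second microbatches
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                // same API as batch Spark
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The transformations are the same flatMap, map and reduceByKey calls a batch Spark job would use; only the StreamingContext and the batch interval are new, which is why the conversion effort is small.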

Nevertheless, Spark Streaming may not be the right tool for the job on every occasion. For one thing, the microbatching approach is necessarily limited in terms of latency.

Spark Streaming’s smallest batching window is half a second, which means that if a particular use case requires results at a finer time resolution, Spark Streaming won’t be able to deliver them. For example, when using stream processing to monitor a manufacturing process requiring tight tolerances for conditions such as humidity, pressure or temperature, the ability to react within milliseconds to maintain appropriate values has a big impact on the final product’s quality. In such cases, latency can be a significant threat to the effectiveness of the entire monitoring infrastructure.

A second limitation comes from the batching process itself. If you have a relatively low volume of incoming records, you can meet performance requirements by processing a batch of records instead of processing each record individually. However, as the volume rises, you will reach a point (typically around 100,000 records per second) where the performance advantages are outweighed by the overhead of creating the batches themselves. This trade-off puts a limit on the scalability of the solution.

Scalability can also be limited by the nature of Spark’s core architecture. Spark works by creating resilient distributed datasets (RDDs) and applying functions to transform the data they contain. Each time a function operates on an RDD, the output is a new RDD, and many Spark jobs involve creating a series of intermediate RDDs to produce an end result. To make it easier to reason about the lineage of a data set that has passed through one or more transformations, RDDs are designed to be immutable: once created, they cannot be updated. If you need to change something about an RDD, you have to make a new copy of it.

This immutability can be a problem if your streaming process needs to maintain state, that is, to keep track of variables that change while the stream is being processed. Imagine you are monitoring a network of one million sensors. At each interval, you get new state information for up to 10,000 sensors. Although you only need to update one percent of the records during each interval, you have to copy the entire RDD, so you still pay 100 percent of the cost. In such cases, the performance overhead can act as a limiting factor, especially when the number of variables becomes large. And although the Spark project team has implemented new techniques, such as the mapWithState function, to alleviate this problem, they do not completely solve it: immutable RDDs remain a core foundation of Spark.
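As a rough sketch of the technique, the mapWithState API (available in Spark 1.6 and later) updates only the keys that appear in a given batch rather than rebuilding the full state set. The input stream and checkpoint path below are hypothetical, continuing the StreamingContext from the earlier sketch:

```scala
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Stateful streams require a checkpoint directory (placeholder path):
// ssc.checkpoint("/tmp/checkpoints")

// Hypothetical input: (sensorId, reading) pairs parsed from a socket source
val sensorReadings: DStream[(String, Double)] =
  ssc.socketTextStream("localhost", 9999).map { line =>
    val Array(id, v) = line.split(",")
    (id, v.toDouble)
  }

// Update only the keys seen in this batch; the state of the sensors that
// did not report is carried forward rather than copied wholesale
val trackCount = (sensorId: String, reading: Option[Double], state: State[Long]) => {
  val updated = state.getOption.getOrElse(0L) + 1
  state.update(updated)
  (sensorId, updated)
}

val runningCounts = sensorReadings.mapWithState(StateSpec.function(trackCount))
runningCounts.print()
```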

Stream processing

If microbatching with Spark Streaming isn’t always the best option for large-scale stream processing use cases, what are the options for stream processing with Hadoop? As is so often the case with the Hadoop ecosystem, many open source projects are available that can potentially provide an answer. Among them are Apache Apex, Apache Flink, Apache Samza and Apache Storm. 

Currently, Flink is probably the project that is getting the most attention. It makes handling event time easy, which enables you to process events in the order they occur, even if they don’t arrive at the processing engine in that order.

Event time is a concept that most streaming technologies can support, but they often require a lot of custom logic to do so. With Flink, event time is built into its core capabilities. 
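As a hedged sketch of what this looks like in Flink’s Scala API (circa the 1.x releases; the SensorEvent type and socket source are illustrative), a watermark assigner declares how late events may arrive, and windows are then evaluated on event time rather than arrival time:

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class SensorEvent(id: String, timestamp: Long, value: Double)

object EventTimeSum {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val events: DataStream[SensorEvent] = env
      .socketTextStream("localhost", 9999)               // placeholder source
      .map { line =>
        val Array(id, ts, v) = line.split(",")
        SensorEvent(id, ts.toLong, v.toDouble)
      }
      // Accept events arriving up to 5 seconds out of order
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[SensorEvent](Time.seconds(5)) {
          override def extractTimestamp(e: SensorEvent): Long = e.timestamp
        })

    // Windows close on event time, so results reflect when events occurred,
    // not when they happened to reach the engine
    events
      .keyBy(_.id)
      .timeWindow(Time.seconds(10))
      .reduce((a, b) => SensorEvent(a.id, b.timestamp, a.value + b.value))
      .print()

    env.execute("event-time-sum")
  }
}
```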

Flink also helps ensure that each event is processed exactly once, which makes it possible to maintain state reliably. It uses a lightweight distributed snapshot mechanism that periodically checkpoints the current state to provide fault tolerance in case of failure. And if needed, it can also be used for batch processing, which is treated simply as a special case of stream processing. Note, however, that Flink’s default configuration uses a 100 ms buffer to gather small batches of data before sending them across the network for processing. Setting this buffer to zero and transferring elements individually is possible, but the documentation warns that this approach can cause performance degradation.
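In code, both behaviors are single calls on the execution environment; this fragment continues the hypothetical env from the sketch above:

```scala
// Snapshot application state every 10 seconds; after a failure, Flink
// restores the last snapshot and replays, preserving exactly-once state
env.enableCheckpointing(10000)

// Network buffer timeout in milliseconds (100 is the default);
// 0 would flush every record individually at a throughput cost
env.setBufferTimeout(100)
```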

Making the leap to wide adoption

On the negative side, Flink has to compete for attention not only with Apex, Samza and Storm but also with Spark. Because it is a comparatively new project with a relatively small community, whether it can achieve widespread popularity or be adopted as a core project in major vendors’ Hadoop distributions is not yet clear.

To a greater or lesser extent, the same is true of many other open source options. For example, although Twitter originally championed Storm, it announced in 2015 that it had moved to a homegrown solution called Heron because of various technical concerns. Apex only became a top-level Apache project in April 2016, and Samza still seems relatively immature in terms of interoperability: it works only with YARN and Apache Kafka, and it supports only languages that compile to Java bytecode.

According to Geoffrey Moore’s technology marketing bible, Crossing the Chasm (HarperBusiness, August 2006), the most challenging transition for a new technology is to make the leap from early adoption by visionaries and enthusiasts to wider adoption by the more pragmatic majority. Clearly, Hadoop and Spark, for example, have both crossed this chasm and will be mainstream technologies for years to come. But for the other open source stream processing projects, we will have to wait and see if any of them gain enough traction to be considered a de facto standard.

Exploring other streaming data analysis options

In the meantime, IBM Streams is another option that offers real-time streaming capabilities and has been implemented in many large-scale enterprise use cases with Hadoop. Although Streams is neither open source nor part of the Hadoop ecosystem per se, it integrates seamlessly with both cloud and on-premises Hadoop clusters, including IBM BigInsights and other major distributions. IBM Streams can use YARN as a resource manager and can easily interact with Hadoop data stores and formats including Hadoop Distributed File System (HDFS), Apache HBase, Apache Hive and Apache Parquet.

IBM Streams is also fast and scalable. In one benchmark for a leading retailer, it performed 20 times better than Storm on the same hardware. By dramatically reducing hardware requirements, Streams can deliver savings that may be many times greater than its licensing costs. The capability comparison below shows how broadly the engines differ:

[Table: capability comparison of Apex, Flink, IBM Streams, Samza, Spark Streaming and Storm across open source; extensible platform; managed service; batch and streaming; command-line interface; web and JMX management; at-least-once and exactly-once guarantees; state; windows; back pressure; machine learning; model scoring; video and image; geospatial; text analytics; visual development; automated high availability; enterprise adapters; and open source adapters.]

Making the right choice

The option best suited to stream processing with Hadoop depends on your use case. If you have a relatively small volume of events to process, are not concerned about latency and have perhaps already developed a set of Spark batch jobs to reuse for near-real-time event processing, Spark Streaming might be the simplest option. On the other hand, if you need a more powerful, scalable system and are prepared to invest in widely implemented enterprise-class technology, Streams provides a real-time stream processing engine with full support and all the trimmings. Alternatively, emerging open source technologies such as Flink can offer the right combination of performance, scalability and cost-efficiency, provided you’re willing to rely on the community and potentially invest additional time and effort in development and support.

Learn more about IBM offerings for Hadoop and Spark. And stay tuned for additional articles about IBM Open Platform in future blog posts. 
