Data in the emerging world of stream processing


neha-narkhedeconfluent-co-foundercto-jpg.jpg

This guest post comes from Neha Narkhede, co-founder and CTO at Confluent, a startup focused on Apache Kafka and founded by its creators.

Data systems in the modern world aren’t islands that stand on their own; data often flows between databases, offline data stores and search systems, as well as to stream processing systems. But for a long time, data technology in companies was fairly homogeneous; data mostly resided in two popular locations: operational data stores and the data warehouse. And a substantial portion of data collection and processing that companies did ran as big batch jobs — CSV files dumped out of databases, log files collected at the end of the day, etc.

But businesses operate in real time and the software they run is catching up. Rather than processing data only at the end of the day, why not react to it continuously as the data arrives? This idea underpins the emerging world of stream processing.

Getting real (time)
The most obvious advantage of stream processing is its ability to move many analytical or reporting processes into real time. Stream processing applications treat data not as static tables or files, but as a never-ending infinite stream that goes from what happened in the past into what will happen in the future. In database terms, instead of running a query on data collected in the past, stream processing involves running data as it arrives through a query so the results are incrementally generated as a continuous operation.

The excitement around stream processing goes well beyond just faster analytics or reporting. What stream processing really enables is the ability to build a company’s business logic and applications around data that was previously available only in batch form, from the data warehouse, and to do that in a continuous fashion rather than once a day. For example, a retailer can analyze and report on their sales in real time, and also build core applications that re-order products, and adjust prices by region, in response to incoming sales data.

Does it stream?
But stream processing only becomes possible when the fundamental data capture is done in a streaming fashion; after all, you can’t process a daily batch of CSV dumps as a stream. This shift towards stream processing has driven the popularity of Apache Kafka. The adoption of Kafka has been remarkable. From the Silicon Valley tech crowd — the Ubers, AirBnBs, Netflixes, Ebays and Yahoos of the world — to retail, finance, healthcare, and telecom. For thousands of companies around the globe, Kafka has become a mission-critical cornerstone of their data architecture.

kafka-figure.png

Credit: Confluent

My own experience in this area came about while working at LinkedIn during its early days. Back in 2009, my co-workers and I created Apache Kafka to help LinkedIn collect all its data and make it available to the various products and systems built to process it. The idea was to provide the user with a real-time experience — after all, the website was used 24 hours a day, so there was no reason to process and analyze data only once a day. In the years that followed, we put Kafka into production at LinkedIn, running at increasingly large scale, and built out the rest of LinkedIn’s stream data platform. We poured into it a stream of data for everything happening in the company–every click, search, email, profile update, and so on. These days, Kafka at LinkedIn handles over a trillion updates per day.

Drinking it all in
This transformation towards stream data and processing at LinkedIn is relevant to every organization in any industry; streams are everywhere – be they streams of stock ticker data for finance companies, never-ending orders and shipments for retail companies, or user clicks for Web companies. Making all the organization’s data available centrally, as free-flowing streams, enables business logic to be represented as stream processing operations. This has a profound impact on what is now possible with all the data that was previously locked up in silos.

The same data that went into your offline data warehouse is now available for stream processing. All the data collected once is available for storage or access in the various databases, search indexes, and other systems in the company. Data to drive critical business decisions is available in continuous fashion versus once a day at midnight. Anomaly and threat detection, analytics, and response to failures can be done in real-time versus when it is too late. And all of this is possible by deploying a single platform at the heart of your datacenter, vastly simplifying your operational footprint.

At Confluent, we believe strongly that this new type of data architecture, centered around real-time streams and stream processing, will become ubiquitous in the years ahead.

This entry was posted in Big Data. Bookmark the permalink.