
Stream Processing and Streaming Analytics – How It Works

Summary:  Understanding the basic operating characteristics of Stream Processing and In-Stream Analytics will help clarify whether ESP is useful in your situation.

Recently we started exploring the basics of Event Stream Processing (ESP) in our article Stream Processing - What Is It and Who Needs It.  There we explained ESP capabilities, technologies, platforms, and business cases.  One more piece of information is needed to fully assess whether ESP will work for you: how it works.  Our discussion will be non-technical, an overview of the operating capabilities and functions that pretty much all ESP platforms offer.

ESP and DB Platform Basics

In this article, when we refer to ESP we'll be talking about an in-memory, special-purpose platform that sits at the edge of your system, where the signals or data first enter and before they move on to being persisted in storage.  That longer-term storage can be low-cost disk drives or the new in-memory DBs built on DRAM and/or SSD/Flash memory.  The presence of an in-memory DB for long-term storage is not a requirement of ESP, but your ESP platform will have enough in-memory storage to process the streaming data passing through.  So ESP platforms are themselves a type of in-memory DB, but that's not where your data will stay.

All the same, there are many major and mid-size providers, ranging from SAP, Oracle, IBM, and Microsoft to VoltDB, MemSQL, and Kognitio (plus many more), who are packaging ESP with their in-memory DB offerings.  You could also acquire an ESP-only platform from someone like SAS and link it to any number of DBs, including, most likely, the one you already have, to persist the data.

Some of these will require extensive programming knowledge to configure.  Others like SAS are known for their drag-and-drop user interfaces.

The data stream throughput you expect will dictate the size and capability of the ESP platform you create.  Toward the high end are systems sized to handle upwards of 5 million transactions per second flowing through to 100 terabytes of distributed storage.  Yours probably won’t be nearly that big.

Input – Processing - Output

The easiest way to understand ESP architecture is to see it as three layers or functions: input, processing, and output.

Input accepts virtually all types of time-based streaming data, and multiple input streams are common.  The main ESP processor then runs a variety of actions, called programs.  The results are passed to the subscriber interface, which can send alerts via human interfaces or trigger automated machine actions, and pass the data on to persistent stores and, importantly, to analytic stores.
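As a rough illustration, the three layers can be sketched in a few lines of Python. Every name and threshold here is invented for illustration; no vendor's API is implied.

```python
# Illustrative three-layer ESP sketch: input -> processing -> output.

def input_layer(raw_events):
    """Input: accept time-based events from one or more streams."""
    for event in raw_events:
        yield event

def processing_layer(events, threshold):
    """Processing: a 'program' that flags events above a threshold."""
    for event in events:
        if event["value"] > threshold:
            yield {**event, "alert": True}

def output_layer(events):
    """Output: subscriber interface; here we just collect the alerts
    (a real system might email staff or write to an analytic store)."""
    return list(events)

stream = [{"t": 1, "value": 10}, {"t": 2, "value": 95}, {"t": 3, "value": 40}]
alerts = output_layer(processing_layer(input_layer(stream), threshold=90))
```

Only the event at t=2 survives the processing layer, so `alerts` holds a single flagged record.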

One Technical Issue

There are three paradigms for data handling in ESP, and you will hear claims and counterclaims for all three.  They are:

Atomic:  (aka one-key-value-at-a-time) Processes each inbound data event as a separate element.  This seems logical but is the most computationally expensive design.  It is used, for example, to guarantee the fastest processing of individual events with the least delay in transmitting each event to the subscriber.  It is often seen for customer transactional inputs, so that if some element of an event fails, the record is not deleted but moved to a bad-record file that can be processed further later.  Apache Storm uses this paradigm.
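A minimal sketch of the atomic paradigm, including the bad-record handling described above. The names and logic are illustrative only, not Storm's API.

```python
# Atomic (one-event-at-a-time) handling: each inbound event is
# processed individually; a failing event is routed to a bad-record
# list for later reprocessing rather than being discarded.

def process_event(event):
    if "value" not in event:
        raise ValueError("missing value")
    return event["value"] * 2  # stand-in for real per-event work

good, bad = [], []
for event in [{"value": 3}, {"broken": True}, {"value": 5}]:
    try:
        good.append(process_event(event))
    except ValueError:
        bad.append(event)  # kept in a bad-record file, not deleted
```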

Micro batching:  Yes, these are batches of events, but typically ones that occur within only a few milliseconds of one another.  You can adjust the time window.  This makes the process somewhat more efficient.  Spark Streaming uses this paradigm.

Windowing:  Similar to batching, windowing allows the design of micro batches that may be simple time-based batches, but it also allows many more sophisticated interpretations, such as sliding windows (e.g., everything that occurred in the last X period of time).  This can be very useful for aggregating events or spotting outliers relative to averages or standard deviations.
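A sliding window like the one just described can be sketched as follows. This is a simplified illustration, not any platform's implementation.

```python
# Sliding time window: keep only events from the last `span` seconds
# and compute aggregates over whatever is currently in the window.
from collections import deque

class SlidingWindow:
    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, ts, value):
        self.events.append((ts, value))
        # Evict anything older than the window span.
        while self.events and ts - self.events[0][0] > self.span:
            self.events.popleft()

    def mean(self):
        values = [v for _, v in self.events]
        return sum(values) / len(values)

w = SlidingWindow(span_seconds=60)
for ts, v in [(0, 10), (30, 20), (90, 60)]:
    w.add(ts, v)
# By t=90 the t=0 reading has aged out of the 60-second window.
```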

There’s no particular winner here as they all have their sweet spot.  There is some difference in latency but the need for low latency is very much dependent on the specific use case and volume.

The Inner Workings

Lest you think that ESP is a system that just takes a quick glance at the data and passes it through, there are quite a number of neat tricks that can occur inside the ESP box.  We’ll use a hospital health monitoring example in which we want real time alerts to specific conditions of our patients.

We start with the hallmark of ESP, the Continuous Query, pulling in multiple data streams from various sensor-enabled devices monitoring things like pulse, respiration, blood ox, and temperature.  These don't need to be transmitting at the same intervals.  Some may report every 10 milliseconds, some just once per second or even less often.  ESP will match them up.
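One simple way to picture how ESP "matches up" streams arriving at different intervals is to hold the latest reading from each sensor and emit a combined snapshot once every stream has reported. The field names and logic below are purely illustrative.

```python
# Match up streams that report at different rates by keeping the
# latest reading from each sensor.
latest = {}

def on_reading(sensor, ts, value):
    latest[sensor] = value
    # Emit a combined snapshot once every expected stream has reported.
    if {"pulse", "temp"} <= latest.keys():
        return {"ts": ts, **latest}
    return None

snapshots = []
readings = [("pulse", 0.01, 72), ("pulse", 0.02, 74), ("temp", 1.0, 37.1)]
for sensor, ts, value in readings:
    snap = on_reading(sensor, ts, value)
    if snap:
        snapshots.append(snap)
```

The fast pulse stream updates twice before the slower temperature stream reports once; only then is a joined record available.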

Importantly, note that we are also bringing in static data about the patient, for example the multiple results from recent blood tests, which are likely updated only daily.

Sensor data can be notoriously dirty, with missing time signatures and the like.  Much of the cleansing and transformation can occur directly in ESP, much as in master data management (MDM).  And the various data streams can be joined for analysis.

To minimize latency, we may wish to run multiple analytic routines on the data, so we first copy the stream so that the routines can run on multiple tracks simultaneously.

The steps labeled ‘last one minute’ and ‘last one hour’ are sliding time windows so that we can compare what is happening now with what has happened recently.  These windows can be much longer, even weeks or months and will persist that data within the ESP memory.  The aggregation and summary steps allow us to calculate values such as min/max, average, or standard deviation on the fly.

These are then brought together in a join and compare operation that embeds rules that tell ESP when to notify the medical staff that attention is required.

To illustrate even more sophisticated decisioning within ESP we’ve added two other parallel analytic steps.  The first is pattern matching.  In this context pattern matching is like business rules for medicine.  For example if variable X is above a certain value and variable Y is below a certain value and variable Z is outside a certain range, then send an alert.
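The pattern-matching step can be sketched as plain business rules. The thresholds below are made up for illustration and are not medical guidance.

```python
# Pattern matching as business rules: alert when X is above a value,
# Y is below a value, and Z is outside a range (thresholds invented).
def needs_alert(reading):
    return (reading["pulse"] > 120
            and reading["blood_ox"] < 0.90
            and not 36.0 <= reading["temp"] <= 38.0)

critical = {"pulse": 130, "blood_ox": 0.85, "temp": 39.2}
normal = {"pulse": 72, "blood_ox": 0.98, "temp": 36.8}
```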

Finally, and perhaps the most powerful capability of all is to take a predictive model developed externally on an analytic DB and move it into ESP.  This might be a scoring model looking at dozens of variables simultaneously and developing a score previously determined to require attention.
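To picture moving an externally developed model into ESP, imagine a logistic scoring model whose weights were fit offline in the analytic store; the streaming side only applies them to each event. The weights and variables below are invented for illustration.

```python
# Apply a predictive model developed externally: the coefficients were
# fit offline; ESP just scores each incoming event on the fly.
import math

# Pretend these came from an offline logistic regression.
WEIGHTS = {"pulse": 0.03, "temp": 0.5}
BIAS = -22.0

def score(event):
    z = BIAS + sum(WEIGHTS[k] * event[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))  # probability attention is needed

risk = score({"pulse": 130, "temp": 39.0})
```

A high score, previously determined offline to require attention, would trigger the same notification path as the rules above.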

The key thing to understand about what is called 'streaming analytics' is that it applies analytics to the stream of data, but the insights and rules used had to be developed external to ESP, in a good old-fashioned advanced analytics workspace.

Updating those analytic insights can be quick but not real time and not within the scope of ESP.

Here’s a quick rundown on the different actions you can take within ESP:

  • Compute
  • Copy, to establish multiple processing paths – each with a different retention period of, say, 5 to 15 minutes
  • Aggregate
  • Counter
  • Filter – allows you to keep only the data from the stream that is useful and discard the rest, greatly reducing storage.
  • Function (transform)
  • Join
  • Notification – email, text, or multimedia
  • Pattern (detection) – specify events of interest (EOIs)
  • Procedural – apply an advanced predictive model
  • Text context – could detect, for example, Tweet patterns of interest
  • Text sentiment – can monitor for positive or negative sentiment in a social media stream
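As one concrete example from the list above, the Filter action amounts to keeping only the useful events before anything is persisted. The field names are illustrative.

```python
# Filter action: keep only useful events from the stream and discard
# the rest, greatly reducing what must be stored downstream.
stream = [{"id": 1, "value": 0}, {"id": 2, "value": 57}, {"id": 3, "value": 0}]
kept = [e for e in stream if e["value"] > 0]
```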

Do you always need ESP when dealing with streaming data?  Technically no.  You could stream the data directly into a persistent store like Hadoop.  If the frequency of the data is low, or your real-time requirements are relaxed enough, you could then conduct similar types of analysis using batch jobs.

But for ecommerce, financial trading, fraud detection, and many types of customer experience interactions where your response must be essentially immediate to be effective, Event Stream Processing is your best tool.


October 29, 2015

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.

About the author:  Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  Bill is also Editorial Director for Data Science Central.  He can be reached at:

[email protected] or [email protected]





Comment by Sione Palu on May 30, 2016 at 1:45pm

Since streaming data is temporal, Signal Processing algorithms or their variants have been adopted in streaming data analytics. Some examples here are:

The use of Wavelet & Kalman Filter in text-mining & topic modelling:

1) "Dynamic Topic Models", by D. Blei, et al.

2) "Twitter event detection: combining wavelet analysis and topic inference summarization" by M. Cordeiro

3) "Fast Approximate Wavelet Tracking on Streams" by Graham Cormode, et al.

and many more that have been published in the literature. This is no surprise since Signal Processing is the analytics or the modelling of evolution of temporal-based data.

Comment by William Vorhies on November 11, 2015 at 3:45pm

Not that I'm specifically aware of.  You may have to go to each developer to read more.  For example see SAS Event Stream Processing and also some whitepapers at VoltDB.

Comment by Kening Ren on November 5, 2015 at 7:41am
This is very enlightening. Thank you. Is there any book or tech references which we can take a look to get more info?
