
Stream Processing – What Is It and Who Needs It

Summary:  Stream Processing and In-Stream Analytics are two rapidly emerging and widely misunderstood data science technologies.  In this article we’ll focus on their basic characteristics and some business cases where they are useful.

There are five relatively new technologies in data science that are getting a lot of hype and generating a lot of confusion in the process.  They are:

  1. Stream Processing
  2. In-Stream Analytics
  3. Real-Time Analytics
  4. In-Database Analytics, and
  5. In-Memory Analytics

 

Gartner displays these as only three fast-rising trends, but in the literature today you will see all five.  They are not simple to sort out, but over this article and probably the next several we’ll try to help you understand what they’re good for, how they work, and, importantly, what they won’t do.

Let’s start with ‘Stream Processing’ and ‘In-Stream Analytics’.  The full formal name for this technology is Event Stream Processing (ESP) so we’ll use that shorthand here.

As you can tell from the name, the first requirement is that there is a stream of data.  Almost always this means time series data: events that happen in sequence, each denoted by a specific time, such as a string of sensor readings in an IoT application, or trigger events (also denoted by time) such as your customer’s mobile device being detected by your Wi-Fi system, indicating that he’s close by.

ESP is a ‘real time’ processing technique.  So two things should be immediately evident: 1) the events you want to track should happen frequently and probably close together in time, and 2) there must be an important business reason for detecting and responding to the event quickly.

Real Time:

While real time can mean many things in different environments, from microseconds to hours or even days, if your time horizon is overnight or every few days or longer, then you can do just as well with batch processing and you don’t need ESP.  For example, suppose you are monitoring the flow of social media comments about your business, but they are arriving relatively slowly, say a few per hour.  You may elect to store them and have your marketing team analyze and respond the next day in batch mode.  Aggregate social media trending on a daily basis is actually pretty fast, so tracking general trends in batch mode should be plenty, especially if there are not that many comments to evaluate.  However, if you’re an ecommerce giant receiving a fast stream of comments and are determined to respond to or address every negative comment within, say, minutes or hours, then you probably need ESP.

Trigger Events:

The need to respond quickly to a trigger event may trump frequency, especially if it falls in the category of very rare events.  Examples include systems monitoring patients’ vital signs in a hospital, sudden changes in machine operating characteristics that you previously determined mean the equipment may fail soon, or the detection of a fraudulent transaction.

This also highlights that a single data point, taken alone, is not particularly informative.  It is often by comparing that data to other data in the stream, or to mathematical norms like averages or standard deviations, that signals are detected.  So you may also want to define a time window of data that is held in memory for comparison.  That window may be only a few seconds, but it may also be much longer.
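
To make the windowing idea concrete, here’s a minimal Python sketch of a time window held in memory, with each incoming value compared to the window’s mean and standard deviation.  The class, the window length, and the three-sigma threshold are all illustrative choices, not any vendor’s API:

    from collections import deque
    from statistics import mean, stdev

    class SlidingWindow:
        """Hold (timestamp, value) pairs for the most recent `seconds` of the stream."""
        def __init__(self, seconds=60.0):
            self.seconds = seconds
            self.events = deque()

        def add(self, ts, value):
            self.events.append((ts, value))
            # Evict readings that have fallen out of the time window
            while self.events and ts - self.events[0][0] > self.seconds:
                self.events.popleft()

        def is_anomaly(self, value, sigmas=3.0):
            """Flag values more than `sigmas` standard deviations from the window mean."""
            values = [v for _, v in self.events]
            if len(values) < 10:            # too little history to judge
                return False
            mu, sd = mean(values), stdev(values)
            return sd > 0 and abs(value - mu) / sd > sigmas

    # Feed events as they arrive; test a reading against the current window
    w = SlidingWindow(seconds=60)
    w.add(0.0, 10.1); w.add(5.0, 9.9)
    print(w.is_anomaly(25.0))   # False here: fewer than 10 readings in the window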

ESP is always ‘in memory’. 

The single or multiple streams of in-bound data are said to be processed ‘at the edge’ of the system, in memory, before being persisted in storage.  In the next article we’ll talk more about the technology.  For now it is sufficient to know that well-designed ESP systems can process very dense streams of data, numbering millions of events per second, with latencies of only milliseconds.  The processing steps within ESP are relatively simple and can be handled in memory as events arrive, including distributing them among multiple processors in shared-nothing MPP systems.

There can actually be a number of steps in ESP processing, such as filtering, splitting into multiple streams, creating notifications, joining with existing data, and applying business rules or scoring algorithms, all of which happens ‘in memory’ at the ‘edge’ of the system before the data is passed into storage.
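
As a rough illustration of how those steps chain together, here’s a toy Python pipeline built from generators.  The event fields, the profile table, and the scoring function are invented stand-ins for real ESP operators:

    # Illustrative stand-ins for a live feed, a reference table, and a trained model
    raw_events = iter([{"customer_id": "c1", "amount": 250.0},
                       {"customer_id": "c2", "amount": 0.0}])
    customer_profiles = {"c1": {"segment": "gold"}}
    fraud_model = lambda e: 0.97 if e["amount"] > 200 else 0.05

    def filter_events(stream, keep):
        """Drop events that fail the predicate."""
        return (e for e in stream if keep(e))

    def enrich(stream, reference):
        """Join each event with reference data already held in memory."""
        return ({**e, **reference.get(e["customer_id"], {})} for e in stream)

    def score(stream, model):
        """Apply a pre-trained scoring function to each event."""
        return ({**e, "score": model(e)} for e in stream)

    # Events flow through every step one at a time, entirely in memory,
    # before anything is persisted
    pipeline = score(enrich(filter_events(raw_events, lambda e: e["amount"] > 0),
                            customer_profiles), fraud_model)
    for event in pipeline:
        print(event)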

Technologies and Platforms:

You can use Apache Storm or Apache Spark as the basis of your system, writing custom code to define the processing steps.  You can also use proprietary systems such as SAS Event Stream Processing that offer much easier drag-and-drop interfaces and don’t require coding.  Gartner reviews ESP vendors, and that’s a good place to start.
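
For a sense of what the custom-code route looks like, here’s a minimal Spark Streaming sketch in Python.  The host, port, and ‘ERROR’ filter are placeholders; a production job would read from a source like Kafka and apply real rules or models:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="esp-sketch")
    ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

    # Read a text stream from a socket; Kafka and file sources work similarly
    lines = ssc.socketTextStream("localhost", 9999)

    # A trivial in-stream filter step; real pipelines add windows, joins, and scoring
    alerts = lines.filter(lambda line: "ERROR" in line)
    alerts.pprint()

    ssc.start()
    ssc.awaitTermination()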

In-Stream Analytics: 

Here’s one area where ESP can mislead new users.  In-Stream Analytics is a feature of ESP and cannot exist separately from it.  ESP can apply business rules or even sophisticated predictive models, like scoring models, to the data stream and take action on the data based on those scores or rules.  However, brand new insights derived from analytics do not occur here.  They occur, as always, in separate analytic data stores, some in-memory but most simply in-database, where data scientists can examine the data and run analytic workloads against it, developing new models, new optimization routines, and new insights.

If you have a unique business need that requires you to create new predictive models, or to refresh existing ones using the most current in-bound streaming data, then providers like SAS offer very high performance in-memory analytic platforms that let data scientists make these discoveries and updates in minutes or hours, even on massive quantities of data.  The results can then be fed back into the ESP system in ‘near real time’.  It’s important to understand, however, that the development of new analytic insights occurs in analytic data stores and not directly in ESP.

Business Cases:

Let’s talk about some specific business cases where ESP is proving useful.

Fraud Detection

Fraud detection is a good place to begin our discussion since it deals with rare events that are difficult to detect and illustrates some of the limits of ESP. 

Even the most sophisticated methods of fraud detection tend to create large numbers of false positives.  The haystack gets much smaller, but not small enough to take automatic action, say by blocking your customer’s credit card transaction.  Typically a team of humans evaluates the flagged, fraud-likely events and makes a judgment call in near real time.  There may also be an additional layer of investigative analysts, for example evaluating redirect sites that may or may not be the source of watering-hole malware attacks, which requires both significant time and labor.

However, some, but not all, rules of fraud detection do rise to the level of near-automated action.  For example, when an in-bound card-present credit card transaction takes place close in time to a second one on the same card but physically far apart in geography, there is a high probability that one of the two is fraudulent.
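
A rule like that reduces to simple arithmetic on two time-stamped, geo-tagged events.  Here’s a hedged sketch; the field names and the 900 km/h cut-off are illustrative:

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two points, in kilometers."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + \
            cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 6371 * 2 * asin(sqrt(a))

    def implausible_travel(txn, prev_txn, max_kmh=900):
        """Flag two card-present transactions the cardholder could not plausibly
        have made in person (900 km/h is roughly a commercial flight)."""
        hours = (txn["ts"] - prev_txn["ts"]) / 3600.0
        if hours <= 0:
            return True
        km = haversine_km(txn["lat"], txn["lon"], prev_txn["lat"], prev_txn["lon"])
        return km / hours > max_kmh

    # New York at noon, then Los Angeles 30 minutes later -> flagged
    a = {"ts": 43200, "lat": 40.71, "lon": -74.01}
    b = {"ts": 45000, "lat": 34.05, "lon": -118.24}
    print(implausible_travel(b, a))   # True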

ESP can enhance the analysis of this case by adding rules and scoring based on historical customer transaction information, profiles, and even technical information from customers’ Internet sessions.  This allows the bank to set rules to automatically block the transaction or automatically send a text message query to the customer.  One bank reported increasing its fraud intercept rate to 95%, with accompanying improvements in revenue, decreases in the cost of fraud detection, and improved customer trust and satisfaction.

Another interesting example of streaming analytics is detecting the potentially fraudulent authorization of gift cards.  One organization set rules comparing the number of cards authorized at a location to the number sold there during the previous few days, and added a comparison of the volume to the standard deviation of transactions authorized over a similar period.  If ESP detected that these business rules were violated, further authorizations at that site could be shut down until a human could evaluate the situation.
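
That comparison can be expressed as a one-line statistical rule.  A sketch, with invented numbers:

    from statistics import mean, stdev

    def gift_card_hold(authorized_today, recent_daily_sales, sigma_limit=3.0):
        """Hold further authorizations when today's volume is far out of line
        with recent sales at this location (thresholds are illustrative)."""
        if len(recent_daily_sales) < 2:
            return False
        mu, sd = mean(recent_daily_sales), stdev(recent_daily_sales)
        return authorized_today > mu + sigma_limit * sd

    # 480 authorizations today vs. a typical 40-60 cards sold per day
    print(gift_card_hold(480, [44, 52, 38, 61, 47]))   # True -> human review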

Financial Markets Trading

There are two applications that gave ESP its earliest start.  One is financial markets trading and the other is the monitoring of capital intensive equipment where sensor data has long been captured and analyzed.

Automated high-frequency trading systems now account for between 75% and 85% of the volume on all major exchanges.  They compete on the accuracy of their algorithms and also on the time needed to receive, analyze, and act on new data.  Trading advantages are often measured in milliseconds.

This illustrates two features of ESP.  First is its ability to ingest multiple high-volume, high-speed streams of input, such as the stream of transactions from each major exchange.  Second, automated high-frequency traders may have hundreds of models that all need simultaneous access to the data (sequential evaluation would be unacceptably slow).  ESP can split the incoming stream into multiple copy streams, each of which runs simultaneously against its own scoring model.
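
The fan-out idea can be sketched in a few lines of Python.  Real trading systems use purpose-built, microsecond-latency infrastructure rather than a thread pool, and the two toy ‘models’ here are invented; the point is only that every model sees the same event at once:

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical pre-trained scoring functions; a real system might run hundreds
    models = {
        "momentum":  lambda tick: tick["price"] - tick["open"],
        "liquidity": lambda tick: tick["volume"] / max(tick["avg_volume"], 1),
    }

    pool = ThreadPoolExecutor(max_workers=len(models))

    def score_all(tick):
        """Fan one event out to every model at once rather than scoring sequentially."""
        futures = {name: pool.submit(fn, tick) for name, fn in models.items()}
        return {name: f.result() for name, f in futures.items()}

    tick = {"price": 101.2, "open": 100.9, "volume": 5000, "avg_volume": 4200}
    print(score_all(tick))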

IoT and Capital Equipment Intensive Industries

Not all modern sensor applications occur in capital-equipment-intensive industries.  Some simply pay attention to your thermostat settings, particular operations of your car, or the number of steps you take.  But capital-intensive industries like power generation, mining and extraction, transportation, and heavy manufacturing were among the first to use networked sensors and collect that data for analysis.  Over the last decade sensors have become smaller, cheaper, more adaptable, and better at communicating.  While that data was originally evaluated by operational historians offline and in batch, ESP now allows real-time detection of fault conditions.  The most frequent examples are in predictive asset maintenance.

Using historical data in analytic data stores, data scientists and engineers develop models that signal the onset of a condition that may shortly lead to an unplanned failure or interruption.  Applying that model to the ESP stream allows the earliest possible detection and prevention of the failure.
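
The division of labor is worth seeing in miniature: the model is built offline, and the stream merely applies it to each reading.  A sketch with invented sensor fields and thresholds:

    FAILURE_THRESHOLD = 0.8   # illustrative cut-off chosen during offline training

    def trained_failure_model(reading):
        """Stand-in for a model fitted offline on historical sensor data."""
        return min(1.0, reading["vibration"] / 10.0 + reading["temp_c"] / 200.0)

    def on_sensor_event(reading, alert):
        """The only work done in the stream: apply the model and act on the score."""
        risk = trained_failure_model(reading)
        if risk > FAILURE_THRESHOLD:
            alert("Asset %s: failure risk %.2f, schedule maintenance"
                  % (reading["asset_id"], risk))

    on_sensor_event({"asset_id": "pump-7", "vibration": 6.2, "temp_c": 95}, print)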

Separately developed models can also be used for optimization of complex systems of networked devices or resources.  ESP is extensively used in networking optimization of power grids and even in traffic control systems to speed your commute home.

Health and Life Sciences

Moving toward more direct benefit to humans, ESP is being rapidly adopted by major hospitals.  The bedside devices that measure vital signs are now networks of sensors feeding data via ESP into a central evaluation system.  The central system evaluates the stream of data based on business rules (well-established medical guidelines, such as a specific blood pressure measure combined with pulse and respiration rate that requires immediate attention).  It also applies more sophisticated predictive models to the same data with a similar intent: to send an alert to the right set of doctors and nurses to take action at exactly the right time.
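
A guideline of the ‘blood pressure combined with pulse combined with respiration’ kind is just a composite rule over the latest readings.  A toy sketch; the thresholds are made up for illustration and are not medical guidance:

    def vitals_alert(v):
        """Composite rule: no single reading pages the care team, the combination does.
        Thresholds are illustrative examples only, not medical guidance."""
        hypotensive     = v["systolic_bp"] < 90
        tachycardic     = v["pulse"] > 120
        rapid_breathing = v["resp_rate"] > 24
        return hypotensive and tachycardic and rapid_breathing

    print(vitals_alert({"systolic_bp": 85, "pulse": 132, "resp_rate": 28}))   # True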

Marketing Effectiveness

Trigger events aren’t just heart attacks and machine malfunctions.  They can be specific customer actions.  We all know the role of predictive models in improving cross-sell, upsell, and churn prevention.  In the past these models predicted which customers were most likely to respond based on historical behavior covering weeks, months, or even longer.  We then implemented the models through campaigns at times of our choosing, not necessarily when the customer was most receptive.  ESP changes that by letting us fine-tune and deploy our models based on very specific customer behavior in near real time.

In one example, a major telecommunications company found that a specific upsell model could be made much more accurate and effective by tying it to the moment its customers recharged their prepaid accounts.  Once the model was developed, it could be implemented through ESP so that when a prepaid recharge transaction was detected, an SMS promotion could be sent to the customer while the transaction was still under way.
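
In ESP terms this is a trigger-and-score pattern: detect one event type, score it with the pre-built model, and act immediately.  A hedged sketch with invented field names and toy stand-ins for the model and SMS gateway:

    def on_transaction(txn, upsell_model, send_sms):
        """Fire the offer while the recharge is still in flight (names illustrative)."""
        if txn["type"] != "prepaid_recharge":
            return
        offer = upsell_model(txn["customer_id"], txn["amount"])
        if offer:
            send_sms(txn["msisdn"], "Thanks for recharging! " + offer)

    # Toy stand-ins for the trained model and the SMS gateway
    model = lambda cust, amt: "Double your data for $2?" if amt >= 10 else None
    on_transaction({"type": "prepaid_recharge", "customer_id": "c42",
                    "amount": 10, "msisdn": "+15551234567"},
                   model, lambda to, msg: print(to, msg))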

Separately, the company was able to greatly enhance the effectiveness of other scoring models by incorporating cellphone usage patterns.  When those patterns are detected in the ESP stream, they trigger individualized offers, with great success.

Retail Optimization

Here are just a few of the ideas for ESP that have been implemented in the retail world:

Promoting In-Store Shopping Frequency and Cross Sell

To promote in-store shopping, the retailer sends customers personalized, optimized email promotions with sales and offers based on each customer’s shopping history and local store inventory.  ESP monitors in-store routers and detects when customers (and their mobile devices) enter the store, then looks up customer details and histories.  Existing promotional models evaluate the customer’s history and determine an optimal set of offers to push to the in-store customer via SMS or email.

In-Store Price Checking:  Increasingly, customers use their mobile devices to compare competitors’ prices while in the store.  One retailer monitors in-store Wi-Fi clickstreams to detect when a customer accesses a price comparison site; retrieves the IP address, device ID, and phone number; uses this information to look up the customer’s existing profile; and determines whether the customer is a candidate for a promotion.  Existing models identify the best offer and send it to the customer within seconds.
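
Stripped to its essentials, that workflow is a lookup chained to a model call, triggered by one kind of clickstream event.  A sketch with hypothetical names:

    # Illustrative domain list; a real deployment would maintain this carefully
    PRICE_COMPARISON_SITES = {"pricegrabber.example", "shopcompare.example"}

    def on_wifi_click(event, profiles, best_offer, push):
        """Spot a price-comparison hit on in-store Wi-Fi and answer with an offer."""
        if event["host"] not in PRICE_COMPARISON_SITES:
            return
        customer = profiles.get(event["device_id"])     # device ID -> known customer?
        if customer is None:
            return
        offer = best_offer(customer, event["host"])
        if offer:
            push(customer["phone"], offer)

    profiles = {"dev-9": {"phone": "+15550001111", "segment": "frequent"}}
    on_wifi_click({"host": "pricegrabber.example", "device_id": "dev-9"},
                  profiles,
                  lambda cust, site: "10% off if you buy in-store today",
                  lambda to, msg: print(to, msg))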

Creating New Sales from Product Returns:  When a customer comes to a store to return an item, ESP instantly retrieves the details from the scanned receipt.  Existing models analyze the customer’s history, recommend a specific sales-staff interaction, and generate and send coupon codes to the customer’s mobile device for alternative replacements currently in stock.  In addition, other promotions that the customer may find interesting are sent.

So our takeaway is this: ESP is real time and in-memory.  In-stream analytics are part of ESP and can score, select, and send preferred actions to a customer, also in real time.  However, the creation of new analytic insights does not occur within ESP but rather in traditional analytic data stores.  These are not typically real time, but they can have cycles as short as a few hours, and their results can then be reintroduced into ESP.

ESP is an extremely powerful new technology that lets us get closer to events and to our customers.  If your needs are truly not real time, then ESP may not be for you.  But perhaps some of these examples, particularly in retail and marketing, will spur you to take your analytics to the next level.

 

About the author:  Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  Bill is also Editorial Director for Data Science Central.  He can be reached at:

[email protected] or [email protected]

 The original blog can be seen here.
