A methodology for solving problems with DataScience for Internet of Things - Part One


This two-part blog post is based on my forthcoming book: Data Science for Internet of Things.

It is also the basis for the course I teach, the Data Science for Internet of Things Course. I will be syndicating sections of the book on the Data Science Central blog. I welcome your comments. Please email me at ajit.jaokar at  - email me also for a PDF version if you are interested in joining the course.


Here, we start off with the question: at which points could you apply analytics to the IoT ecosystem, and what are the implications? We then extend this to a broader question: could we formulate a methodology to solve Data Science for IoT problems? I have illustrated my thinking through a number of companies and examples. I personally work with an Open Source strategy (based on R, Spark and Python), but the methodology applies to any implementation. We are currently working with a range of implementations including AWS, Azure, GE Predix, Nvidia etc. Thus, the discussion is vendor agnostic.

I also mention some trends I am following, such as Apache NiFi.

The Internet of Things and the flow of Data

As we move towards a world of 50 billion connected devices, Data Science for IoT (IoT analytics) helps to create new services and business models. IoT analytics is the application of data science models to IoT datasets. The flow of data starts with the deployment of sensors. Sensors detect events or changes in quantities and provide a corresponding output in the form of a signal. Historically, sensors have been used in domains such as manufacturing. Now their deployment is becoming pervasive through ordinary objects like wearables. Sensors are also being deployed through new devices like robots and self-driving cars. This widespread deployment of sensors has led to the Internet of Things.


Features of a typical wireless sensor node are described in this paper (wireless embedded sensor architecture). Typically, data arising from sensors is in time series format and is often geotagged. This means there are two forms of analytics for IoT: time series and spatial analytics. Time series analytics typically leads to insights such as anomaly detection, so classifiers that detect anomalies are commonly used in IoT analytics. But by looking at historical trends, streaming, and combining data from multiple events (sensor fusion), we can get new insights. And more use cases for IoT keep emerging, such as augmented reality (think: Pokemon Go + IoT).


Meanwhile, sensors themselves continue to evolve. Sensors have shrunk due to technologies like MEMS. Their communications protocols have also improved through new technologies like LoRa. These protocols lead to new forms of communication for IoT such as Device to Device, Device to Server, or Server to Server. Thus, whichever way we look at it, IoT devices create a large amount of data. Typically, the goal of IoT analytics is to analyse the data as close to the event as possible. We see this requirement in many ‘Smart city’ type applications such as transportation, energy grids, utilities like water, street lighting, parking etc.

IoT data transformation techniques


Once data is captured through the sensor, there are a few analytics techniques that can be applied to it. Some of these are unique to IoT. For instance, not all data may be sent to the Cloud/Lake. We could perform temporal or spatial analysis. Considering the volume of data, some may be discarded at source or summarized at the Edge. Data could also be aggregated, and aggregate analytics could be applied to those aggregates at the ‘Edge’. For example, if you want to detect failure of a component, you could find spikes in values for that component over a recent span (thereby potentially predicting failure). You could also correlate data in multiple IoT streams. Typically, in stream processing, we are trying to find out what is happening now (as opposed to what happened in the past). Hence, the response should be near real-time. Sensor data could also be ‘cleaned’ at the Edge: missing values could be filled in (imputing values), sensor data could be combined to infer an event (Complex event processing), data could be normalized across sensors, time and devices, and we could handle different data formats or multiple communication protocols and manage thresholds.
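As a rough illustration of the Edge-side steps described above (imputing missing values, summarizing a block of readings, and finding spikes), here is a minimal Python sketch. The function names, the forward-fill imputation rule, the block size and the spike threshold are all assumptions made for this example, not part of any particular Edge platform.

```python
from statistics import mean


def forward_fill(readings):
    """Impute a missing reading (None) with the last value seen."""
    filled, last = [], None
    for value in readings:
        if value is None:
            value = last  # stays None only if nothing has been seen yet
        filled.append(value)
        last = value
    return filled


def summarize(readings, window):
    """Reduce each block of `window` readings to (min, mean, max)."""
    summaries = []
    for start in range(0, len(readings), window):
        block = [v for v in readings[start:start + window] if v is not None]
        if block:
            summaries.append((min(block), mean(block), max(block)))
    return summaries


def spikes(readings, threshold):
    """Return indexes where a reading jumps by more than `threshold`."""
    return [
        i for i in range(1, len(readings))
        if readings[i] is not None
        and readings[i - 1] is not None
        and abs(readings[i] - readings[i - 1]) > threshold
    ]


# Hypothetical temperature readings with two dropped samples.
raw = [20.0, None, 21.0, 20.5, None, 35.0, 21.0, 20.0]
clean = forward_fill(raw)
edge_summary = summarize(clean, window=4)   # what might be sent upstream
spike_indexes = spikes(clean, threshold=5.0)
```

Only `edge_summary` (two tuples instead of eight readings) would need to leave the device in this sketch, which is the point of summarizing at the Edge.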



Applying IoT Analytics to the Flow of Data



Here, we address the possible locations and types of analytics that could be applied to IoT datasets.

Some initial thoughts:

  • IoT data arises from sensors and ultimately resides in the Cloud.
  • We use the concept of a ‘Data Lake’ to refer to a repository of data.
  • We consider four possible avenues for IoT analytics: ‘Analytics at the Edge’, ‘Streaming Analytics’, NoSQL databases and ‘IoT analytics at the Data Lake’.
  • For Streaming analytics, we could build an offline model and apply it to a stream.
  • If we consider cameras as sensors, Deep learning techniques (for example CNNs) could be applied to image and video datasets.
  • Even when IoT data volumes are high, not all scenarios need data to be distributed. It is very much possible to run analytics on a single node, using a non-distributed architecture with Python or R.
  • Feedback mechanisms are a key part of IoT analytics. Feedback is part of multiple IoT analytics modalities, e.g. Edge, Streaming etc.
  • CEP (Complex event processing) can be applied at multiple points, as we see in the diagram.


We now describe various analytics techniques which could apply to IoT datasets.

Complex event processing


Complex Event Processing (CEP) can be used at multiple points for IoT analytics (e.g. Edge, Stream, Cloud etc.).


In general, event processing is a method of tracking and analyzing streams of data and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.


In CEP, the data is in motion. In contrast, a traditional query (e.g. against an RDBMS) acts on static data. Thus, CEP is mainly about stream processing, but the algorithms underlying CEP can also be applied to historical data.


CEP relies on a number of techniques for events, including pattern detection, abstraction, filtering, aggregation and transformation. CEP algorithms model event hierarchies and detect relationships (such as causality, membership or timing) between events. They create an abstraction of event-driven processes. Thus, typically, CEP engines act as event correlation engines: they analyze a mass of events, pinpoint the most significant ones, and trigger actions.


Most CEP solutions and concepts can be classified into two main categories: aggregation-oriented CEP and detection-oriented CEP. An aggregation-oriented CEP solution is focused on executing on-line algorithms as a response to event data entering the system, for example continuously calculating an average based on data in the inbound events. Detection-oriented CEP is focused on detecting combinations of events, called event patterns or situations, for example by looking for a specific sequence of events. For IoT, CEP techniques are concerned with deriving a higher order value or abstraction from discrete sensor readings.
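To make the two categories concrete, here is a small Python sketch with one class of each kind. The class names, event names and matching rule (other events may occur between the steps of the pattern) are assumptions made for this example. Real CEP engines offer far richer pattern languages, including time constraints.

```python
class RunningAverage:
    """Aggregation-oriented: update an average online as each event arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean: no need to store past events.
        self.mean += (value - self.mean) / self.count
        return self.mean


class SequenceDetector:
    """Detection-oriented: fire when event types occur in a given order.

    Unrelated events in between are ignored (an assumption of this sketch).
    """

    def __init__(self, pattern):
        self.pattern = list(pattern)
        self.position = 0

    def update(self, event_type):
        if event_type == self.pattern[self.position]:
            self.position += 1
            if self.position == len(self.pattern):
                self.position = 0  # reset so the pattern can match again
                return True
        return False


avg = RunningAverage()
for reading in [10.0, 20.0, 30.0]:
    avg.update(reading)

# Hypothetical event names for a failing pump.
detector = SequenceDetector(["temp_high", "vibration_high", "pressure_drop"])
events = ["temp_high", "door_open", "vibration_high", "pressure_drop", "temp_high"]
fired = [detector.update(e) for e in events]
```

In this sketch the detector turns three discrete sensor events into one higher order event, which is the kind of abstraction described above.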


CEP uses techniques like Bayesian networks, neural networks, Dempster-Shafer methods, Kalman filters etc. Some more background at Developing a complex event processing architecture for IoT
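As one example of these techniques, the sketch below shows a one-dimensional Kalman filter smoothing a noisy sensor reading. It assumes the simplest possible model (the true value is roughly constant between readings), and the variance values are made-up numbers chosen for illustration, not tuned for any real sensor.

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate=0.0, initial_var=1.0):
    """Smooth scalar readings, assuming the true value changes slowly."""
    estimate, variance = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the value is assumed unchanged, but uncertainty grows.
        variance += process_var
        # Update: blend the prediction and the new measurement.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1 - gain)
        estimates.append(estimate)
    return estimates


# Hypothetical noisy readings around a true value of about 10.
noisy = [10.2, 9.7, 10.4, 9.9, 10.1]
smoothed = kalman_1d(noisy, process_var=0.01, measurement_var=0.25,
                     initial_estimate=noisy[0])
```

A larger `measurement_var` makes the filter trust each new reading less, so the output is smoother but reacts more slowly to real changes.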


Streaming analytics

Real-time systems differ in the way they perform analytics. Specifically, real-time systems perform analytics on short time windows for data streams. Hence, the scope of real-time analytics is a ‘window’, which typically comprises the last few time slots. Making predictions on real-time data streams involves building an offline model and applying it to a stream. Models incorporate one or more machine learning algorithms, which are trained using the training data. Models are first built offline based on historical data (for example, for spam or credit card fraud detection). Once built, the model can be applied to the real-time stream to find deviations. Deviations beyond a certain threshold are tagged as anomalies.
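The offline-model-then-stream pattern can be sketched in a few lines of Python. Here the "model" is deliberately trivial, just the mean and standard deviation of historical readings, standing in for whatever machine learning model would be trained in practice. The window size, threshold and data are assumptions made for this example.

```python
from collections import deque
from statistics import mean, stdev


def train_offline(history):
    """Offline step: learn a baseline (mean, spread) from historical data."""
    return mean(history), stdev(history)


def score_stream(stream, model, window=3, threshold=3.0):
    """Online step: yield (time slot, window mean) for anomalous windows."""
    baseline, spread = model
    recent = deque(maxlen=window)  # sliding window of the last few slots
    for t, value in enumerate(stream):
        recent.append(value)
        window_mean = mean(recent)
        # Deviation measured in units of the historical standard deviation.
        if abs(window_mean - baseline) / spread > threshold:
            yield t, window_mean


# Hypothetical historical readings and a live stream with a jump in it.
history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.0, 19.7]
model = train_offline(history)
stream = [20.0, 20.1, 19.9, 23.0, 23.5, 20.0, 20.1]
anomalies = list(score_stream(stream, model))
```

Because the score is computed over a window rather than a single reading, the anomaly flag persists for a few slots after the stream returns to normal. That lag is a trade-off of windowing.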


IoT ecosystems can create many logs depending on the status of IoT devices. By collecting these logs for a period of time and analyzing the sequence of event patterns, a model to predict a fault can be built, including the probability of failure for the sequence. This model to predict failure is then applied to the stream (online). A technique like the Hidden Markov Model can be used for detecting failure patterns based on the observed sequence. Complex Event Processing can be used to combine events over a time frame (e.g. the last one minute) and correlate patterns to detect the failure pattern.

Typically, streaming systems could be implemented with Kafka and Spark.
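To show how a Hidden Markov Model turns an observed log sequence into a failure probability, here is a minimal Python sketch of the forward algorithm with normalization at each step. The two hidden states, the three log levels and every probability below are invented for illustration. In practice they would be estimated from the collected logs.

```python
def normalize(dist):
    """Scale a dict of non-negative weights so that they sum to 1."""
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}


def forward(observations, states, start_p, trans_p, emit_p):
    """Forward algorithm: state probabilities after each observation."""
    belief = normalize(
        {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    )
    history = [belief]
    for obs in observations[1:]:
        prev = history[-1]
        belief = normalize({
            s: emit_p[s][obs] * sum(prev[p] * trans_p[p][s] for p in states)
            for s in states
        })
        history.append(belief)
    return history


# Hypothetical model: a device is either healthy or degrading (hidden),
# and emits log lines at level ok / warn / error (observed).
states = ("healthy", "degrading")
start_p = {"healthy": 0.9, "degrading": 0.1}
trans_p = {
    "healthy": {"healthy": 0.9, "degrading": 0.1},
    "degrading": {"healthy": 0.05, "degrading": 0.95},
}
emit_p = {
    "healthy": {"ok": 0.85, "warn": 0.1, "error": 0.05},
    "degrading": {"ok": 0.3, "warn": 0.3, "error": 0.4},
}

risky = forward(["ok", "warn", "error", "error"],
                states, start_p, trans_p, emit_p)
calm = forward(["ok", "ok", "ok", "ok"],
               states, start_p, trans_p, emit_p)
p_fail_risky = risky[-1]["degrading"]
p_fail_calm = calm[-1]["degrading"]
```

Applied online, the last entry of `history` is the current estimate of the device state, and an alert could be raised when the probability of the degrading state crosses a chosen threshold.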


Some interesting links on streaming I am tracking:


Newer versions of Kafka designed for IoT use cases

Data Science Central: stream processing and streaming analytics how...

Iot 101 everything you need to know to start your iot project – Par...

Iot 101 everything you need to know to start your iot project – Par...



Part two will consider more technologies, including Edge processing and Deep learning.

If you want to be a part of my course, please see the testimonials at Data Science for Internet of Things Course.


Comment by ajit jaokar on August 7, 2016 at 11:39am
Comment by Manish Kurse on August 7, 2016 at 11:36am

Thanks for the article, Ajit. This is insightful. I look forward to Part 2!

Comment by Sione Palu on July 25, 2016 at 9:08am

ajit jaokar, the paper you cited above : "wireless embedded sensor architecture" is from the field of electronics not data science per se. Electronics is a different field although, there's some overlap with data science in the domain of signal processing, image processing, control system.

This is why I stated in my previous message that the title is almost meaningless. A lot of readers here haven't got backgrounds to electronics, so they may confuse coming to DS Central to read about data analytics & yet encounter articles on electronics. Hey, its your blog, which you can write about anything, but since data science is new, such article like this may confuse readers. What is data science? Data analytics? Or Digital circuit design? Which one?

Comment by ajit jaokar on July 22, 2016 at 8:27pm

@sione - thanks for your comment. This is a two part article because of the length. the title will be clearer when you read both parts. I also take a vendor agnostic view i.e. abstracting the common elements of solving problems in an independent way kind rgds Ajit

Comment by Sione Palu on July 22, 2016 at 6:17pm

The title of this article is meaningless?

"A methodology for solving problems with DataScience for Internet of Things"

It is already happening right now in the real world. Look no further than Google, Microsoft, Amazon, and others who are leading in this domain. And researchers at those companies don't hang around Data-Science-Central asking meaningless questions to members of DSC for tips on how to "solving problems with DataScience for Internet of Things". They just simply do it.
