Data Science for Internet of Things methodology - Evolving CRISP-DM - Part One

By Jean-Jacques Bernard and Ajit Jaokar

This set of blog posts is part of the book/course on Data Science for the Internet of Things. We welcome your comments at jjb at cantab dot net.  Jean-Jacques Bernard  has been a founding member of the Data Science for Internet of Things Course. Please email at ajit.jaokar at futuretext.com if you are interested in joining the course.

Introduction

The rapidly expanding domain of the Internet of Things (IoT) requires new analytical tools. In a previous post by Ajit Jaokar, we addressed the need for a generic methodology for IoT Analytics. Here, we expand on those ideas further.

Thus, the aim of this document is to capture the specific elements that make up Data Science in an Internet of Things (IoT) context. Ultimately, we will provide a high-level methodology with key phases and activities, with links to specific templates and content for each of those activities.

We believe that one of the best methodologies for undertaking Data Science is CRISP-DM. This seems to be the view of the majority of data scientists, as the latest KDnuggets poll shows. Therefore, we have loosely based the methodology on CRISP-DM.

We have also linked the methodology to the technical framework proposed earlier in Data Science for Internet of Things – A Problem Solving Methodology, which aims to provide a technical framework for solving IoT problems with Data Science.

The methodology we propose is divided into the following four phases:

  1. Problem Definition
  2. Preparation
  3. Modelling
  4. Continuous Improvement


We describe the first three phases in this post; the last phase, along with the detailed activities and deliverables of each phase, will be covered in upcoming posts.

The relationships between the first three phases are presented in the figure below.

[Figure: High-level description of the methodology]

Problem definition

This first phase is concerned with the understanding of the problem. It is important to define the terms here in the context of IoT.

By problem, we mean something that needs to be solved, addressed or changed, either to remove or reverse an existing situation or to create a new one. The end situation should, of course, be better than the initial one. In the context of IoT, solving a problem means both solving the initial problem and providing incremental feedback.

In any business context, to manage scarce resources, it is necessary to provide a business case for projects. Thus, it is important to define an IoT Analytics business case, which will provide a baseline understanding (i.e. measurement of the initial situation) of the problem through Key Performance Indicators (KPIs). In addition, the business case must provide a way to measure the impact of the project on the defined KPIs as well as a project timeline and project management methodology. The timeline and project management methodology should include deployment, a critical activity for large scale IoT Analytics projects.
The baseline and the measurement of the impact will be used to understand whether the IoT Analytics project has reached its goals.

For instance, in the case of a Smart City project aiming at reducing road congestion, KPIs such as the number of congestion points and the average duration of congestion at those points can be used to understand whether the project had a positive impact.
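As a minimal sketch of such a KPI baseline (the congestion-event table and its column names are hypothetical), the two KPIs above could be computed along these lines:

```python
import pandas as pd

# Hypothetical congestion-event log: one row per congestion episode,
# with its location and start/end timestamps.
events = pd.DataFrame({
    "location": ["A1", "A1", "B7", "C3"],
    "start_time": pd.to_datetime(
        ["2016-08-01 07:55", "2016-08-01 17:40",
         "2016-08-01 08:10", "2016-08-01 18:05"]),
    "end_time": pd.to_datetime(
        ["2016-08-01 08:30", "2016-08-01 18:20",
         "2016-08-01 08:50", "2016-08-01 18:30"]),
})

# KPI 1: number of distinct congestion points.
n_congestion_points = events["location"].nunique()

# KPI 2: average duration of congestion, in minutes.
durations = (events["end_time"] - events["start_time"]).dt.total_seconds() / 60
print(n_congestion_points, durations.mean())
```

Running the same computation before and after the project gives both the baseline and the measurement of the impact.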

However, defining the problem and understanding how to measure it might be harder than it sounds, as pointed out here and here.

Preparation

The second phase of the methodology is concerned with data collection, preparation (i.e. cleaning) and exploration. However, in the context of IoT, the sources of data are more diverse than in other data science set-ups, and there are also elements of architecture to consider before starting more classic exploratory work.

Therefore, we believe there are three types of activities in this phase:

  1. Define the data requirements
  2. Select and design the IoT architecture
  3. Collect, clean, explore and build data

 
Those three activities are to be conducted iteratively, until the data built fits the problem we are trying to solve.

First, we need to define the data needed to solve the problem defined previously, as well as its characteristics. We need to do so in the context of the IoT vertical we are working in. Examples of IoT verticals include (non-exhaustive):

  • Smart homes
  • Retail
  • Healthcare
  • Smart cities
  • Energy
  • Transportation
  • Manufacturing (Industrie 4.0 or Industrial Internet)
  • Wearables


The selection and design of the IoT architecture focuses on two parts: the network of devices and the processing infrastructure.

The first part is concerned with the set-up of the network of devices that will be used to measure and monitor some parameters of the environment of the problem. The design of this network is outside the scope of this article, but it is nonetheless important (for more information on this topic, you can refer to this article from Deloitte University Press). Some of the key considerations are:

  • Availability & security
  • Latency & timeliness
  • Frequency
  • Accuracy & reliability
  • Dumb vs. smart devices


Those elements will determine some of the characteristics of the data that will be collected from the network of devices (in essence, this is a kind of meta-data). Those characteristics will be used to establish the processing infrastructure.

For instance, these characteristics will help in choosing whether edge devices are needed, whether event collectors are best suited, and so on. See our previous article for an in-depth treatment of potential processing infrastructures for IoT.
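As an illustration only (the field names and thresholds below are our own, not drawn from any standard), such characteristics could be captured explicitly and used to drive a first, rough infrastructure decision:

```python
from dataclasses import dataclass

@dataclass
class StreamCharacteristics:
    """Meta-data describing the data a device network will produce.

    Fields and thresholds are illustrative, not normative.
    """
    frequency_hz: float    # how often each device reports
    max_latency_ms: float  # how quickly results are needed
    smart_device: bool     # can the device run analytics itself?

def suggest_processing_tier(c: StreamCharacteristics) -> str:
    """Very rough rule of thumb for where the analytics could run."""
    if c.smart_device and c.max_latency_ms < 50:
        return "device"
    if c.max_latency_ms < 500 or c.frequency_hz > 100:
        return "edge"  # too tight/too fast for a round-trip to the cloud
    return "cloud / data lake"

# Example: a 200 Hz sensor needing sub-second reactions -> edge.
print(suggest_processing_tier(
    StreamCharacteristics(frequency_hz=200, max_latency_ms=300,
                          smart_device=False)))
```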

The final activities are the collection, cleaning and exploration of the available data. This is typical data analytics work, where the practitioner cleans up the available data, explores its properties, and so on. It is also the step where additional data can be built on the basis of the data available. However, it is also the step where it can become clear that the data provided by the IoT architecture is either insufficient or not processed in the right way.
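A minimal sketch of this kind of clean-up and feature building, assuming a hypothetical raw sensor feed with gaps and glitches:

```python
import pandas as pd

# Hypothetical raw sensor feed: irregular timestamps, a gap and a glitch.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-08-01 00:00", "2016-08-01 00:01", "2016-08-01 00:04",
         "2016-08-01 00:05", "2016-08-01 00:06"]),
    "reading": [21.2, 21.3, 999.0, 21.4, 21.5],  # 999.0 is a sensor glitch
}).set_index("timestamp")

# Drop implausible values (domain knowledge: readings live in [-40, 60]).
clean = raw[raw["reading"].between(-40, 60)]

# Resample onto a regular 1-minute grid and interpolate the gaps.
regular = clean.resample("1min").mean().interpolate()

# Build additional data from what is available, e.g. a rolling mean.
regular["reading_smooth"] = regular["reading"].rolling(3, min_periods=1).mean()

print(regular.describe())  # basic exploration of the resulting properties
```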

This is why this phase is iterative: the learnings from the last step can be used to refine the IoT architecture until the data fits the problem to be solved.

Modelling

The modelling phase is where models are built and evaluated, both from a statistical standpoint and from a problem-solving standpoint. It is composed of three activities:

  1. Design a model to solve the problem
  2. Evaluate the model
  3. Deploy the model and the architecture


As with the preparation phase, those activities are to be conducted iteratively, until the model:

  • Is statistically sound;
  • Shows potential to solve the problem (i.e. impact on KPIs defined in the problem definition phase).


In this phase, the data scientist will choose among different types of algorithms, depending on the problem to solve, and build models using those.

As described in the previous article, many algorithms and techniques are applicable, such as time series analytics, complex event processing (CEP) or deep learning. An important element, linked to the activities of the previous phase, is where the analytics will be applied. While this should be part of the design of the IoT architecture, it will also guide the choice of algorithm. Indeed, we can apply analytics at:

  • The device (if we use smart devices)
  • The edge
  • The data lake / Cloud
  • Etc.


In addition, the type of analytics will depend on the type of processing we are focusing on: batch vs. streaming.
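As a toy illustration of the streaming side (a stand-in for a real CEP engine; the window size and threshold are arbitrary), a sliding-window anomaly detector keeps only a fixed-size state and could therefore run at the edge or even on a smart device:

```python
from collections import deque

def streaming_anomalies(readings, window=20, threshold=3.0):
    """Flag readings that deviate strongly from the recent window mean.

    A toy stand-in for stream processing: the only state is one
    fixed-size window, so the memory footprint suits edge deployment.
    """
    buf = deque(maxlen=window)
    for t, value in readings:
        if len(buf) >= 5:  # wait for a minimal history
            mean = sum(buf) / len(buf)
            std = (sum((x - mean) ** 2 for x in buf) / len(buf)) ** 0.5
            if std > 0 and abs(value - mean) > threshold * std:
                yield t, value  # anomaly: report upstream
        buf.append(value)

# Fabricated feed with one spike injected at t=50.
feed = [(t, 20.0 + 0.1 * (t % 3)) for t in range(100)]
feed[50] = (50, 35.0)
print(list(streaming_anomalies(feed)))  # -> [(50, 35.0)]
```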

Once the model has been designed come the evaluation activities, which should first evaluate the model using classic statistical and data science techniques: training, validation and testing datasets; minimizing bias and variance; and trading off precision and recall.
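A hedged sketch of the statistical side of this evaluation, using scikit-learn with placeholder data and a placeholder model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the features built during preparation.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Hold out a test set; a validation split (or cross-validation) would be
# carved out of the training portion for model selection and tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```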

Then, the model should be evaluated from a business point of view: does it improve the KPIs that were set during the problem definition phase? Measuring the improvement might not be possible until the model is deployed; it is therefore important to keep an improvement loop over this phase and the previous one. If the model does not improve the KPIs defined in the problem definition phase, it is necessary to rework from the preparation phase, since some of the assumptions underlying the data may be wrong.

When the model is considered sound and solves the problem it was designed to solve, it is time to deploy it together with its IoT architecture. The deployment of the architecture and the model is a project in itself and should come with its own project management structure and timeline, as defined in the problem definition phase.

In upcoming posts, we will present the continuous improvement phase and explore the detailed activities and deliverables of each of the phases presented here.

To conclude on this post, we welcome your comments. Please email ajit.jaokar at futuretext.com if you are interested in joining the course.


Comment by Som Shahapurkar on August 31, 2016 at 10:21am

I seem to have embedded the wrong link into "Here". The correct reference is below...

A survey of Knowledge Discovery and Data Mining process models - by Mariscal et al.

Also appeared in KDnuggets: http://www.kdnuggets.com/2010/06/pub-survey-data-mining-knowledge-d...

Comment by ajit jaokar on August 29, 2016 at 3:02am

@Som - thanks again. sorry I missed this too. we are connected on linkedin and very happy to share more with you

Comment by Jean-Jacques Bernard on August 29, 2016 at 2:53am

@Som, thanks for the comment, and sorry for the late reply, I was on vacation :-). I fully agree on deployment, this is where the rubber hits the road and it is critical to ensure that what has been designed can be deployed.

In addition to having DfD, I believe having a well structured project helps (and the larger the scope is, the more critical it becomes), so strong project management is important here. We will address this in an upcoming post.

Comment by Som Shahapurkar on August 19, 2016 at 6:55am

@JJ, nice start - I particularly like the part where a large emphasis is given to the Evaluation phase as it determines if all the investment of the previous phases is going to bring in a return. However, I think the deployment phase is equally important and is often neglected - it is the true execution phase. Just like machines and hardware are designed-for-test (DFT) and designed-for-manufacturing (DFM) it is high time analytics (algorithms) are designed-for-deployment (DFD).

Here is a great survey of existing process-models/methodologies/frameworks that culminates in pulling out the key elements of most significant methodologies.

@Susan - one of the surveyed methods by Maraban attempts a mapping to ISO12207 and IEEE1074 standards for software engineering.

@Ajit - I am game to join the worthy cause of renewing CRISP-DM for IoT and Big-data - we are already connected on LinkedIn so PM me if interested.

Comment by ajit jaokar on August 6, 2016 at 5:43am

:) thanks Susan. This was the exact thought process I followed as well! CRISP has not been updated since 99 .. but still a good framework which we can extend. ASUM - very IBM specific .. ie ties to their product set. but if we(ie me and JJ and a couple more of our dev team) could create an open source(and vendor agnostic) version - that may have some value. many thanks for your feedback and very happy to share more. We expect to release our early github repos in the fall. (I have sent you a linkedin request - so can email you in advance) kind rgds Aj

Comment by Susan M. Meyer on August 6, 2016 at 5:29am

Yes, it will take years to build out IoT standards if we look at the way these have developed in existing industry verticals like those you mentioned--telecoms and so on.  I come from the payments space with a similar story. 

Sometimes when I've presented CRISP-DM as a baseline methodology, I've gotten questions about why it hasn't been updated since 1999, aside from IBM's ASUM-DM update. Open methodologies such as your framework will help us build analytics teams that can hit the ground running & communicate their requirements to analytics support teams...so this is indeed a welcome proposal. 

Comment by ajit jaokar on August 4, 2016 at 10:35pm

thanks JJ. One more thing to add @susan - from my experience in telecoms .. like telecoms we have a situation in IoT where multiple verticals converge. each have standards in their own domain but no one has standards across domains  ex for mobile payments - NFC(standardized by the transport guys) - SIM(by telecoms) and RFID(historically by the supply chain folks) - but making it all work together across verticals is hard .. so the idea here is for IoT analytics - to operate both at a process/design/systems level(evolve CRISP-DM) but also look at the actual implementations through open source (see the previous post from me on DSC for this)

welcome comments! 

Comment by Jean-Jacques Bernard on August 4, 2016 at 10:29pm

@susan & @Ajit, indeed, it would be nice if standard setting bodies could catch up on this, and we would be happy to contribute, but as far as we are concerned, the goal is to get something done that we will detail and refine thanks to the feedback of the community.

Comment by ajit jaokar on August 4, 2016 at 9:50pm

@susan - thanks for your comments. from our point of view: we take an Open source perspective ie will create the implementations in Open source and let the community hack them. I am not quite sure where the standards bodies are at on this .. but for us .. its all about capturing and encapsulating a body of knowledge in an open source implementation and letting it evolve dynamically. JJ can add more also. happy to share more as we develop this. 

Comment by Susan M. Meyer on August 4, 2016 at 9:59am

Thank you! Glad to see CRISP-DM cited as a methodology.  When will a standards group emerge to update it so that it evolves into an ISO standard?
