I was given 3 GB of Machine Generated data being fed by 120 sensors (5 records every second) in an excel format. The task in hand was to mine out interesting patterns, if any, from the data.

I fed the data in R in my local machine and performed various descriptive and exploratory analysis to have some insights. Customer was also looking for some low cost maintenance mechanisms for their machines. So I thought if I could study the outliers and provide some information about system health. This could also be monitored real time using dashboards and if possible could forecast at a near future time point for early alarm and predictive maintenance.

So this became a case of outlier detection in 120 dimensional space. Now, as I studied, values in around 90 columns were found to be constant over the entire time period and were contributing nothing towards system noise. So I dropped them.

Mahalanobis Distance is an age old time tested distance function in a multivariate setup. It weighs the individual euclidean distances with the inverse of the sample variance matrix.

Mahalanobis Distance follows F-Distribution when properly scaled. It requires Normality assumption too.

My first attempt to calculate the Mahalanobis Distances from the central value failed due to singular variance matrix. I found 6 pairs of highly correlated columns that I then dropped, one from each pair, from analysis to reduce multicollinearity. The piece of code ran successfully.

The above empirical distribution of p-values shows in around 7% of entire time points the machine / system is in extreme state producing outliers.

A two hour window is deployed to calculate the Outlier Percentages and Lead Regression is used for forecasting.

Performance of Logistic Regression and Decision Tree as the classifiers of the outliers in forecasting outlier percentages could be visualized above.

The entire game was pretty satisfying.

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central