This blog post is a review of two books. Both were written by Ted Dunning and Ellen Friedman, are published by O'Reilly, and are available for free from the MapR site: *Time Series Databases: New Ways to Store and Access Data* and *A New Look at Anomaly Detection*.

The MapR platform is a key part of the Data Science for the Internet of Things (IoT) course – University o... and I will be covering these issues in my course.

In this post, I discuss the significance of time series databases from an IoT perspective, based on my review of these books. Specifically, I discuss classification and anomaly detection, which often go together in typical IoT applications. The books are easy to read, with analogies like HAL (from 2001: A Space Odyssey), and I recommend them.

The idea of time series data is not new. Historically, time series data could be stored even in simple structures like flat files. The difference now is the huge volume of data and the future applications made possible by collecting it, especially for IoT. These large-scale time series databases and applications are the focus of the first book. Large-scale time series applications typically need a NoSQL database such as Apache Cassandra, Apache HBase or MapR-DB. The book’s focus is on Apache HBase and MapR-DB for the collection, storage and access of large-scale time series data.

Essentially, time series data involves measurements or observations of events as a function of the time at which they occurred. The airline ‘black box’ is a good example of time series data. The black box records data many times per second for dozens of parameters throughout the flight, including altitude, flight path, engine temperature and power, indicated air speed, fuel consumption, and control settings. Each measurement includes the time it was made. The analogy applies to sensor data. With the proliferation of IoT, time series data is becoming more common and universal. The data acquired through sensors is typically stored in time series databases. A time series database (TSDB) is optimized for queries based on a range of time.
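To make the idea of a query over a range of time concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `TinySeries` class and the altitude readings are hypothetical, and a real TSDB such as those discussed in the book stores data very differently. The point is simply that keeping readings ordered by time lets a range query locate its start and end by binary search instead of scanning everything.

```python
from bisect import bisect_left, bisect_right


class TinySeries:
    """Toy in-memory series kept sorted by timestamp (hypothetical, for illustration only)."""

    def __init__(self):
        self._times = []   # timestamps in seconds, always sorted
        self._values = []  # readings aligned with _times

    def append(self, ts, value):
        # Find the slot that keeps timestamps sorted; in-order arrivals land at the end.
        i = bisect_right(self._times, ts)
        self._times.insert(i, ts)
        self._values.insert(i, value)

    def range(self, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        lo = bisect_left(self._times, start)
        hi = bisect_left(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))


# Made-up altitude readings, one per second.
altitude = TinySeries()
for ts, metres in [(0, 0.0), (1, 12.5), (2, 30.0), (3, 52.0), (4, 80.5)]:
    altitude.append(ts, metres)

window = altitude.range(1, 4)  # readings taken at seconds 1, 2 and 3
```

Each measurement carries its own timestamp, as in the black box example, so the time range alone is enough to select the readings of interest.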

Time series databases apply to many IoT use cases, for example:

- **Trucking**, to reduce taxes according to how much trucks drive on public roads (which sometimes incur a tax). It’s not just a matter of *how many* miles a truck drives but rather *which* miles.
- **A smart pallet** can be a source of time series data that might record events of interest, such as when the pallet was filled with goods, when it was loaded or unloaded from a truck, when it was transferred into storage in a warehouse, or even the environmental parameters involved, such as temperature.
- Similarly, **commercial waste containers**, called dumpsters in the US, could be equipped with sensors to report on how full they are at different points in time.
- **Cell tower traffic** can also be modelled as a time series, and anomalies such as flash crowd events can be used to provide early warning.
- **Data center monitoring** can be modelled as a time series to predict outages and plan upgrades.
- Similarly, **satellites, robots and many more devices** can be modelled as time series data.

From the readings captured in a time series database, we can derive analytics such as:

**Prognosis:** What are the short- and long-term trends for some measurement or ensemble of measurements?

**Introspection:** How do several measurements correlate over a period of time?

**Prediction:** How do I build a machine-learning model based on the temporal behaviour of many measurements correlated to externally known facts?

**Introspection:** Have similar patterns of measurements preceded similar events?

**Diagnosis:** What measurements might indicate the cause of some event, such as a failure?
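Two of these questions, the trend of a measurement and the correlation between measurements, can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions: the engine temperature and fuel rate readings are invented, and least-squares slope and Pearson correlation are just one simple way to answer each question, not necessarily what the book uses.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def trend_slope(times, values):
    """Least-squares slope of values against time (value units per time unit)."""
    t_bar, v_bar = mean(times), mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented hourly readings from two sensors on the same engine.
hours = [0, 1, 2, 3, 4, 5]
engine_temp = [70.0, 72.0, 74.0, 76.0, 78.0, 80.0]
fuel_rate = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]

slope = trend_slope(hours, engine_temp)  # trend: degrees gained per hour
r = pearson(engine_temp, fuel_rate)      # how closely the two sensors move together
```

In this made-up data the temperature rises by two degrees every hour and the fuel rate rises in step with it, so the slope is about 2 and the correlation is close to 1.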

The books give examples of the use of anomaly detection and classification for IoT data.

For time series IoT readings, anomaly detection and classification go together. Anomaly detection determines what normal looks like and how to detect deviations from normal.

When searching for anomalies, we don’t know in advance what their characteristics will be. Once we know the characteristics, we can use a different form of machine learning, namely classification.

Anomaly in this context just means different than expected; it does not refer to desirable or undesirable. Anomaly detection is a discovery process to help you figure out what is going on and what you need to look for. The anomaly-detection program must discover interesting patterns or connections in the data itself.

Anomaly detection and classification go together when it comes to finding a solution to real-world problems. Anomaly detection is used first, in the discovery phase. You could use the anomaly-detection model to spot outliers, then set up an efficient classification model to assign new examples to the categories you’ve already identified. You then update the anomaly detector to consider these new examples as normal and repeat the process.
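The loop described above (detect, label, classify, update the detector) can be sketched as follows. This is a minimal illustration and not the books' method: the temperature readings, the `overheating` label, the three-standard-deviation cut-off and the nearest-mean classifier are all assumptions I have chosen to keep the example short.

```python
from statistics import mean, stdev


def fit(samples):
    """Summarise one known pattern by its mean and standard deviation."""
    return mean(samples), stdev(samples)


def is_anomaly(x, known, z_cut=3.0):
    """A reading is anomalous when it is far from every pattern known so far."""
    return all(abs(x - mu) > z_cut * sigma for mu, sigma in known.values())


def classify(x, known):
    """Assign a reading to the known pattern with the nearest mean."""
    return min(known, key=lambda label: abs(x - known[label][0]))


# Discovery phase: only "normal" temperatures are known at the start.
normal = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.0, 19.7]
known = {"normal": fit(normal)}

incoming = [20.1, 35.0, 19.9, 34.6, 35.3, 34.8]
outliers = [x for x in incoming if is_anomaly(x, known)]

# A person inspects the outliers and names the category (hypothetical label).
known["overheating"] = fit(outliers)

# New readings are now assigned to the identified categories by the classifier.
labels = [classify(x, known) for x in [34.9, 20.4]]

# The updated detector treats the labelled pattern as known and only flags novel behaviour.
still_flagged = [x for x in incoming if is_anomaly(x, known)]
novel = is_anomaly(5.2, known)
```

After the update, the readings around 35 degrees are no longer flagged, while a reading of 5.2 is still reported as something new to investigate, and the cycle repeats.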

The book goes on to give examples of the use of these techniques on EKG (electrocardiogram) data.

For example, for the challenge of finding an approachable, practical way to model normal for a very complicated curve such as the EKG, we could use a type of machine learning known as deep learning.

Deep learning involves letting a system learn in several layers, in order to deal with large and complicated problems in approachable steps. Curves such as the EKG have repeated components separated in time rather than superposed. We take advantage of the repetitive and separated nature of an EKG curve to model its complicated shape accurately and so learn what normal patterns look like.
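To show why the repetitive, separated shape of such a curve helps, here is a deliberately simplified Python sketch. It is not deep learning and not the book's model: I assume a synthetic signal with a fixed beat length, use a single point-wise median beat as the model of normal, and score each beat by how badly that template reconstructs it. All names and numbers are hypothetical.

```python
import math
from statistics import median

BEAT = 20  # samples per beat (assumed fixed period for this toy signal)


def beat_shape(i):
    """One synthetic 'heartbeat': a narrow spike on a flat baseline."""
    return math.exp(-((i - BEAT // 2) ** 2) / 4.0)


# Ten identical beats, then flatten beat number 7 to simulate a missed beat.
signal = [beat_shape(i % BEAT) for i in range(BEAT * 10)]
for i in range(7 * BEAT, 8 * BEAT):
    signal[i] = 0.0

# Cut the signal into one window per beat.
windows = [signal[k:k + BEAT] for k in range(0, len(signal), BEAT)]

# Model of normal: the point-wise median beat (robust to the single bad beat).
template = [median(w[j] for w in windows) for j in range(BEAT)]

# Reconstruction error per beat: squared distance between the beat and the template.
errors = [sum((a - b) ** 2 for a, b in zip(w, template)) for w in windows]
worst = max(range(len(errors)), key=errors.__getitem__)
```

Because normal beats repeat, the template reconstructs them almost perfectly and the flattened beat stands out with by far the largest error. A real EKG has variable beat lengths and much richer shapes, which is where the layered approach described above comes in.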

The book also describes a data structure called *t*-digest for the accurate calculation of extreme quantiles. *t*-digest was developed by one of the authors, Ted Dunning, as a way to accurately estimate extreme quantiles for very large data sets with limited memory use. This capability makes *t*-digest particularly useful for selecting a good threshold for anomaly detection. The *t*-digest algorithm is available in Apache Mahout as part of the Mahout math library. It’s also available as open source at *https://github.com/tdunning/t-digest*
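To illustrate what selecting a threshold from an extreme quantile means, here is a small Python sketch. Note the assumption loudly: this code does not use *t*-digest. It computes the quantile exactly by sorting, which is fine for a few thousand values but is precisely what becomes impractical at very large scale, where a compact summary such as *t*-digest would stand in. The anomaly scores are synthetic and the 99.9% level is my own choice.

```python
import random


def exact_quantile(values, q):
    """Exact q-quantile by sorting (nearest rank). A stand-in for a streaming estimate."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, int(q * len(ordered))))
    return ordered[rank]


random.seed(42)
# Synthetic anomaly scores; in practice these would come from an anomaly-detection model.
scores = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Flag only the most extreme scores: everything above the 99.9th percentile.
threshold = exact_quantile(scores, 0.999)
flagged = [s for s in scores if s > threshold]
```

Setting the threshold at a high quantile keeps the alert rate fixed at roughly one in a thousand readings, whatever the scale of the scores happens to be.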

Anomaly detection is a complex field and needs a lot of data.

For example: what happens if you only save a month of sensor data at a time, but the critical events leading up to a catastrophic part failure happened six weeks or more before the event?

To conclude, much of the complexity of IoT analytics comes from the management of large-scale data.

Collectively, interconnected objects and the data they share make up the Internet of Things (IoT).

Relationships between objects and people, between objects and other objects, conditions in the present, and histories of their condition over time can be monitored and stored for future analysis, but doing so is quite a challenge.

However, the rewards are also potentially enormous. That’s where machine learning and anomaly detection can provide a huge benefit.

**For time series,** the book covers themes such as:

- Storing and Processing Time Series Data
- The Direct Blob Insertion Design
- Why Relational Databases Aren’t Quite Right
- Architecture of Open TSDB
- Value Added: Direct Blob Loading for High Performance
- Using SQL-on-Hadoop Tools
- Using Apache Spark SQL
- Advanced Topics for Time Series Databases (Stationary Data, Wandering Sources, Space-Filling Curves)

**For anomaly detection:**

- Windows and Clusters
- Anomalies in Sporadic Events
- Website Traffic Prediction
- Extreme Seasonality Effects
- etc.

Links again:

*Time Series Databases: New Ways to Store and Access Data* and *A New Look at Anomaly Detection* by Ted Dunning and Ellen Friedman (published by O'Reilly).

Also, the link for the Data Science for the Internet of Things (IoT) course – University o... where I hope to cover these issues in more detail in the context of MapR.

© 2019 Data Science Central ® Powered by
