Hello, friends. This blog post presents an interpretable design for unsupervised, real-time anomaly detection on high-dimensional heterogeneous or homogeneous time series data, built on a deep convolutional neural network and LSTM autoencoders.
What’s new in MSDA v1.10.0?
MSDA is an open source, low-code time-series library in Python that aims to reduce the hypothesis-to-insights cycle time in time-series, multi-sensor data analysis and experiments. It is easy to use and lets users run end-to-end proof-of-concept experiments quickly and efficiently. The module identifies events in multidimensional time series by capturing variation and trend, and uses them to establish relationships between features, which helps with feature selection from raw sensor signals. It also detects anomalies in real-time streaming data: an unsupervised deep convolutional neural network detector and an LSTM autoencoder based detector are provided, both designed to run on GPU or CPU. Finally, a game theoretic approach is used to explain the output of the built anomaly detector model.
The package includes:-
MSDA is an open source library that anybody can use. In our view, the ideal target audience of MSDA is:
What is an anomaly, and why should it be of any concern? In layman's terms, “anomalies” or “outliers” are the data points in a data space that are abnormal, or out of trend. Anomaly detection focuses on identifying examples in the data that deviate from what is expected or typical. Now, the question is, “How do you decide that something is abnormal or an outlier?” The quick answer is: all those points that don’t follow the trend of the neighboring points in the sample space.
For any business domain, detecting suspicious patterns in a huge set of data is critical. In the banking domain, for example, fraudulent transactions pose a serious threat and cause losses and liabilities to the bank. In this blog, we will learn to detect anomalies without training the model on labeled anomalies beforehand, because you can’t train a model on anomalies you don’t yet know about! That’s where unsupervised learning helps. We will see two network architectures for building a real-time anomaly detector: a) Deep CNN and b) LSTM AutoEncoder.
These networks are suited to detecting a wide range of anomalies, i.e., point anomalies, contextual anomalies, and discords in time series data. Since the approach is unsupervised, it requires no labels for anomalies. We use the unlabeled data to learn the data distribution, which is then used to forecast the normal behavior of a time series. The first architecture is inspired by the IEEE paper DeepAnT and consists of two components: a time series predictor and an anomaly detector. The time series predictor uses a deep convolutional neural network (CNN) to predict the next time stamp on the defined horizon: it takes a window of the time series (used as a reference context) and attempts to predict the time stamp that follows it. The predicted value is then passed to the anomaly detector component, which is responsible for labeling the corresponding time stamp as Non-Anomaly or Anomaly.
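The predictor/detector split described above can be sketched in a few lines of plain Python. This is only an illustration, not the DeepAnT or MSDA implementation: `predict_next` is a hypothetical stand-in that repeats the last observation where the real system would call the trained CNN, and the Euclidean distance between prediction and observation is assumed as the anomaly score. The series and the threshold are made up.

```python
import math

def predict_next(window):
    """Hypothetical stand-in for the CNN predictor: repeat the last observation."""
    return list(window[-1])

def anomaly_score(predicted, actual):
    """Euclidean distance between the predicted and the observed time stamp."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)))

def detect(series, lookback, threshold):
    """Label every time stamp after the first `lookback` ones."""
    labels = []
    for t in range(lookback, len(series)):
        window = series[t - lookback:t]  # reference context
        score = anomaly_score(predict_next(window), series[t])
        labels.append("Anomaly" if score >= threshold else "Non-Anomaly")
    return labels

# Two features per time stamp; the fifth reading jumps away from its neighbours.
series = [[0.1, 0.2], [0.1, 0.2], [0.1, 0.3], [0.2, 0.3], [0.9, 0.9], [0.2, 0.3]]
labels = detect(series, lookback=3, threshold=0.5)
print(labels)
```

With this naive stand-in, the time stamp right after the spike is flagged as well, because the spike itself becomes the "prediction"; a trained predictor is meant to avoid exactly that.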
The second architecture is inspired by the Nature paper Deep LSTM-based Stacked Autoencoder for Multivariate Time Series.
Let us first understand what an autoencoder neural network is. The autoencoder architecture is used to learn an efficient data representation in an unsupervised manner. An autoencoder has three components: an encoder (input) portion that compresses the data and, in the process, learns a representation (encoding) of it; a bottleneck that holds the compressed, reduced-size data; and a decoder (output) portion that reconstructs the input from the compressed representation as closely as possible, while minimizing the overall loss function. Put simply, when data is fed into an autoencoder, it is encoded and compressed down to a smaller size, and that smaller representation is then decoded back to the original input.
Next, why is LSTM appropriate here, and what is it? Long short-term memory (LSTM) is a neural network architecture capable of learning order dependencies in sequence prediction problems. An LSTM network is a type of recurrent neural network (RNN). A plain RNN mainly suffers from vanishing gradients: gradients carry the learning signal back through time, and when they shrink toward zero over long sequences, the information from earlier time steps is lost. This is where LSTM comes in handy, as its cell state helps preserve that information. The basic idea is that the LSTM network has multiple “gates” inside it with trained parameters. Some of these gates control the module’s “output” and other gates control “forgetting.” LSTM networks are a good fit for classifying, processing, and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series.
An LSTM Autoencoder is an implementation of an autoencoder for sequence data using an Encoder-Decoder LSTM network architecture.
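To make the role of the gates concrete, here is a minimal sketch of a single LSTM step for one unit with scalar input and state, using the standard LSTM equations. The parameter names and values are hand-picked for illustration (nothing here is trained or taken from MSDA): the forget gate is held open and the input gate is held shut, so the cell state should survive a run of unrelated inputs almost unchanged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell with scalar input and state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old state, add part of the new
    h = o * math.tanh(c)     # expose part of the cell state as output
    return h, c

# Hand-picked (not trained) parameters: forget gate wide open, input gate shut.
params = dict(wf=0.0, uf=0.0, bf=10.0,
              wi=0.0, ui=0.0, bi=-10.0,
              wo=0.0, uo=0.0, bo=0.0,
              wg=1.0, ug=0.0, bg=0.0)

h, c = 0.0, 0.8              # start with a remembered value of 0.8
for x in [5.0, -3.0, 2.0, 7.0]:
    h, c = lstm_step(x, h, c, params)
print(round(c, 3))           # the cell state stays very close to 0.8
```

In a trained network the gate values depend on the input and the previous hidden state, so the cell decides step by step what to keep and what to overwrite.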
Now that we have seen the basic concepts of each network, let us go through the design of both networks, shown below. The DeepCNN consists of two convolutional layers. Typically, a CNN consists of a sequence of layers that includes convolutional layers, pooling layers, and fully connected layers. Each convolutional layer normally has two stages. In the first stage, the layer performs the mathematical operation called convolution, which results in linear activations. In the second stage, a non-linear activation function is applied to each linear activation. Like other neural networks, the CNN uses training data to adapt its parameters (weights and biases) to perform the learning task. The parameters of the network are optimized using the ADAM optimizer. The kernel size and the number of filters can be tuned further to perform better depending on the dataset. The dropout, learning rate, etc. can also be fine-tuned to validate the performance of the network. The loss function used was the MSELoss (squared L2 norm), which measures the mean squared error between each element in the input ‘x’ and target ‘y’.
The LSTMAENN consists of multiple stacked LSTM layers, each configured with input_size (the number of expected features in the input x), hidden_size (the number of features in the hidden state h), num_layers (the number of recurrent layers, default 1), etc. For more details refer here. To reduce the chance of interpreting noise in the data as anomalies, we can tune additional hyper-parameters such as ‘lookback’ (the time series window size), the number of units in the hidden layers, and many more.
(conv1d_1_layer): Conv1d(10, 16, kernel_size=(3,), stride=(1,))
(maxpooling_1_layer): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv1d_2_layer): Conv1d(16, 16, kernel_size=(3,), stride=(1,))
(maxpooling_2_layer): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(dense_1_layer): Linear(in_features=80, out_features=40, bias=True)
(dropout_layer): Dropout(p=0.25, inplace=False)
(dense_2_layer): Linear(in_features=40, out_features=26, bias=True)
inspiration from this IEEE paper - https://ieeexplore.ieee.org/document/8581424
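The layer sizes in the DeepCNN summary above can be checked with a little arithmetic. The sketch below uses the standard output-length formula for 1-D convolution and pooling layers. It assumes, as the summary suggests, that the 10 lookback steps enter as input channels and that the convolution slides over the 26 numerical features; that reading is an inference from the printed shapes, not something stated by the library.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution or pooling layer (floor mode)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

length = 26                                                # positions seen by conv1d_1_layer
length = conv1d_out_len(length, kernel_size=3)             # conv1d_1_layer     -> 24
length = conv1d_out_len(length, kernel_size=2, stride=2)   # maxpooling_1_layer -> 12
length = conv1d_out_len(length, kernel_size=3)             # conv1d_2_layer     -> 10
length = conv1d_out_len(length, kernel_size=2, stride=2)   # maxpooling_2_layer -> 5
flattened = 16 * length                                    # 16 output channels
print(flattened)                                           # matches in_features=80
```

If the lookback size or kernel size is changed, the same arithmetic gives the in_features value the first dense layer would need.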
(lstm_1_layer): LSTM(26, 128)
(dropout_1_layer): Dropout(p=0.2, inplace=False)
(lstm_2_layer): LSTM(128, 64)
(dropout_2_layer): Dropout(p=0.2, inplace=False)
(lstm_3_layer): LSTM(64, 64)
(dropout_3_layer): Dropout(p=0.2, inplace=False)
(lstm_4_layer): LSTM(64, 128)
(dropout_4_layer): Dropout(p=0.2, inplace=False)
(linear_layer): Linear(in_features=128, out_features=26, bias=True)
inspiration from here - https://www.nature.com/articles/s41598-019-55320-6
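The hourglass shape of the LSTMAENN summary above (128, 64, 64, 128 hidden units, then a linear layer back to 26 features) can be made tangible by counting trainable parameters per layer. This is a rough sketch that assumes PyTorch's `nn.LSTM` parameter layout (four gates, two bias vectors per layer) with one unidirectional layer per module; the helper names are made up for this example.

```python
def lstm_params(input_size, hidden_size):
    """Trainable parameters of a one-layer, unidirectional PyTorch-style LSTM."""
    gates = 4  # input, forget, cell candidate, output
    weights = gates * hidden_size * (input_size + hidden_size)
    biases = 2 * gates * hidden_size  # PyTorch keeps bias_ih and bias_hh
    return weights + biases

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

layers = [(26, 128), (128, 64), (64, 64), (64, 128)]  # (input_size, hidden_size)
per_layer = [lstm_params(i, h) for i, h in layers]
total = sum(per_layer) + linear_params(128, 26)
print(per_layer, total)
```

The two 64-unit layers in the middle are the narrowest (and, together, the cheapest) part of the network, which is where the compressed representation lives.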
Now that we have designed the network architectures, we will go through the remaining steps with a hands-on demonstration.
The easiest way to install msda is using pip.
pip install msda
$ git clone https://github.com/ajayarunachalam/msda
$ cd msda
$ python setup.py install
!pip install msda
Here, we will use climate data compiled from several public sources. The dataset consists of daily temperatures and precipitation from 13 Canadian centres. Precipitation is either rain or snow (likely snow in winter months). In 1940, there is daily data for seven of the 13 centres, but by 1960 there is daily data from all 13 centres, with the occasional missing value. We have around 80 years of records (at daily frequency), and we want to identify the anomalies in that climate data. This data has 27 features and around 30K records.
import pandas as pd
df = pd.read_csv('Canadian_climate_history.csv')
We start by checking for missing values, and impute those missing values.
The functions missing() and impute() from the Preprocessing & ExploratoryDataAnalysis classes can be used to find missing values and to fill in the missing information. We replace the missing values with the mean values (hence, modes=1). There are several utility functions within these classes that can be used for profiling your dataset, manual filtering of outliers, etc. Other options include datetime conversions, descriptive statistics of the data, a normality distribution test, etc. For more details peek here.
Impute missing values with impute function (modes=0,1, 2, else use backfill)
0: impute with zero, 1: impute with mean, 2: impute with median, else impute with backfill method
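The four imputation modes listed above can be illustrated on a single column. The sketch below is a plain-Python stand-in written for this post, not the MSDA impute() code: `impute_column` is a hypothetical helper, missing values are represented as None, and the readings are made up.

```python
from statistics import mean, median

def impute_column(values, modes):
    """Fill None entries: 0 -> zero, 1 -> mean, 2 -> median, anything else -> backfill."""
    observed = [v for v in values if v is not None]
    if modes == 0:
        fill = 0.0
    elif modes == 1:
        fill = mean(observed)
    elif modes == 2:
        fill = median(observed)
    else:
        # Backfill: walk from the end and carry the next valid value backwards.
        filled, nxt = [], None
        for v in reversed(values):
            nxt = v if v is not None else nxt
            filled.append(nxt)
        return filled[::-1]
    return [fill if v is None else v for v in values]

temps = [2.0, None, 4.0, None, 9.0]   # made-up daily readings with gaps
print(impute_column(temps, modes=1))  # gaps filled with the mean of 2, 4 and 9
print(impute_column(temps, modes=3))  # gaps filled with the next valid reading
```

Note that backfill leaves a trailing gap unfilled, because there is no later value to copy from.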
Next, we pass in the data with no missing values, remove unwanted fields, and assert the timestamp field. The user specifies the column to drop by its index value, and the timestamp field by its index value too. This returns two dataframes: one has all the numerical fields without the timestamp index, while the other has all the numerical fields indexed by timestamp. We need the one with the timestamp as index for the further steps.
Anamoly.read_data(data=df_no_na, column_index_to_drop=0, timestamp_column_index=0)
The time window size (lookback size) is given as input to the function data_pre_processing from the Anamoly class.
X,Y,timesteps,X_data = Anamoly.data_pre_processing(df=anamoly_df, LOOKBACK_SIZE=10)
With this function, we are also normalizing the data within the range of [0,1] and then modifying the dataset by including ‘time-steps’ as an additional dimension. The idea is to convert the two-dimensional data set of dimension [Batch Size, Features] into a three-dimensional data set of dimension [Batch Size, Lookback Size, Features]. For more details inspect here.
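The two steps just described, scaling to [0,1] and adding the lookback dimension, can be sketched without any library. This is a simplified illustration rather than the data_pre_processing source: the helper names are made up, and it assumes the target Y for each window is the time stamp immediately after it.

```python
def min_max_scale(rows):
    """Scale every feature (column) of a 2-D list into [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def make_windows(rows, lookback):
    """Turn [samples, features] into X: [batch, lookback, features], Y: [batch, features]."""
    X, Y = [], []
    for t in range(lookback, len(rows)):
        X.append(rows[t - lookback:t])  # the previous `lookback` time stamps
        Y.append(rows[t])               # the time stamp to predict
    return X, Y

raw = [[10.0, 0.0], [12.0, 1.0], [14.0, 0.0], [16.0, 3.0], [20.0, 2.0]]  # made-up data
scaled = min_max_scale(raw)
X, Y = make_windows(scaled, lookback=3)
print(len(X), len(X[0]), len(X[0][0]))  # batch size, lookback size, features
```

Five samples with a lookback of 3 leave two windows, since the first three time stamps only ever serve as context.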
Using the set_config() function, the user can select the deep network architecture, set the time window size, and tune the kernel size. The available models are the Deep Convolutional Neural Network and the LSTM autoencoder, selected with the values ‘deepcnn’ and ‘lstmaenn’ respectively. We choose a time-series window size of 10 and a kernel size of 3 for the convolutional network.
MODEL_SELECTED, LOOKBACK_SIZE, KERNEL_SIZE = Anamoly.set_config(MODEL_SELECTED='deepcnn', LOOKBACK_SIZE=10, KERNEL_SIZE=3)
MODEL_SELECTED = deepcnn
LOOKBACK_SIZE = 10
KERNEL_SIZE = 3
One can train the model on either GPU or CPU, based on availability. The compute function will use the GPU if one is available; otherwise, it will use the CPU. Google Colab has offered the NVIDIA Tesla K80, a widely used GPU, while the NVIDIA Tesla V100 is the first Tensor Core GPU. The number of epochs for training can be set by the user. The device being used is printed to the console.
Anamoly.compute(X, Y, LOOKBACK_SIZE=10, num_of_numerical_features=26, MODEL_SELECTED=MODEL_SELECTED, KERNEL_SIZE=KERNEL_SIZE, epocs=30)
Training Loss: 0.2189370188678473 - Epoch: 1
Training Loss: 0.18122351250783636 - Epoch: 2
Training Loss: 0.09276176958476466 - Epoch: 3
Training Loss: 0.04396845106961693 - Epoch: 4
Training Loss: 0.03315385463795454 - Epoch: 5
Training Loss: 0.027696743746250377 - Epoch: 6
Training Loss: 0.024318942805264566 - Epoch: 7
Training Loss: 0.021794179179027335 - Epoch: 8
Training Loss: 0.019968783528812286 - Epoch: 9
Training Loss: 0.0185430530715746 - Epoch: 10
Training Loss: 0.01731374272046384 - Epoch: 11
Training Loss: 0.016200231966590112 - Epoch: 12
Training Loss: 0.015432962290901867 - Epoch: 13
Training Loss: 0.014561152689542462 - Epoch: 14
Training Loss: 0.013974714691690522 - Epoch: 15
Training Loss: 0.013378228182289321 - Epoch: 16
Training Loss: 0.012861106097943028 - Epoch: 17
Training Loss: 0.012339938251426095 - Epoch: 18
Training Loss: 0.011948177564954476 - Epoch: 19
Training Loss: 0.011574006228333366 - Epoch: 20
Training Loss: 0.011185694509874397 - Epoch: 21
Training Loss: 0.010946418002639517 - Epoch: 22
Training Loss: 0.010724217305010896 - Epoch: 23
Training Loss: 0.010427865211985524 - Epoch: 24
Training Loss: 0.010206768034701313 - Epoch: 25
Training Loss: 0.009942568653453904 - Epoch: 26
Training Loss: 0.009779498535478721 - Epoch: 27
Training Loss: 0.00969111187656911 - Epoch: 28
Training Loss: 0.009527427295318766 - Epoch: 29
Training Loss: 0.009236675929400544 - Epoch: 30
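The GPU-or-CPU fallback mentioned before the training run follows a common PyTorch idiom. The sketch below is an assumption about how such a check is typically written, not the body of the compute function; `pick_device` is a hypothetical helper, and it falls back to the CPU when PyTorch is not installed so that the snippet runs anywhere.

```python
import importlib.util

def pick_device():
    """Return "cuda" when PyTorch is installed and sees a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch not installed: nothing to select
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Training on: {device}")
```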
Once the training is completed, the next step is to find the anomalies. This brings us back to our fundamental question: how exactly can we estimate and trace what is an anomaly? One can use Anomaly Score, Anomaly Likelihood, and some more recent metrics such as the Mahalanobis distance-based confidence score. The Mahalanobis confidence score assumes that the intermediate features of pre-trained neural classifiers follow class conditional Gaussian distributions whose covariances are tied for all distributions, and the confidence score for a new input is defined as the Mahalanobis distance from the closest class conditional distribution. Anomaly Score is the fraction of active columns that were not predicted correctly. In contrast, Anomaly Likelihood is the likelihood that a given anomaly score represents a true anomaly. In any dataset, there is a natural level of uncertainty that creates a certain “normal” number of errors in prediction, and anomaly likelihood accounts for this natural level of error. Since we don’t have ground truth anomaly labels, we cannot use this metric in our case. Instead, find_anamoly() detects anomalies by generating the hypothesis and calculating losses, which serve as the anomaly confidence scores for the individual time stamps in the data set.
loss_df = Anamoly.find_anamoly(loss=loss, T=timesteps)
Next, we need to visualize the anomalies. Each timestamp record is assigned an anomaly confidence score. The plot_anamoly_results function can be used to plot the distribution of anomaly scores as frequencies (bins), and the confidence score for every timestamp record.
From the above graphs, one can presume that the timestamps/instances with anomaly confidence scores greater than or equal to 1.2 are likely examples that deviate from what is expected or typical, and thus can be treated as potential anomalies.
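Reading a cut-off from the score distribution and applying it is simple to express in code. The sketch below uses made-up scores and hypothetical helper names; only the 1.2 threshold comes from the discussion above. It bins the scores the way a histogram would and then flags the time stamps at or above the threshold.

```python
def flag_anomalies(scores, threshold=1.2):
    """Return the indices of time stamps whose score is at or above the threshold."""
    return [t for t, s in enumerate(scores) if s >= threshold]

def histogram(scores, bin_width=0.5):
    """Count scores per bin; the key is the bin index, i.e. floor(score / bin_width)."""
    counts = {}
    for s in scores:
        b = int(s // bin_width)
        counts[b] = counts.get(b, 0) + 1
    return counts

scores = [0.21, 0.35, 0.18, 1.47, 0.40, 1.20, 0.29]  # made-up scores, one per time stamp
flagged = flag_anomalies(scores)
bins = histogram(scores)
print(flagged, bins)
```

The threshold is a judgment call made from the plots; a different dataset or model would call for a different value.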
Finally, a prototype of Explainable AI for the built time-series predictor is designed. Before we go through this step, let us understand why interpretable or explainable models are needed.
Data is everywhere, and machine learning can mine it for information. Representation learning becomes more valuable and significant if the results generated by machine learning models can also be easily understood, interpreted, and trusted by humans. That is where Explainable AI comes in, so that the model is no longer a black box.
The explainable_results() function uses a game theoretic approach to explain the output of the model. To understand, interpret, and trust the results of the deep models at the level of individual samples, we use the Kernel Explainer. One of the fundamental properties of Shapley values is that they always sum up to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all the input features will always sum up to the difference between the baseline (expected) model output and the current model output for the prediction being explained. The explainable_results function takes the index of the specific row/instance/sample whose prediction is to be interpreted. It also takes the number of input features (X), and the time-series window size difference (Y). We can get the explainable results at the level of an individual instance, and also for a batch of data (for example, the first 200 rows, the last 50 samples, etc.).
Anamoly.explainable_results(X=anamoly_data, Y=Y, specific_prediction_sample_to_explain=10,input_label_index_value=16, num_labels=26)
The above graph is the result for the 10th example/sample/record/instance. It can be seen that the features contributing most to the resulting anomaly confidence score are the temperature readings from the weather stations of Vancouver, Toronto, Saskatoon, Winnipeg, and Calgary.
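The additivity property of Shapley values described above can be verified on a toy game. The sketch below computes exact Shapley values by averaging each player's marginal contribution over all orderings; the two "features" and their payoffs are made-up numbers. SHAP's Kernel Explainer estimates such values by sampling coalitions rather than enumerating them, so this is an illustration of the property, not of the explainer itself.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            phi[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: total / len(orderings) for p, total in phi.items()}

# Toy "model output" for each subset of present features (made-up numbers).
payoff = {
    frozenset(): 0.2,                  # baseline: no feature present
    frozenset({"temp"}): 0.7,
    frozenset({"rain"}): 0.4,
    frozenset({"temp", "rain"}): 1.3,  # prediction with all features present
}
phi = shapley_values(["temp", "rain"], lambda s: payoff[frozenset(s)])
print(phi)
print(sum(phi.values()), payoff[frozenset({"temp", "rain"})] - payoff[frozenset()])
```

The two printed numbers on the last line agree up to floating-point rounding: the contributions add up to the gap between the baseline output and the explained prediction.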
Feel free to connect.
Always, keep learning. Knowledge is the beginning of wisdom :)