
Discovery of Temporal Neighborhoods through Discretization Methods and Markov Model

This article describes a few approaches and algorithms for temporal neighborhood discretization, drawn from two papers accepted and published at the ICDM conference (2009) and in the journal IDA (2014), authored by me and my fellow professors Dr. Aryya Gangopadhyay and Dr. Vandana Janeja at UMBC. Almost all of the text and figures in this blog post are taken from the papers.

Neighborhood discovery is a precursor to knowledge discovery in complex and large datasets such as temporal data, i.e., a sequence of data tuples measured at successive time instants. Instead of mining the entire dataset, we are interested in dividing it into several smaller intervals of interest, which we call temporal neighborhoods. In this article we propose four novel algorithms to generate temporal neighborhoods through unequal-depth discretization:

  1. Similarity-based simple Merging (SMerg)
  2. Markov Stationary distribution based Merging (StMerg)
  3. Greedy Merging (GMerg)
  4. Optimal Merging with Dynamic Programming (OptMerg).

The next figure shows the schematic diagram of the four algorithms:
[Figure im0.png: schematic diagram of the four algorithms]

All four algorithms start by dividing a given time series into a user-chosen number of initial bins (with equal-frequency binning) and then proceed to merge similar bins (yielding a few, possibly unequal-width, bins) using different approaches. The algorithms terminate when the user-specified number of output (merged) bins is obtained.
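The initialization step can be sketched as follows; the function names and the index-pair representation of bins are illustrative, not taken from the papers:

```python
import numpy as np

def equal_frequency_bins(series, n_bins):
    """Split a time series into n_bins contiguous bins of (nearly) equal size.

    Returns a list of (start, end) index pairs over the series.
    """
    edges = np.linspace(0, len(series), n_bins + 1, dtype=int)
    return [(edges[i], edges[i + 1]) for i in range(n_bins)]

def summarize(series, bins):
    """Per-bin (mean, variance) summaries, later used as Markov states."""
    return [(np.mean(series[s:e]), np.var(series[s:e])) for s, e in bins]
```

Each bin is thus reduced to two numbers, which is what makes the distribution-based similarity comparisons below cheap to compute.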

The algorithms SMerg and StMerg both start by setting up a Markov Model with the initial bins as the states (each state summarized by the corresponding bin's mean and variance) and a transition matrix computed from the similarity between the bins (measured with a few different distribution-similarity metrics), as shown in the following figure:

[Figure im0_1.png: Markov Model with bins as states]

The following distribution-distance measures are used to compute the similarity between the bins (we assume the temporal data in each bin to be normally distributed with the bin's mean and variance).

[Figure im0_1_1.png: distribution-distance measures]
We discuss the results for the one-dimensional special case here; however, our approach generalizes to multidimensional temporal data.
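As one concrete instance, the KL divergence between two univariate Gaussians (one of the distance measures mentioned later for StMerg) has a closed form. The sketch below uses that form; the `similarity` transform mapping a distance into (0, 1] is an illustrative assumption, not necessarily the papers' exact conversion:

```python
import math

def kl_gaussian(mu0, var0, mu1, var1):
    """Closed-form KL divergence KL(N(mu0, var0) || N(mu1, var1))
    between two univariate Gaussians."""
    return (0.5 * math.log(var1 / var0)
            + (var0 + (mu0 - mu1) ** 2) / (2.0 * var1) - 0.5)

def similarity(mu0, var0, mu1, var1):
    """Assumed conversion of a distance into a similarity score in (0, 1]."""
    return 1.0 / (1.0 + kl_gaussian(mu0, var0, mu1, var1))
```

Note that KL divergence is asymmetric; a symmetrized variant (averaging both directions) is a common alternative when a symmetric bin similarity is needed.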

The Markov transition matrix is defined, as shown in the following figure, as a block-diagonal matrix with a given temporal window size w.

[Figure im_0_1_2.png: block-diagonal Markov transition matrix]
The following figure shows the pre-processing step Init-bin for setting up the Markov Model:

[Figure im1.png: Init-bin pre-processing step]
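A minimal sketch of building such a windowed, row-stochastic transition matrix; here the window is realized as a band of half-width w around each state, and the `sim` callback (taking the two bins' mean/variance summaries) is an assumed interface:

```python
import numpy as np

def transition_matrix(summaries, w, sim):
    """Transition matrix over bin-states: bin i can transition only to bins
    within a temporal window of size w, weighted by distribution similarity.

    summaries: list of (mean, variance) per bin; sim: similarity callback.
    """
    n = len(summaries)
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(max(0, i - w), min(n, i + w + 1)):
            T[i, j] = sim(*summaries[i], *summaries[j])
        T[i] /= T[i].sum()  # normalize each row into a probability distribution
    return T
```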

1. Similarity-based Merging (SMerg)

In this approach we iteratively merge highly similar bins (based on a threshold). At every iteration the transition matrix is recomputed, since the bin population has changed.

[Figure im2.png: the SMerg algorithm]
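A minimal sketch of this merge loop, assuming bins are (start, end) index pairs and `sim` compares two bins' mean/variance summaries; recomputing per-bin statistics each iteration here stands in for the paper's full transition-matrix recomputation:

```python
import numpy as np

def smerg(series, bins, sim, threshold, k):
    """Iteratively merge the most similar adjacent bin pair whose similarity
    exceeds `threshold`; stop at k bins or when no pair qualifies."""
    bins = list(bins)
    while len(bins) > k:
        stats = [(np.mean(series[s:e]), np.var(series[s:e])) for s, e in bins]
        sims = [sim(*stats[i], *stats[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmax(sims))
        if sims[i] < threshold:
            break                                  # no similar neighbors left
        bins[i] = (bins[i][0], bins[i + 1][1])     # merge bins i and i+1
        del bins[i + 1]
    return bins
```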

2. Markov Stationary distribution based Merging (StMerg)

Given the initial equal-depth bins and the transition matrix T, we compute the steady-state vector at the stationary distribution using the power-iteration method. Once we have this vector, we need to detect spikes in it, which indicate split points such that the data on either side of a split point is very dissimilar while the data between two split points is highly similar. To detect these spikes we use the Discrete Fourier Transform (DFT) and the Inverse Discrete Fourier Transform (IDFT).
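The power-iteration step can be sketched as follows (a standard computation, with tolerance and iteration-cap parameters chosen here for illustration):

```python
import numpy as np

def stationary_distribution(T, tol=1e-10, max_iter=10_000):
    """Power iteration: repeatedly apply the row-stochastic matrix T to a
    uniform start vector until the distribution stops changing."""
    pi = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(max_iter):
        nxt = pi @ T                      # one step of the chain
        if np.abs(nxt - pi).max() < tol:  # converged to the stationary vector
            break
        pi = nxt
    return nxt
```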

The following figure shows how the steady-state vector approximates the pattern in the original data for a few synthetically generated and public transportation datasets, thereby achieving data compression (since the number of states in the steady-state vector equals the number of output bins obtained after bin merging, which is much smaller than the number of initial bins).

[Figure im4.png: steady-state vector vs. original data]

[Figure im3.png: steady-state vector vs. original data]

3. Greedy Approach (GMerg)

This algorithm starts with the initial bins and merges the most similar adjacent bins right away (it does not require a Markov Model). It is greedy in the sense that it chooses the currently most similar intervals for merging, and hence may find a local optimum instead of a globally optimal merging.

[Figure im5.png: the GMerg algorithm]
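A sketch of the greedy loop, assuming bins are (start, end) index pairs and `dist` compares two bins' mean/variance summaries; unlike SMerg there is no threshold and no transition matrix, only local minimum-distance merges:

```python
import numpy as np

def gmerg(series, bins, dist, k):
    """Greedily merge the pair of adjacent bins at minimum distance
    until only k bins remain."""
    bins = list(bins)
    while len(bins) > k:
        stats = [(np.mean(series[s:e]), np.var(series[s:e])) for s, e in bins]
        d = [dist(*stats[i], *stats[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(d))                   # currently most similar pair
        bins[i] = (bins[i][0], bins[i + 1][1])  # merge bins i and i+1
        del bins[i + 1]
    return bins
```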

4. Optimal merging (OPTMerg)

Starting with the initial equal-depth intervals (bins), we introduce the notion of optimal merging by defining optimal partitioning, as described in the following figures:

[Figure im5_6.png: definition of optimal partitioning]
Next we use dynamic programming to find the optimal partition, as shown in the next figure:

[Figure im6.png: dynamic-programming formulation]

[Figure im7: the OPTMerg algorithm]
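The dynamic-programming idea can be sketched as follows. The within-interval sum of squared deviations used as the partition cost here is an illustrative choice, not necessarily the papers' exact optimality criterion:

```python
import numpy as np

def opt_partition(series, k):
    """Choose k intervals (k-1 split points) minimizing the total
    within-interval sum of squared deviations, via dynamic programming.

    dp[j][m] = best cost of partitioning the first j points into m intervals.
    """
    n = len(series)

    def sse(i, j):  # cost of the single interval series[i:j]
        seg = np.asarray(series[i:j], dtype=float)
        return float(((seg - seg.mean()) ** 2).sum())

    INF = float("inf")
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    back = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):           # last interval is series[i:j]
                c = dp[i][m - 1] + sse(i, j)
                if c < dp[j][m]:
                    dp[j][m], back[j][m] = c, i
    # walk the backpointers to recover the intervals
    cuts, j = [], n
    for m in range(k, 0, -1):
        i = back[j][m]
        cuts.append((i, j))
        j = i
    return cuts[::-1]
```

This naive version recomputes interval costs and runs in roughly O(n^3 k); precomputing prefix sums of the series and its squares brings the cost lookup down to O(1).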
The following figures show the optimal partition trees obtained on a few datasets:

[Figure im8.png: optimal partition tree]

[Figure im9.png: optimal partition tree]

Experimental Results with Synthetic and CAT Lab data

The following figures show the output bins obtained for a few of the datasets using a few of the algorithms and distance measures.

From the regions marked (a, b) in the following figures we can see that the weekend days are clearly different from the weekdays marked (d, e). Further, if we consider a very high-level granularity such as region (c), using StMerg with the KL distance we can entirely demarcate the weekends from the weekdays (the global outliers can be detected, marked by a circle).

[Figure: US-50-East output bins]

[Figure: US-50-West output bins]

In the next figure, temporal intervals (a) and (b), obtained after merging the intervals, show that there is an anomaly in the data from 05/12/2009 23:03:39 to 05/13/2009 00:03:39. We indeed find that the speed value recorded by the sensor is a constant 20 over this entire time period, and we verified with a CATT Lab representative that this is due to a possible "free-flow speed condition", where the sensor may go into a power-save mode if the traffic intensity is very minimal or close to zero.

[Figure: I-395 output bins]

We have also extended the well-known and popular technique SAX (Symbolic Aggregate approXimation) to SAXMerg, where we merge adjacent bins represented by the same symbol, as shown below.

[Figure sax.png: SAXMerg]
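The idea can be sketched as follows: standard SAX z-normalizes the series, takes per-bin means (PAA), and discretizes them against Gaussian breakpoints; the merge step then collapses runs of identical symbols. The 3-symbol breakpoints below are the standard SAX values for alphabet size 3; the function names are illustrative:

```python
import numpy as np

def sax_symbols(series, n_bins, breakpoints=(-0.43, 0.43)):
    """Assign each of n_bins equal-size bins a SAX symbol (0, 1, 2, ...)."""
    x = (np.asarray(series, float) - np.mean(series)) / np.std(series)
    paa = [seg.mean() for seg in np.array_split(x, n_bins)]  # per-bin means
    return [sum(m > b for b in breakpoints) for m in paa]

def saxmerg(symbols):
    """SAXMerg: merge runs of adjacent bins sharing the same symbol.

    Returns (start_bin, end_bin, symbol) triples over the bin indices.
    """
    merged, start = [], 0
    for i in range(1, len(symbols) + 1):
        if i == len(symbols) or symbols[i] != symbols[start]:
            merged.append((start, i, symbols[start]))
            start = i
    return merged
```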

We then compare our results with those obtained using SAXMerg, as shown in the following figures.

As can be seen from the merged output bins in the next figure, OPTMerg with the Mahalanobis distance measure is more robust to local outliers than SAXMerg.

[Figure im10: OPTMerg output bins]

[Figure saxcomp: comparison with SAXMerg]

Some results with precision and recall:

[Figure im11.png: precision and recall results]
