Capacity planning is an arduous, ongoing task for many operations teams, especially for those who rely on Virtual Machines (VMs) to power their business. At Pivotal, we have developed a data science approach that automates this task by building hundreds of thousands of forecasting models using multivariate time series. The technique can be reused in other areas, such as industrial equipment or vehicle engines, and applies broadly wherever regular monitoring data can be collected.
For the full blog, please see the original article, Multivariate Time Series Forecasting for VM Capacity Planning, on the Pivotal blog.
The Objective & The Plan
The goal of our model is to forecast monitoring metrics from a virtual machine and predict when usage is going to hit a capacity threshold, so that outages can be prevented or SLAs can be met.
In our example, the data is obtained from the vCenter adapter and includes CPU usage, disk, memory and network-related metrics. We collected a total of 34 metrics, averaged over 5 minutes, for each VM across a total of 78 days.
The steps for building this model and creating the forecast include:
Variable selection to reduce the total number of variables
Generating multivariate time series from the metrics by sampling them at regular intervals
Forecasting the multivariate time series signal using a Vector Autoregressive (VAR) model
Threshold setting to create alerts
To start, we need to determine which variables are important to incorporate into the model. Pearson’s correlation works well for determining whether two features x and y are correlated highly enough to be considered redundant for forecasting.
Fig. 1: Pearson's correlation between 13 metrics retained in the model. The labels on the x-axis are the same as the labels on the y-axis
The Pearson correlation coefficient was calculated to reflect the degree of linear relationship between metrics. This was done in a distributed manner using MADlib's correlation function.
Coefficients vary from -1 to 1, where 1 indicates perfect positive correlation (the variables increase together), 0 indicates no linear correlation, and -1 indicates perfect anti-correlation. Using a correlation threshold of 0.8, which was chosen arbitrarily, we were able to narrow the complete set of 34 metrics down to 13.
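The selection step can be illustrated with a minimal Python sketch. This is not the MADlib-based implementation used in the article: the greedy rule (keep a metric only if its absolute correlation with every metric already kept is below the threshold), the function names, and the toy data are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def select_metrics(series, threshold=0.8):
    """Greedily keep metrics whose |r| with every kept metric is below threshold.

    `series` maps a metric name to its list of samples. Processing metrics in
    dictionary order is an assumption of this sketch, not the article's rule.
    """
    kept = []
    for name, values in series.items():
        if all(abs(pearson(values, series[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: cpu_demand tracks cpu_usage almost exactly.
metrics = {
    "cpu_usage": [10, 20, 30, 40, 50],
    "cpu_demand": [11, 19, 31, 41, 49],
    "disk_read": [5, 1, 4, 2, 3],
}
selected = select_metrics(metrics)
```

In this toy data, cpu_demand is dropped as redundant with cpu_usage, while disk_read is retained because its correlation with cpu_usage is weak.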
Generating Multivariate Time Series
Next, we need to formulate the right model and learn the model coefficients from the training data. By using the Vector Autoregressive (VAR) model to forecast the multivariate time series data, we are able to capture the linear interdependencies between multiple variables. Each variable has a regression-like equation, in which it is regressed against its own lagged values and the lagged values of the other variables.
Our VAR comprises a set of K variables $y_t = (y_{1t}, \dots, y_{Kt})^T$. The VAR(p) process is then defined for ‘p’ lags as

$$y_t = A_1 y_{t-1} + A_2 y_{t-2} + \dots + A_p y_{t-p} + u_t$$

where $A_i$ are (K x K) coefficient matrices for i = 1, 2, …, p and $u_t$ is a K-dimensional process with $E(u_t) = 0$ and a time-invariant positive definite covariance matrix $E(u_t u_t^T) = \Sigma_u$.
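To make the regression structure concrete, consider a toy case with K = 2 metrics and a single lag (p = 1). These sizes are chosen purely for illustration; the symbols $a_{ij}$ denote the entries of the one coefficient matrix and $u_{i,t}$ the error terms. Each metric is regressed on the previous value of itself and of the other metric:

```latex
\begin{aligned}
y_{1,t} &= a_{11}\, y_{1,t-1} + a_{12}\, y_{2,t-1} + u_{1,t} \\
y_{2,t} &= a_{21}\, y_{1,t-1} + a_{22}\, y_{2,t-1} + u_{2,t}
\end{aligned}
```

The off-diagonal coefficients $a_{12}$ and $a_{21}$ are what let one metric's history inform the forecast of another.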
In our example, since we have 13 VM performance metrics, the value of K is 13. Let $y_{h,t}$ be the vector of the 13 performance metrics for a specific VM at a particular time of the day ‘h’ on day ‘t’. For each metric $y_{i,h,t}$ at a particular time of the day ‘h’ on day ‘t’, the current value of the metric is regressed against its own lagged values at the time of the day ‘h’ for the previous ‘p’ days and the lagged values of the other metrics at the time of the day ‘h’ for the previous ‘p’ days. Since only the first 68 days were used for training data, the value of ‘p’ is 67.
Forecasting Using The VAR Model
Our VAR model comprises the performance metrics for each VM, corresponding to a time of the day ‘h’. Let $VAR(p)_h^{(k)}$ denote the model developed for time of the day ‘h’ using data from the past ‘p’ days and the performance metrics from the ‘k’th VM:

$$y^{(k)}_{h,t} = A_1 y^{(k)}_{h,t-1} + \dots + A_p y^{(k)}_{h,t-p} + u_t$$

where $y^{(k)}_{h,t}$ denotes the vector of performance metrics for the ‘k’th VM at a particular time of the day ‘h’ on day ‘t’, and $y^{(k)}_{h,t-1}$ denotes the corresponding vector at the time of the day ‘h’ on day ‘t-1’. The coefficient matrices $A_i$ are estimated separately for each model.
The total number of models depends on the number of VMs and on the number of periods per day we are looking to forecast: one model is built for each combination of VM ‘k’ and time of the day ‘h’.
Each model built for a specific VM ‘k’ at hour ‘h’ is independent of the others. As a result, we can take advantage of the MPP architecture of Greenplum Database and HAWQ to run each model in a separate Greenplum segment, making the computation feasible. This is done by distributing the data for each combination of virtual machine and time of the day and running the model on it via PL/R.
The pseudo code for this is shown below. The function var_model is the PL/R function which calculates the coefficients of the VAR model based on the input data.
Code 1: Function to evaluate the VAR model in a distributed manner
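The partitioning idea can also be sketched in plain Python. This is not the authors' PL/R function: the row layout (vm, day, hour, metrics), the helper names, and the toy data are hypothetical, and fit_placeholder only returns per-metric means as a stand-in for the coefficient estimation that var_model performs.

```python
from collections import defaultdict

def partition_by_vm_and_hour(rows):
    """Group rows into one daily series per (vm, hour) key.

    Each row is (vm, day, hour, metrics), where metrics is a list of floats.
    """
    groups = defaultdict(list)
    for vm, day, hour, metrics in rows:
        groups[(vm, hour)].append((day, metrics))
    # Order each series by day so that lagged values line up correctly.
    return {
        key: [m for _, m in sorted(obs, key=lambda o: o[0])]
        for key, obs in groups.items()
    }

def fit_placeholder(series):
    """Hypothetical stand-in for var_model: returns the per-metric mean.

    A real implementation would estimate the VAR coefficients here.
    """
    n = len(series)
    return [sum(col) / n for col in zip(*series)]

def fit_all(rows, fit=fit_placeholder):
    """Fit one independent model per (vm, hour) partition.

    No call depends on another, so each could be dispatched to its own worker.
    """
    partitions = partition_by_vm_and_hour(rows)
    return {key: fit(series) for key, series in partitions.items()}

# Hypothetical toy rows: (vm, day, hour, [metric_1, metric_2]).
rows = [
    ("vm-1", 2, 0, [30.0, 4.0]),
    ("vm-1", 1, 0, [10.0, 2.0]),
    ("vm-1", 1, 1, [50.0, 6.0]),
    ("vm-2", 1, 0, [70.0, 8.0]),
]
models = fit_all(rows)
```

Because every (vm, hour) partition is fitted without reference to any other, the loop in fit_all is the part that an MPP database can spread across segments.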
The forecast error does start increasing when we forecast far into the future, because each forecasted value ($y_{t+1}$) is used to obtain the forecast for the next value ($y_{t+2}$). Fig. 3 shows the fitted and residual values for CPU_usage_average, where the blue dotted lines in the top plot are forecasted values and the solid lines are actual values. The black line in the middle plot shows the residuals for the corresponding forecasts. The bottom two plots are the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the residuals. Significant autocorrelation in the residuals would indicate that the model can be improved further; our model, however, appears to be accurate.
Fig. 3: Fitted and residual values from a particular VAR model.
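The way each forecast feeds the next one can be shown with a minimal sketch. It assumes a one-lag model with no error term, and the coefficient matrix A and starting vector are hypothetical values rather than fitted ones.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def forecast(coef, last_obs, steps):
    """Iterate y_{t+1} = A y_t, feeding each forecast back in as the next input."""
    forecasts = []
    current = list(last_obs)
    for _ in range(steps):
        current = mat_vec(coef, current)
        forecasts.append(current)
    return forecasts

# Hypothetical coefficient matrix and last observed metric vector.
A = [[0.5, 0.1],
     [0.0, 0.8]]
path = forecast(A, [10.0, 5.0], steps=3)
```

Since the second forecast is computed from the first forecast rather than from an observed value, any error in an early step is carried into every later step.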
Generating Thresholds For Alerts
Next, thresholds need to be developed to flag alerts. Hotelling's T² statistic measures the deviation of a given vector from the sample mean vector. It is generally used to monitor a multivariate time series signal and is defined as

$$T^2 = (x - m)^T S^{-1} (x - m)$$

where x is the vector of the VM performance metrics, m is the mean vector obtained from the training data, and S is the covariance matrix.
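As a rough illustration of the statistic, the following pure-Python sketch computes T² = (x - m)ᵀ S⁻¹ (x - m) for a toy training set with two metrics. The helper names and the data are hypothetical, and estimating S as the sample covariance of the training data is an assumption of this sketch.

```python
def mean_vector(data):
    """Column-wise mean of a list of observation vectors."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]

def covariance(data, mean):
    """Sample covariance matrix (divides by n - 1)."""
    n, k = len(data), len(mean)
    cov = [[0.0] * k for _ in range(k)]
    for row in data:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(k):
            for j in range(k):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, k))
        z[i] = (aug[i][k] - tail) / aug[i][i]
    return z

def hotelling_t2(x, mean, cov):
    """T^2 = (x - m)^T S^{-1} (x - m), computed without forming S^{-1}."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(di * zi for di, zi in zip(d, solve(cov, d)))

# Hypothetical training data: two positively correlated metrics.
training = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
m = mean_vector(training)
S = covariance(training, m)
t2_along = hotelling_t2([4.0, 4.0], m, S)    # moves with the correlation
t2_against = hotelling_t2([4.0, 1.0], m, S)  # moves against it
```

Both test points lie at the same Euclidean distance from the mean, but the one that breaks the usual relationship between the two metrics receives the larger T² value, which is why the statistic suits correlated monitoring metrics.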
The T² statistic is calculated for both the training data and the forecasted data. Once the forecasted values are summarized using the T² statistic, a threshold can be chosen based on the T² values from the training data. The Phase I threshold for single observations is calculated based on the number of observations and the beta distribution. This threshold determines the Upper Control Limit (UCL) and Lower Control Limit (LCL) for each variable. If the statistic is either higher than the UCL or lower than the LCL, then it needs to be investigated as a possible out-of-control signal. The values of the T² statistic for a particular VM are shown in Fig. 4, where the threshold for this VM is 29.2.
The forecasted value of the T² statistic on the 70th day is an indication that this VM will have a problem on that day.
Fig. 4: T² statistic values for a particular VM for both training (first 68 days) and forecasted data (day 69 on). High T² statistic values in forecasted data are flagged as outliers. Upper Control Limit (UCL) is 29.2 and Lower Control Limit (LCL) is 0.
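For reference, one common textbook form of the Phase I upper limit for individual observations is sketched below. The article does not state the exact formula or significance level that was used, so both are assumptions here. In this form, n is the number of training observations, K is the number of metrics, and $B_{\alpha;\,a,\,b}$ is the upper $\alpha$ quantile of a Beta(a, b) distribution.

```latex
\mathrm{UCL} = \frac{(n-1)^2}{n}\, B_{\alpha;\; K/2,\; (n-K-1)/2}
```

With a one-sided limit of this kind the lower limit is simply 0, which matches the LCL reported in Fig. 4.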
Scalability Of The Model
Since the computations for each VM are independent of those for other VMs, the model we developed is well suited to run in MPP data stores and is therefore very scalable. For example, to generate models for 10,000 VMs for each hour of the day, we would need to build 240,000 models (10,000 VMs × 24 hours). By making use of the MPP architecture of Greenplum, these models can be run in parallel on independent Greenplum segments, making the approach highly scalable.