Capacity Forecasting in a Backup Storage Environment

Venue: LISA'11
Category: Workload Analysis


1. Summary

Motivation of this paper

Many system administrators already have historical data for their systems and thus can predict full capacity events in advance.

A proactive tool is needed:

  1. It should predict the date of full capacity and provide advance notification.
  2. There seems to be little previous work discussing applications of predictive modeling to data storage environments.

This paper presents the predictive model employed internally at EMC to forecast system capacity.

The model generates alert notifications months before systems reach full capacity.

Data Domain

  1. Data Domain systems employ inline deduplication technology on disk.

Customers can configure their Data Domain systems to send an email every day with detailed diagnostic information (an autosupport).

  1. Most customers choose to send autosupports to EMC.

The historical data enables more effective customer support.

Two variables of capacity forecasting:

  1. Total physical capacity of the system (changes over time)
  2. Total physical space used by the system

Linear regression is one of the most common methods employed in predictive modeling.

  1. Applying it here is challenging because system behavior changes over time.
  2. Blind application of regression to the entire data set often leads to poor predictions.
  3. One remedy is to choose a subset of recent data, such as the prior 30 days.
  4. This eliminates the influence of the older data and improves the accuracy of the model's predictions.

How to reduce the error rate of the original linear regression model:

  1. Apply the regression to the data subset that best represents the most recent behavior.
  2. To find the best subset, the boundary where the recent behavior begins to deviate must be determined.
  3. The goodness-of-fit of a linear regression is measured by $R^2$; $R^2 = 1$ indicates perfectly linear data.
  4. Run the regression on candidate subsets of recent data and select the subset with maximum $R^2$.
  5. The calculated boundary occurs near the discontinuity of the true function.
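
The subset selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `linear_fit` and `best_recent_fit`, the `min_points` floor of 10 samples, and the choice to try every window ending at the newest sample are all assumptions made for the example.

```python
import numpy as np

def linear_fit(x, y):
    """Least-squares fit of y = slope*x + intercept; returns (slope, intercept, R^2)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    # A constant series has no variance to explain; treat it as a non-fit.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return float(slope), float(intercept), r2

def best_recent_fit(days, used, min_points=10):
    """Fit every window that ends at the newest sample and keep the one with maximum R^2.

    min_points is a hypothetical lower bound on window size, chosen for illustration.
    """
    days = np.asarray(days, dtype=float)
    used = np.asarray(used, dtype=float)
    best = None
    for n in range(min_points, len(days) + 1):
        slope, intercept, r2 = linear_fit(days[-n:], used[-n:])
        if best is None or r2 > best["r2"]:
            best = {"start": len(days) - n, "slope": slope,
                    "intercept": intercept, "r2": r2}
    return best
```

On a series that is flat and then starts growing linearly, the window picked this way should begin close to the point where growth starts, which is the boundary the notes refer to.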


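  1. Goodness-of-fit: the selected regression should fit the data well (high $R^2$).
  2. Positive slope: used space must be growing for the model to predict a full-capacity date.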
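
The two checks above, followed by the extrapolation to full capacity, might look like the sketch below. The function name `days_until_full` and the `min_r2` threshold of 0.9 are placeholders for illustration and are not taken from these notes.

```python
def days_until_full(slope, intercept, today, capacity, r2, min_r2=0.9):
    """Return days until the fitted line reaches capacity, or None if the model is rejected.

    min_r2 is an assumed threshold for illustration only.
    """
    # Reject models with a poor fit or without growth: they cannot give a usable date.
    if r2 < min_r2 or slope <= 0:
        return None
    # Solve capacity = slope * day + intercept for the day of full capacity.
    full_day = (capacity - intercept) / slope
    return max(full_day - today, 0.0)
```

Returning `None` for rejected models keeps "no forecast" distinct from "already full", which matters when the result feeds an alerting rule.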

Implementation and Evaluation

  1. Analysis of the quality of forecasts

False positives can be caused by hardware changes or software changes. From a statistical perspective, it is unknown whether the recent data points are signal or noise.

  1. Capacity forecasting example in Data Domain storage systems

2. Strength (Contributions of the paper)

  1. This paper shows that a predictive model faces a trade-off between eliminating reasonable models and generating false positives.

By requiring more data for its models, the system gains higher confidence in its predictions but reduces the advance notification for true positives.

3. Weakness (Limitations of the paper)

  1. If historical data does not demonstrate linear growth, then linear regression would obviously be a poor fit.

4. Future Works

  1. The paper mentions that there exist many other models which can be applied to time series data:
  1. weighted linear regression
  2. logarithmic regression
  3. auto-regressive (AR) model
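
As an illustration of the first alternative, a weighted linear regression could down-weight older samples instead of discarding them. The sketch below is an assumption-laden example, not something described in the paper: the exponential decay, the `half_life` parameter, and the function name `weighted_linear_fit` are all hypothetical.

```python
import numpy as np

def weighted_linear_fit(days, used, half_life=30.0):
    """Fit a line where a sample's squared-error weight halves every half_life days of age."""
    days = np.asarray(days, dtype=float)
    used = np.asarray(used, dtype=float)
    age = days.max() - days
    weights = 0.5 ** (age / half_life)
    # np.polyfit multiplies each residual by w before squaring,
    # so pass the square root of the desired squared-error weights.
    slope, intercept = np.polyfit(days, used, 1, w=np.sqrt(weights))
    return float(slope), float(intercept)
```

Compared with picking a hard boundary, this trades a discrete choice of window for a smooth one, at the cost of a decay rate that has to be tuned.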

There is an open question whether the remaining systems can be modeled by other methods.

  1. How can this model be improved to be compatible with other systems? Can this predictive model find other applications, such as predicting bandwidth throughput, load balancing, or I/O capacity?
  2. The paper shows that the majority of systems exhibit very linear behavior, since the linear model fit the datasets very well.