| Venue | Category |
| --- | --- |
| ATC'13 | Deduplication, Sampling |
# Estimating Duplication by Content-based Sampling

## 1. Summary

### Motivation of this paper
The benefit of deduplication in a primary storage system varies for different workloads.
For a workload with a low level of deduplication, one would want to turn off the deduplication feature to avoid its impact on I/O performance and the metadata overhead of deduplication.
Therefore, an estimator is needed that allows customers to quickly estimate the deduplication benefit of their primary data.
Existing deduplication estimators, however, are either not fast enough or not accurate enough.
### Content-based Sample Implementation and Evaluation

A block fingerprint passes the filter and is added to the sample if fingerprint mod M equals a fixed value (e.g., 0), where M is the filter divisor; roughly 1/M of all blocks pass the filter.
Idea: split the fingerprint space into partitions and use only one of the partitions for the estimation.
The distinct (post-deduplication) block size of the whole data set is then estimated by scaling up the distinct size observed in the sample; the estimation can also handle the case where block sizes differ.
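A minimal sketch of this sampling-and-scaling procedure, assuming SHA-1 fingerprints, a divisor M = 64, and a fixed residue of 0; the function and variable names are hypothetical and not the paper's actual implementation:

```python
import hashlib

M = 64  # filter divisor (assumed): roughly 1/M of all blocks pass the filter

def fingerprint(block: bytes) -> int:
    """Content-based fingerprint of a block (SHA-1 digest interpreted as an integer)."""
    return int.from_bytes(hashlib.sha1(block).digest(), "big")

def estimate_dedup(blocks):
    """Estimate the distinct (post-dedup) size and dedup ratio from a content-based sample.

    `blocks` is an iterable of byte strings; len() is used as the block size,
    so variable-sized blocks are handled by summing sizes rather than counting blocks.
    """
    total_size = 0
    sampled_distinct = {}  # fingerprint -> block size, for distinct blocks in the sample
    for block in blocks:
        total_size += len(block)
        fp = fingerprint(block)
        # Content-based filter: the block is sampled iff fp mod M == 0 (assumed residue).
        # All duplicates of a block share the same fingerprint, so they pass or fail together.
        if fp % M == 0:
            sampled_distinct[fp] = len(block)
    # Scale the distinct size observed in the sample back up by the divisor M.
    est_distinct_size = M * sum(sampled_distinct.values())
    est_dedup_ratio = total_size / est_distinct_size if est_distinct_size else float("inf")
    return est_distinct_size, est_dedup_ratio
```

Because the filter is content-based rather than random per occurrence, every copy of a given block is either sampled or skipped as a whole, which is what allows the distinct size seen in the sample to be scaled up by M.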
## 2. Strength (Contributions of the paper)

The paper presents a complete theoretical analysis to support its idea. The insight is that, for a large dataset, it is possible to estimate the deduplication ratio accurately by sampling.

## 3. Weakness (Limitations of the paper)

The sampling method itself, i.e., content-based sampling on block fingerprints, is very simple and not very novel from my perspective.

## 4. Future Works