| Venue | Category |
|---|---|
| MSST'12 | Deduplication Estimation |
# Estimation of Deduplication Ratios in Large Data Sets

1. Summary
   - Motivation of this paper
   - Method
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Future Works

## 1. Summary

### Motivation of this paper
This paper studies the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set.
This paper focuses on what can be done when scanning the entire data set, using as little RAM as possible and without reading most of the actual data.
The method in this paper can be used for:

- estimating the number of disks to buy,
- choosing a deduplication technique, and
- deciding whether or not to deduplicate.
The core question is how to estimate the deduplication and compression ratios in an efficient manner. The heaviest costs are disk accesses and the compression computation itself.
### Method

The method runs in two phases. In the first phase, *m* elements (*m* is a configurable parameter fixed in advance) are chosen uniformly at random from the entire data set; for each chosen element, its hash value and compression ratio are computed, and the hash is added to a set, the base sample, together with a counter that will record its number of duplicates.

Rationale: because draws are uniform over element positions, a value that has two replicas in the data set has double the probability of being included in the base sample; in general, a value with *x* copies is *x* times as likely to be drawn.
Sampling method (see the sketch below):

- choose *m* random numbers in the range of element positions, i.e., random offsets into the data set; or
- when scanning sequentially, generate the number of picks per region according to a binomial distribution.
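A minimal sketch of the first phase, assuming the data set is exposed as an indexable sequence of chunk payloads; `elements`, `build_base_sample`, and the choice of SHA-1 are illustrative, not from the paper:

```python
import hashlib
import random

def build_base_sample(elements, m, seed=0):
    """Phase 1 (sketch): draw m elements uniformly at random.

    Draws are uniform over positions, so a value stored x times is
    x times as likely to enter the base sample. The sample is a
    multiset of signatures plus a duplicate counter per signature.
    """
    rng = random.Random(seed)
    signatures = []   # one entry per draw (multiset, repeats allowed)
    counts = {}       # signature -> replicas seen during the full scan
    for _ in range(m):
        payload = elements[rng.randrange(len(elements))]
        sig = hashlib.sha1(payload).digest()
        signatures.append(sig)
        counts[sig] = 0
    return signatures, counts
```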
In the second phase, the entire data set is scanned and the hash signature of every element is computed, but a hash is only recorded if it matches an element of the base sample: whenever a scanned signature is matched in the base sample, the corresponding count is incremented by 1, as sketched below.
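Continuing the sketch above, the second phase hashes everything but stores nothing new, so memory stays proportional to the sample size *m*:

```python
import hashlib

def scan_and_count(elements, counts):
    """Phase 2 (sketch): full scan against the base sample."""
    for payload in elements:
        sig = hashlib.sha1(payload).digest()
        if sig in counts:
            counts[sig] += 1   # counts every replica of a sampled value
```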
It finally estimates the deduplication ratio from the recorded counts: each of the *m* draws contributes the reciprocal of its duplicate count, and the estimate is the average of these contributions (scaled by per-element compression ratios when compression is included).
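Written out (notation mine, consistent with the counting above): with $x_i$ the duplicate count recorded for the $i$-th draw, $n$ the total number of elements, and $D$ the number of distinct values,

$$
\hat r = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{x_i},
\qquad
\mathbb{E}\!\left[\frac{1}{x}\right]
= \sum_{v}\frac{x_v}{n}\cdot\frac{1}{x_v}
= \frac{D}{n},
$$

so $\hat r$ is an unbiased estimate of the deduplicated fraction $D/n$.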
### Implementation and Evaluation

Whole-file deduplication achieves less than optimal deduplication ratios, yet it is easy to implement and can perform sufficiently well on some workloads.
In this paper, the actual data needs to be read only for the small fraction of the files that relates to the base sample. When handling files, each byte is chosen independently with probability *m*/*S*, where *S* is the total number of bytes in the data set (a sketch follows).
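A sketch of how the per-file picks could be drawn; `sampled_offsets` and the use of `random.binomialvariate` (Python 3.12+) are assumptions, with *m* and *S* as above:

```python
import random

def sampled_offsets(file_len, total_bytes_S, m, rng=None):
    """Sketch: pick sampled byte offsets inside one file.

    Each byte of the data set is chosen independently with probability
    m / S, so the number of picks landing in a file of file_len bytes
    follows Binomial(file_len, m / S).
    """
    rng = rng or random.Random()
    p = m / total_bytes_S
    k = rng.binomialvariate(file_len, p)   # Python 3.12+
    return sorted(rng.randrange(file_len) for _ in range(k))
```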
In many file systems the first block of a file resides in its i-node, so it can be read quickly during a metadata scan without additional disk seeks.
For each file, two quantities are recorded:

- the length of the file, and
- a hash signature of the first block of the file.

Only if both of these match between files is the full hash of the file computed and its frequency counted (see the sketch below).
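A sketch of that two-step filter; the 4 KiB first-block size and SHA-1 are assumptions, the paper only needs a length and a first-block signature:

```python
import hashlib
import os
from collections import defaultdict

FIRST_BLOCK = 4096  # assumed size of the block kept in the i-node

def full_file_frequencies(paths):
    """Sketch: cheap (length, first-block hash) buckets, then full hashes.

    Full-file hashes are computed only inside buckets holding more than
    one file, i.e., only where a whole-file duplicate is possible.
    """
    buckets = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            head = f.read(FIRST_BLOCK)
        key = (os.path.getsize(path), hashlib.sha1(head).digest())
        buckets[key].append(path)

    freq = defaultdict(int)  # full-file signature -> number of copies
    for group in buckets.values():
        if len(group) < 2:
            continue  # unique (length, first block): no full read needed
        for path in group:
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            freq[h.digest()] += 1
    return freq
```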
## 3. Weakness (Limitations of the paper)

- Both analytically and in practice, a sample of the data does not suffice to distinguish whether replication is a local phenomenon or a global one.
- To determine the required sample size, the method needs an a priori bound on the deduplication ratio.
- Examining only a sample of the data, whether chosen randomly or according to various sampling methodologies, can give arbitrarily skewed results.
"educated-sampling"
the overall size of the data set can be computed by a standard traversal of the file system e.g., the unix command
the sampling should choose exact offsets in the file, and then choose the chunk which contains this offset. this can relieve the need to read entire file