Venue | Category |
---|---|
ATC'16 | chunking |
FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
1. Summary
Motivation of this paper
FastCDC
Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
Motivation
Existing CDC-based chunking introduces heavy CPU overhead
With the Gear hash function speeding up hash calculation, the bottleneck shifts to the hash judgment stage.
Three key designs
Simplified but enhanced hash judgment
Sub-minimum chunk cut-point skipping
Normalized chunking
normalizes the chunk size distribution to a small specified region
Gear hash function
uses an array of 256 random 64-bit integers to map the value of each byte in the sliding window.
using only three operations (i.e., +, <<, and an array lookup)
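A minimal sketch of the Gear rolling hash described above, assuming a 64-bit fingerprint and a seeded pseudo-random table. The names `make_gear_table`, `gear_update`, and `gear_fingerprint` are illustrative, not taken from the paper.

```python
import random

MASK64 = (1 << 64) - 1  # keep the fingerprint within 64 bits


def make_gear_table(seed=0):
    """Build 256 random 64-bit integers, one per possible byte value."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(256)]


def gear_update(fp, byte, table):
    """One rolling step: a shift, an addition, and one array lookup."""
    return ((fp << 1) + table[byte]) & MASK64


def gear_fingerprint(data, table):
    """Fingerprint after feeding every byte of data through gear_update."""
    fp = 0
    for b in data:
        fp = gear_update(fp, b, table)
    return fp
```

Because each step shifts the fingerprint left by one bit, a byte's contribution leaves a 64-bit fingerprint after 64 further steps, so no explicit removal of the oldest byte is needed.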
Optimizing hash judgment
Gear-based CDC employs the same conventional hash judgment as Rabin-based CDC
FastCDC enlarges the sliding window size by padding a number of zero bits into the mask value
changes the hash judgment statement from "fp mod D = r" to "fp & Mask = 0"
involve more bytes in the final hash judgment
simplifying the hash judgment to accelerate CDC
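The two judgment styles and the effect of zero padding can be sketched as follows. This is an illustration under assumptions: the padded mask constant is the 13-bit value as I recall it from the paper's pseudocode, and `effective_window` is a hypothetical helper that relies on the Gear hash shifting one bit per byte.

```python
def is_cut_point_mod(fp, divisor, remainder):
    """Conventional judgment: fp mod D == r."""
    return fp % divisor == remainder


def is_cut_point_mask(fp, mask):
    """Simplified judgment: cut when all masked fingerprint bits are zero."""
    return (fp & mask) == 0


def effective_window(mask):
    """Trailing bytes that can influence the masked bits of a Gear fingerprint.

    Bit k of the fingerprint depends on the last k + 1 bytes, so the
    highest set bit of the mask determines the window size.
    """
    return mask.bit_length()


LOW_13_BITS = (1 << 13) - 1          # 13 one bits packed at the bottom
PADDED_13_BITS = 0x0000D90303530000  # 13 one bits spread over 48 bits (assumed value)
```

Both masks give the same per-byte cut probability because they have the same number of '1' bits, but the padded mask lets more bytes take part in each judgment.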
Cut-point skipping
avoid the operations for hash calculation and judgment in the skipped region.
the cumulative distribution of chunk size in Rabin-based CDC (without the maximum and minimum chunk size requirements) follows an exponential distribution.
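To see why skipping affects the deduplication ratio, the exponential model above can be evaluated directly. This is a small sketch assuming an 8 KB expected chunk size and a 2 KB minimum (illustrative values); `fraction_below` and `bytes_skipped` are hypothetical helper names.

```python
import math


def fraction_below(size, expected_size):
    """Exponential CDF: share of chunks shorter than size when the mean is expected_size."""
    return 1.0 - math.exp(-size / expected_size)


def bytes_skipped(chunk_lengths, min_size):
    """Bytes for which skipping avoids hash calculation and judgment."""
    return sum(min(n, min_size) for n in chunk_lengths)


# Share of cut-points that would fall below a 2 KB minimum at an 8 KB mean.
share = fraction_below(2048, 8192)
```

Under these assumptions roughly 22% of the natural cut-points fall inside the skipped region, so those chunks end at a later cut-point than content alone would dictate.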
Normalized chunking
solves the problem of the decreased deduplication ratio caused by the cut-point skipping approach.
After normalized chunking, there are almost no chunks of size smaller than the minimum chunk size
By changing the number of '1' bits in the mask (more '1' bits before the expected chunk size, fewer after it), FastCDC approximately normalizes the chunk-size distribution to a specific region (always larger than the minimum chunk size, instead of following the exponential distribution)
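The mask switch behind normalized chunking can be sketched as below. The two constants are the 15-bit and 11-bit masks as I recall them from the paper's pseudocode, and `mask_for_position` and `cut_probability` are hypothetical helpers; treat all of them as assumptions.

```python
MASK_S = 0x0003590703530000  # 15 one bits: harder to satisfy, used before the normal size
MASK_L = 0x0000D90003530000  # 11 one bits: easier to satisfy, used after the normal size


def mask_for_position(pos, normal_size):
    """Pick the stricter mask before normal_size and the looser one from there on."""
    return MASK_S if pos < normal_size else MASK_L


def cut_probability(mask):
    """Chance that a uniformly random fingerprint has all masked bits zero."""
    return 2.0 ** -bin(mask).count("1")
```

With these masks a cut is 16 times more likely per byte after the normal size than before it, which pulls chunk sizes toward the normal size from both sides.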
The whole algorithm
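Putting the three designs together, a rough Python sketch of the chunking loop might look like the following. It is not the authors' implementation: the size parameters (2 KB minimum, 8 KB normal, 64 KB maximum), the seeded Gear table, the mask constants, and the function names `next_cut` and `chunks` are all assumptions made for illustration.

```python
import random

MASK64 = (1 << 64) - 1
MASK_S = 0x0003590703530000  # stricter mask (assumed value, 15 one bits)
MASK_L = 0x0000D90003530000  # looser mask (assumed value, 11 one bits)
_rng = random.Random(1)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # illustrative Gear table


def next_cut(data, start, min_size=2048, normal_size=8192, max_size=65536):
    """Return the end offset (exclusive) of the chunk beginning at start."""
    remaining = len(data) - start
    if remaining <= min_size:
        return len(data)
    limit = min(remaining, max_size)
    normal = min(normal_size, limit)
    fp = 0
    i = min_size  # cut-point skipping: no hashing or judgment below min_size
    while i < normal:  # before the normal size: stricter mask
        fp = ((fp << 1) + GEAR[data[start + i]]) & MASK64
        if (fp & MASK_S) == 0:
            return start + i
        i += 1
    while i < limit:  # past the normal size: looser mask
        fp = ((fp << 1) + GEAR[data[start + i]]) & MASK64
        if (fp & MASK_L) == 0:
            return start + i
        i += 1
    return start + limit  # forced cut at the maximum size or end of data


def chunks(data):
    """Split data into a list of content-defined chunks."""
    out, pos = [], 0
    while pos < len(data):
        end = next_cut(data, pos)
        out.append(data[pos:end])
        pos = end
    return out
```

Calling `chunks(some_bytes)` returns pieces that concatenate back to the input; every piece except possibly the last is at least `min_size` bytes long, and none exceeds `max_size`.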
Evaluation standard
Compared with
Evaluation of optimizing hash judgment
Evaluation of cut-point skipping
Evaluation of normalized chunking
Comprehensive evaluation of FastCDC
enhanced hash judgment
sub-minimum chunk cut-point skipping
normalized chunking
algorithmic-oriented CDC optimizations
hardware-oriented CDC optimizations