| Venue | Category |
| --- | --- |
| SoCC'19 | Chunking |
RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems

1. Summary
   - Motivation of this paper
   - RapidCDC
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
CDC is compute-intensive and time-consuming
Duplicate locality implication
Main idea
Leverage duplicate locality to remove the need for byte-by-byte window rolling when determining chunk boundaries.
Quantitative analysis of duplicate locality
Use the number of contiguous deduplicatable chunks immediately following the first deduplicatable chunk to quantify the locality (sketched below).
The majority of duplicate chunks are in such LQ sequences.
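A minimal sketch of how this locality metric could be computed; the function name and the per-chunk duplicate-flag input are illustrative assumptions, not code from the paper:

```python
def lq_sequence_lengths(is_duplicate):
    """For each run of deduplicatable chunks, count the contiguous duplicate
    chunks that immediately follow the first duplicate chunk of the run."""
    lengths, run = [], 0
    for dup in is_duplicate:
        if dup:
            run += 1
        elif run > 0:
            lengths.append(run - 1)  # exclude the first duplicate of the run
            run = 0
    if run > 0:
        lengths.append(run - 1)
    return lengths

# Two duplicate runs: one of length 3 (locality 2) and one of length 1 (locality 0)
print(lq_sequence_lengths([True, True, True, False, True, False]))  # -> [2, 0]
```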
Design idea
Exploit the duplicate locality in the datasets to enable a chunking method that detects chunk boundaries without byte-by-byte window rolling.
Allow a list of next-chunk sizes (size list) to be attached to a fingerprint
Only 2 bytes are needed to record a chunk size
simpler relationship chain
If the suggested position is accepted, RapidCDC avoids rolling the window one byte at a time, potentially thousands of times, to reach the next chunk boundary.
If it is not accepted, RapidCDC tries another next-chunk size in the size list of the duplicate chunk's fingerprint (see the sketch below).
Accepting suggested chunk boundaries
Maintaining the list of next-chunk sizes
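A minimal sketch of this fast-forwarding design under stated assumptions: the helper names (fp_index, fallback_cdc, is_boundary, fingerprint) are illustrative, and the generic acceptance test stands in for the paper's rolling-window / marker checks:

```python
def rapidcdc_chunk(data, fp_index, fallback_cdc, is_boundary, fingerprint):
    """Sketch: chunk `data` by fast-forwarding along size lists attached to
    fingerprints of previously seen (duplicate) chunks.

    fp_index: dict mapping fingerprint -> list of next-chunk sizes (size list)
    fallback_cdc(data, start): regular CDC search returning the next boundary
    is_boundary(data, pos): acceptance test at a suggested boundary
    fingerprint(chunk_bytes): strong hash used for duplicate detection
    """
    chunks, start, prev_fp = [], 0, None
    while start < len(data):
        end = None
        # Try the suggested next-chunk sizes attached to the previous chunk's fingerprint.
        if prev_fp is not None:
            for size in fp_index.get(prev_fp, []):
                cand = start + size
                if cand <= len(data) and is_boundary(data, cand):
                    end = cand                    # accepted: skip byte-by-byte rolling
                    break
        if end is None:
            end = fallback_cdc(data, start)       # no suggestion accepted: regular CDC
        fp = fingerprint(data[start:end])
        # Record this chunk's size in the size list of the previous chunk's fingerprint,
        # so a future duplicate of that chunk can suggest this boundary directly.
        if prev_fp is not None:
            sizes = fp_index.setdefault(prev_fp, [])
            if (end - start) not in sizes:
                sizes.append(end - start)
        chunks.append((start, end, fp))
        prev_fp, start = fp, end
    return chunks
```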
Datasets
Evaluation
Impact of modification count and distribution
Impact of minimum chunk sizes and hash functions
Throughput of multi-threaded chunking
For production deduplication systems (NetApp ONTAP, Dell EMC Data Domain): 4KB, 8KB, 12KB
LBFS: 2KB, 16KB, 64KB
The boundary-shift issue caused by an insertion or deletion at the beginning of a stored file.
CDC chunking: a chunk boundary is determined at a byte offset that satisfies a predefined chunking condition.
The hash used in the chunking condition does not need to be collision-resistant.
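A minimal sketch of such a chunking condition, here using a Gear-style rolling hash as an illustrative (assumed) choice; a boundary is declared wherever the masked hash hits zero, so the hash only has to spread boundary positions evenly rather than resist collisions. The min/avg/max size parameters below are illustrative defaults, not the paper's configuration:

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random table
MASK = (1 << 13) - 1                                  # fires ~once per 8KB on average

def find_boundary(data, start, min_size=2048, max_size=65536):
    """Roll a Gear-style hash byte by byte from `start` and return the offset
    of the next chunk boundary (the offset satisfying the chunking condition)."""
    h = 0
    end = min(len(data), start + max_size)
    for i in range(start, end):
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
        if i - start + 1 >= min_size and (h & MASK) == 0:
            return i + 1            # chunking condition met: cut after byte i
    return end                      # max chunk size reached (or end of data)

data = bytes(random.getrandbits(8) for _ in range(1 << 20))
print(find_boundary(data, 0))
```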