| Venue | Category |
| --- | --- |
| MASCOTS'10 | Deduplication Chunking |
# Frequency Based Chunking for Data De-Duplication

1. Summary
   - Motivation of this paper
   - Frequency Based Chunking
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Future Works
This paper proposes a frequency-based chunking (FBC) algorithm, which exploits chunk frequency information in the data stream to improve the deduplication gain, especially when the metadata overhead is taken into consideration.
The popular baseline, the content-defined chunking (CDC) algorithm, employs a deterministic distinct sampling technique to determine chunk boundaries based on the data content.
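The boundary rule behind CDC can be sketched as follows. This is a minimal illustration, not the paper's implementation: the divisor, remainder, and size limits are made-up parameters, and a cumulative toy fingerprint (reset at each boundary) stands in for Rabin's sliding-window fingerprint.

```python
# Content-defined chunking sketch: declare a boundary wherever the
# fingerprint satisfies fp % divisor == remainder, subject to min/max
# chunk sizes. All parameters here are illustrative assumptions.
def cdc_chunks(data: bytes, divisor=64, remainder=13,
               min_size=32, max_size=1024):
    chunks, start, fp = [], 0, 0
    MOD = 1 << 32
    for i, b in enumerate(data):
        fp = (fp * 257 + b) % MOD          # toy fingerprint, not Rabin's
        size = i - start + 1
        if (size >= min_size and fp % divisor == remainder) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, fp = i + 1, 0
    if start < len(data):                  # flush the trailing partial chunk
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend only on the bytes themselves, identical regions in two streams chunk identically even if one stream is shifted by an insertion, which is what makes CDC useful for deduplication.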
Drawback of CDC: although CDC is scalable and efficient, content-defined chunking is essentially a random chunking algorithm; it makes no guarantee about the frequency of the resulting chunks, so the chunking may not be optimal for deduplication purposes.
The only tuning option for the CDC algorithm is to reduce the average chunk size, but once the average chunk size falls below a certain value, the gain from eliminating more redundant chunks is offset by the increase in metadata cost.
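This tradeoff can be made concrete with a back-of-the-envelope calculation (all numbers below are hypothetical, chosen only to illustrate the effect; the 20-byte pointer matches the SHA-1 hash mentioned later):

```python
# Net bytes saved = redundant bytes removed - index metadata added.
# Every unique chunk needs a pointer (e.g. a 20-byte SHA-1 hash), so
# smaller chunks mean proportionally more metadata.
def net_saving(stream_bytes, unique_bytes, avg_chunk, ptr_len=20):
    n_unique = unique_bytes / avg_chunk      # unique chunks to index
    metadata = n_unique * ptr_len
    return stream_bytes - unique_bytes - metadata

# Hypothetical 1 GiB stream: finer chunking exposes a little more
# redundancy (580 vs 600 MiB unique) but needs 16x the pointers.
coarse = net_saving(2**30, 600 * 2**20, avg_chunk=8192)
fine   = net_saving(2**30, 580 * 2**20, avg_chunk=512)
```

Here the fine-grained setting removes more raw redundancy yet ends up saving fewer bytes overall, which is exactly the diminishing-returns regime the paper targets.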
Main idea: after a coarse-grained CDC pass, FBC performs a second-level (fine-grained) chunking that identifies chunks whose frequency exceeds a predefined threshold.
This method requires a strong assumption: knowledge of individual chunk frequencies. It therefore proceeds in two steps: chunk frequency estimation, then chunking.
Here, the frequency of a chunk is the number of times that unique chunk appears in the stream, and the per-chunk metadata cost is the minimal length of a chunk pointer to each unique chunk, e.g. 20 bytes for a SHA-1 hash.
How can the overhead of estimation be reduced? The algorithm only requires knowledge of the high-frequency chunk candidates.
One interesting observation is that a majority of the chunk candidates have low frequencies. Based on this, the paper applies a filtering process to eliminate as many low-frequency chunks as possible, reserving resources to obtain accurate frequency estimates for the high-frequency chunks.
The prefiltering stage has a parameter that controls the sample rate; for example, with a sample rate of 1/32, a chunk candidate passes only if its Rabin fingerprint value modulo 32 equals 1.
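The modulo-based sampling rule can be sketched as below. This is an assumption-laden illustration: a truncated SHA-1 digest stands in for the Rabin fingerprint the paper uses, and the 1/32 rate mirrors the "modulo 32 equals 1" example.

```python
import hashlib

# Deterministic distinct sampling: a chunk candidate passes only if its
# fingerprint falls in one fixed residue class, giving a ~1/modulus
# sample rate. The same chunk always gets the same decision.
def sampled(chunk: bytes, modulus=32, residue=1):
    # Truncated SHA-1 used here as a stand-in fingerprint (assumption).
    fp = int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")
    return fp % modulus == residue
```

The key property is determinism: unlike random sampling, every occurrence of the same chunk is either always sampled or never sampled, so the sampled counts remain proportional to the true frequencies.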
If the chunk is found in each of the Bloom filters, it passes through the parallel filter and its frequency starts to be counted. Otherwise, the chunk is recorded in one of the Bloom filters chosen at random.
Setting aside false positives from the Bloom filters, only chunk candidates with frequency greater than the number of Bloom filters can possibly pass the parallel filtering process.
Because the Bloom filter is such a compact data structure, the parallel filter consumes only a small amount of memory; in this way the majority of the low-frequency chunk candidates are filtered out with a small amount of resources.
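A minimal sketch of this parallel filtering step, under stated assumptions: the Bloom filter size, the number of filters `k`, and the two-probe hashing are illustrative choices, not the paper's parameters.

```python
import hashlib
import random

class Bloom:
    """Tiny Bloom filter: m bits, two hash probes per item (illustrative)."""
    def __init__(self, m=1 << 16):
        self.m, self.bits = m, bytearray(m // 8)
    def _probes(self, item: bytes):
        d = hashlib.sha1(item).digest()
        for off in (0, 8):
            yield int.from_bytes(d[off:off + 8], "big") % self.m
    def add(self, item):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._probes(item))

# Parallel filter: a chunk passes (and gets counted) only once it has been
# recorded in every one of the k filters; until then each occurrence is
# recorded in one randomly chosen filter. So, false positives aside, a
# chunk must appear more than k times before counting can begin.
def parallel_filter(stream, k=4, seed=0):
    rng = random.Random(seed)          # seeded for reproducibility
    filters = [Bloom() for _ in range(k)]
    counts = {}
    for chunk in stream:
        if all(chunk in f for f in filters):
            counts[chunk] = counts.get(chunk, 0) + 1
        else:
            filters[rng.randrange(k)].add(chunk)
    return counts
```

Memory stays fixed at k small bit arrays plus exact counters only for the few chunks that survive the filter, which is the point of the design.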
The paper notes that although there is a rich literature on detecting high-frequency items in a data stream, the threshold for high-frequency chunks in this problem is usually as low as 15, and classical heavy-hitter detection algorithms do not scale under such a low threshold.
Datasets:
- Linux and Nytimes: data streams exhibiting a high degree of redundancy
- Mailbox: a dataset with low detectable redundancy
Evaluation criteria:
- Duplicate Elimination Ratio (DER)
- Average chunk size

Results: FBC achieves a better dedup gain while producing far fewer chunks.
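The Duplicate Elimination Ratio can be computed as sketched below, under one common definition (an assumption here, since the summary does not spell it out): original stream size divided by bytes actually stored, counting both unique chunk data and a fixed-size pointer per unique chunk.

```python
# DER = total input bytes / stored bytes, where stored bytes include the
# unique chunk payloads plus one pointer (e.g. 20-byte SHA-1) per unique
# chunk. Definition and ptr_len are assumptions for illustration.
def der(chunks, ptr_len=20):
    total = sum(len(c) for c in chunks)
    unique = set(chunks)
    stored = sum(len(c) for c in unique) + ptr_len * len(unique)
    return total / stored

# Four 100-byte chunks, one of them repeated three times:
example = [b"A" * 100, b"B" * 100, b"A" * 100, b"A" * 100]
```

Counting the pointer bytes is what couples DER to average chunk size: smaller chunks inflate the pointer term, which is why DER alone (without chunk count) would be a misleading criterion.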
Strength: FBC considers the frequency information of data segments during the chunking process, and does so with a small memory footprint.
Future work: design a one-pass algorithm that performs frequency estimation and chunking simultaneously. An intuitive idea is to track the top-k most frequent segments instead of all segments whose frequency exceeds a certain threshold, since that threshold must be determined from domain knowledge of the datasets.
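One standard way to track top-k items in a single pass with bounded memory is the Space-Saving algorithm; a sketch is below. This is my choice of technique for illustration, not something the paper prescribes.

```python
# Space-Saving sketch: keep at most k counters. A new item evicts the
# current minimum and inherits its count + 1, so counts are upper bounds
# (overestimated by at most the evicted minimum), but memory stays O(k)
# no matter how long the stream is and no threshold must be chosen.
def space_saving(stream, k=3):
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts
```

Replacing a frequency threshold with a top-k budget sidesteps the dataset-specific tuning noted above: k bounds memory directly, regardless of how chunk frequencies are distributed.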