A Study of Practical Deduplication

Venue: FAST'11
Category: Deduplication workload analysis

1. Summary

Motivation of this paper

Deduplication can work at either the sub-file or whole-file level.

More fine-grained deduplication creates more opportunities for space savings. The drawback is that it reduces the sequential layout of some files, which impacts performance when disks are used for storage.
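
As a toy illustration of the two granularities (the chunk size and helper names here are arbitrary choices, not settings from the paper), a whole-file deduplicator keeps one hash per file, while a sub-file deduplicator keeps one hash per chunk:

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # illustrative chunk size, not the paper's setting


def whole_file_hash(path: Path) -> str:
    """Whole-file deduplication: a single hash identifies the entire file."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def chunk_hashes(path: Path) -> list[str]:
    """Sub-file deduplication: one hash per fixed-size chunk."""
    hashes = []
    with path.open("rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            hashes.append(hashlib.sha1(chunk).hexdigest())
    return hashes


def duplicate_fraction(paths: list[Path]) -> float:
    """Fraction of chunk hashes that are duplicates across the given files."""
    all_hashes = [h for p in paths for h in chunk_hashes(p)]
    return 1 - len(set(all_hashes)) / len(all_hashes)
```

Finer chunks tend to raise the duplicate fraction (more savings) but scatter a file's blocks across the store, which is the sequentiality cost noted above.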

The data set covers 857 file systems spanning 162 terabytes of disk, collected over 4 weeks from a broad cross-section of employees.

The paper also conducts a study of metadata and data layout, finding that:

  1. The trend of storage being consumed by files of increasing size continues unabated.
  2. File-level fragmentation is not widespread.

Method Name

To reduce the size of the data set, hashes that occur only once are filtered out (see the two-pass algorithm below). The configuration with the largest number of unique hashes had somewhat more than 768M; about two thousand of those (0.0003%) are expected to be false matches due to the truncated hash.
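
As a rough sanity check of that estimate, a birthday-style bound gives the expected number of hashes that falsely collide with some other hash as roughly n²/2^b for n hashes of b usable bits. The 48-bit truncation below is an illustrative assumption, not a figure stated in these notes:

```python
# Birthday-style estimate of false matches among truncated hashes.
n = 768_000_000  # roughly 768M unique hashes
b = 48           # assumed number of retained hash bits (illustrative)

expected_false = n * n / 2 ** b  # hashes expected to collide with some other hash
print(f"{expected_false:,.0f} false matches "
      f"({expected_false / n:.4%} of all hashes)")
# -> about two thousand, i.e. on the order of 0.0003%
```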

Novel two-pass algorithm. First pass: when inserting a hash value that is already present in the Bloom filter, the value is also inserted into a second Bloom filter of equal size. Second pass: each hash is compared against the second Bloom filter only; if it is not found there, the hash has certainly been seen exactly once and can be omitted from the database. (Very simple.)
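
A minimal sketch of such a two-pass filter, using a toy Bloom filter (the sizing and hash functions here are arbitrary, not the paper's configuration). Hashes that occur more than once are never dropped; some singletons may survive because Bloom filters can report false positives:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m_bits: int = 1 << 20, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def drop_singletons(hashes: list[bytes]) -> list[bytes]:
    """Two-pass filtering of hashes that were seen exactly once.

    Pass 1: insert every hash into `seen_once`; if it is already there
            (a possible repeat), also insert it into `seen_twice`.
    Pass 2: a hash absent from `seen_twice` was certainly seen exactly once
            and can be omitted; everything else is kept.
    """
    seen_once, seen_twice = BloomFilter(), BloomFilter()
    for h in hashes:  # first pass
        if h in seen_once:
            seen_twice.add(h)
        else:
            seen_once.add(h)
    return [h for h in hashes if h in seen_twice]  # second pass
```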

Two settings for deduplication are considered:

  1. Deduplication in primary storage.
  2. Deduplication in backup storage: performance in secondary storage is less critical than in primary storage, so the reduced sequentiality of a block-level deduplicated store is of lesser concern.

The metadata analysis considers file system age, capacity, fullness, and the number of files and directories.

The paper argues that it is not true that file system performance degrades over time largely due to fragmentation, because modern operating systems include a defragmenter.

Implementation and Evaluation

The data comes from machines running Windows, including Windows Vista and Windows Server.

2. Strength (Contributions of the paper)

  1. The paper leverages a two-pass algorithm to filter out those chunk hashes that appear only once.
  2. The main contribution of this work is that it also tracks file system fragmentation and data placement, which had not previously been analyzed, or not at this scale.

The analysis covers file system data, metadata, and data layout.

3. Weakness (Limitations of the paper)

4. Future Works

  1. This paper presents the CDF and the histogram in the same figure, which is a good way to show a data distribution; see the plotting sketch below.
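
A minimal matplotlib sketch of that presentation style, using synthetic file sizes purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, log-normally distributed "file sizes" (illustrative only).
rng = np.random.default_rng(0)
sizes = rng.lognormal(mean=10, sigma=2, size=100_000)

fig, ax_hist = plt.subplots()
ax_cdf = ax_hist.twinx()  # second y-axis so the CDF shares the x-axis

bins = np.logspace(np.log10(sizes.min()), np.log10(sizes.max()), 50)
ax_hist.hist(sizes, bins=bins, alpha=0.5, color="tab:blue")

xs = np.sort(sizes)
ax_cdf.plot(xs, np.arange(1, len(xs) + 1) / len(xs), color="black")

ax_hist.set_xscale("log")
ax_hist.set_xlabel("file size (bytes)")
ax_hist.set_ylabel("count (histogram)")
ax_cdf.set_ylabel("cumulative fraction (CDF)")
plt.show()
```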