| Venue | Category |
|---|---|
| FAST'11 | Deduplication workload analysis |
# A Study of Practical Deduplication

1. Summary
   - Motivation of this paper
   - Method Name
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Future Works
Deduplication can work at either the sub-file or whole-file level.
More fine-grained deduplication creates more opportunities for space savings. Drawback: it reduces the sequential layout of some files, which hurts performance when disks are used for storage.
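The trade-off above can be illustrated with a toy comparison of whole-file versus fixed-size block deduplication. This is a minimal sketch, not the paper's methodology; the file contents, block size, and function name are illustrative assumptions.

```python
# Toy whole-file vs. fixed-size-block deduplication (illustrative only;
# the paper measures real file system scans, not synthetic data).
import hashlib

def dedup_size(files, block_size=None):
    """Bytes stored after deduplication; block_size=None means whole-file."""
    seen = set()
    total = 0
    for data in files:
        chunks = [data] if block_size is None else [
            data[i:i + block_size] for i in range(0, len(data), block_size)]
        for chunk in chunks:
            digest = hashlib.sha1(chunk).digest()
            if digest not in seen:       # store each unique chunk once
                seen.add(digest)
                total += len(chunk)
    return total

files = [b"A" * 4096 + b"B" * 4096,   # two distinct 4 KiB blocks
         b"A" * 4096 + b"C" * 4096]   # shares its first block with file 1

whole = dedup_size(files)             # no whole-file match: both files stored
block = dedup_size(files, 4096)       # the shared 4 KiB block is stored once
```

Here block-level deduplication stores 12 KiB instead of 16 KiB, at the cost of fragmenting each file into separately tracked chunks.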
The study covers 857 file systems spanning 162 terabytes of disk, scanned over 4 weeks, collected from a broad cross-section of employees.
This paper also conducts a study of metadata and data layout.
- the trend of storage being consumed by files of increasing size continues unabated.
- file-level fragmentation is not widespread.
Hashes were truncated in order to reduce the size of the data set. The largest data set had the largest number of unique hashes, somewhat more than 768M; the authors expect that about two thousand of those (0.0003%) are false matches due to the truncated hash.
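A back-of-the-envelope birthday-bound calculation shows where a false-match estimate like this comes from. The bit width of the truncated hash is an assumption here (it is not stated in these notes), so the function is parameterized over it.

```python
# Expected accidental collisions among n distinct values hashed/truncated
# to b bits is roughly C(n, 2) / 2**b (birthday bound).
# NOTE: the truncation width b is an assumption, not taken from the paper.
def expected_false_matches(n, b):
    return n * (n - 1) / 2 / 2 ** b
```

For example, with n = 768M unique hashes and an assumed 48-bit truncated hash, the bound gives on the order of a thousand false matches, i.e. a vanishing fraction of the data set, which is why truncation is an acceptable space optimization here.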
Novel 2-pass algorithm: in the first pass, whenever inserting a hash finds that it is already (possibly) present in the Bloom filter, the hash is also inserted into a second Bloom filter of equal size. In the second pass, each hash is compared against the second Bloom filter only; if it is not found there, the hash was certainly seen exactly once and can be omitted from the database. (very simple)
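The two-pass idea can be sketched as follows. This is a minimal toy implementation, not the paper's code; the filter size, hash count, and function names are illustrative assumptions.

```python
# Sketch of the two-pass Bloom-filter pruning described above.
# Filter parameters are toy values chosen for illustration.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        """Insert item; return True if it MAY have been present already."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False            # at least one bit was unset
                self.bits[byte] |= 1 << bit
        return seen

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

def hashes_seen_more_than_once(hashes):
    first, second = BloomFilter(), BloomFilter()
    # Pass 1: a hash whose insertion "hits" the first filter was possibly
    # seen before, so record it in the second filter.
    for h in hashes:
        if first.add(h):
            second.add(h)
    # Pass 2: a hash absent from the second filter was certainly seen
    # exactly once and can be dropped from the deduplication database.
    return [h for h in hashes if h in second]
```

Note the asymmetry that makes the trick work: the second filter can produce false positives (keeping a few singleton hashes unnecessarily), but a miss in it is a guarantee, so no duplicated hash is ever discarded.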
Metadata studied: age, capacity, fullness, and the number of files and directories.
It argues against the conventional wisdom that file system performance degrades over time, largely due to fragmentation, because modern operating systems run a defragmenter regularly.
Windows, Windows Vista, Windows Server
file system data, metadata, data layout