Cluster and Single-Node Analysis of Long-Term Deduplication Patterns

| Venue | Category |
| --- | --- |
| ToS'18 | Distributed Deduplication |


1. Summary

Motivation of this paper

Cluster and Single-node analysis

The analysis needs to detect newly added and modified files.

By comparing two consecutive snapshots, it can identify whether a file was newly added or modified by checking the mtime (modification time).
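
A minimal sketch of this snapshot diffing, assuming each snapshot is available as a mapping from file path to mtime (the paper's actual snapshot format contains much richer per-file and per-chunk metadata):

```python
# Hypothetical snapshot representation: {path: mtime}. Classify files by
# comparing two consecutive snapshots.
def diff_snapshots(prev: dict, curr: dict):
    added = [p for p in curr if p not in prev]
    modified = [p for p in curr if p in prev and curr[p] != prev[p]]
    deleted = [p for p in prev if p not in curr]
    return added, modified, deleted

prev = {"/home/u1/a.txt": 100, "/home/u1/b.txt": 200}
curr = {"/home/u1/a.txt": 100, "/home/u1/b.txt": 250, "/home/u1/c.txt": 300}
print(diff_snapshots(prev, curr))
# (['/home/u1/c.txt'], ['/home/u1/b.txt'], [])
```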

Following the methodology of FAST'12, metadata overhead is included in the deduplication ratio. Let $f$ be the size of each chunk's metadata divided by the chunk size, $L$ the data size before deduplication, and $P$ the raw data size afterwards. Then $L \cdot f$ is the size of the file recipes and $P \cdot f$ is the size of the hash index, giving

$$\text{deduplication ratio} = \frac{L}{P + L \cdot f + P \cdot f}$$
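
As a rough illustration of this formula (not the paper's numbers), the sketch below assumes about 30 bytes of metadata per chunk entry and holds the post-deduplication size fixed, just to isolate how the metadata term grows as chunks shrink:

```python
# Metadata-aware deduplication ratio: L / (P + L*f + P*f), where
# f = metadata bytes per chunk entry / chunk size. The 30-byte entry size
# and the data sizes below are illustrative assumptions.
def dedup_ratio(logical, physical, chunk_size, meta_per_chunk=30):
    f = meta_per_chunk / chunk_size   # metadata-to-data ratio
    recipes = logical * f             # file recipes: one entry per logical chunk
    index = physical * f              # hash index: one entry per unique chunk
    return logical / (physical + recipes + index)

# In reality P also shrinks with smaller chunks; it is held fixed here only
# to show the growing metadata cost.
for c in (2**12, 2**16):              # 4 KiB vs. 64 KiB chunks
    print(c, round(dedup_ratio(logical=10e12, physical=2e12, chunk_size=c), 2))
```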

Larger chunk sizes reduce the number of metadata entries; reducing the amount of metadata also reduces the number of I/Os to the chunk index.

The skew in chunk popularity has also been found in primary storage and HPC systems. Identifying such popular chunks would be useful for optimizing performance, e.g., to accelerate chunk indexing and improve cache hit ratios.
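
A small sketch of how popular chunks could be identified from a trace of chunk fingerprints (the trace format and cache-sizing choice are assumptions for illustration):

```python
# Count fingerprint occurrences in a trace and keep the hottest chunks,
# e.g., as candidates for pinning in a fingerprint cache.
from collections import Counter

def top_chunks(fingerprints, k=3):
    return Counter(fingerprints).most_common(k)

trace = ["aa", "bb", "aa", "cc", "aa", "bb", "dd"]
print(top_chunks(trace))   # [('aa', 3), ('bb', 2), ('cc', 1)]
```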

Chunk-level routing is required to achieve exact deduplication, but it results in more CPU and memory consumption.
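
A minimal sketch of stateless chunk-level routing, where each chunk is routed by its fingerprint so that all duplicates of a chunk land on the same node (the node count and hashing scheme are illustrative, not the paper's algorithms):

```python
# Route each chunk by its content fingerprint: duplicates always map to the
# same node, so exact deduplication is possible, but every chunk must be
# fingerprinted and routed individually (extra CPU and memory).
import hashlib

NUM_NODES = 4

def route_chunk(chunk: bytes) -> int:
    fp = hashlib.sha1(chunk).digest()
    return int.from_bytes(fp[:8], "big") % NUM_NODES

print(route_chunk(b"hello"), route_chunk(b"hello"))  # same chunk -> same node
```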

Implementation and Evaluation

2. Strength (Contributions of the paper)

  1. They classify data-routing algorithms, implement seven published algorithms that adopt different strategies, and provide a detailed comparative analysis.
  2. They study a locally collected dataset that spans a period of 2.5 years (about 4,000 daily user snapshots).

3. Weakness (Limitations of the paper)

  1. Does not consider aging and fragmentation effects in the trace.
  2. Does not consider restore performance, especially for deduplication clusters.

4. Some Insights (Future work)

  1. Because of the size of the chunk index itself, smaller chunk sizes are not always better at saving space; the impact of chunk sizes on metadata overhead matters.

  2. Even similar users behave quite differently, and this should be taken into account in future deduplication systems.
  3. In distributed deduplication, a routing algorithm that achieves a good physical load balance may still lead to a huge skew in the logical distribution (see the sketch after this list).
  4. Prior deduplication traces:
     - SYSTOR'09: virtual machine disk images
     - ATC'11: Microsoft primary storage (857 snapshots over 4 weeks)
     - FAST'12: EMC's Data Domain backup system (backup)
     - SYSTOR'12: file analysis (file type and high redundancy)
     - SC'12: HPC trace
  5. Routing by file type can generally achieve a better deduplication ratio than other schemes.
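
A sketch of how the physical/logical imbalance in insight 3 could be measured, assuming a routing decision has already been recorded as (node, fingerprint) pairs; the toy trace is purely illustrative:

```python
# Per-node load: logical = number of chunk references routed to the node,
# physical = number of unique chunks the node actually stores.
from collections import defaultdict

def load_stats(routed_chunks):
    logical = defaultdict(int)
    physical = defaultdict(set)
    for node, fp in routed_chunks:
        logical[node] += 1          # every reference counts logically
        physical[node].add(fp)      # only unique chunks occupy space
    return {n: (logical[n], len(physical[n])) for n in logical}

trace = [(0, "a"), (0, "a"), (0, "a"), (0, "a"), (0, "b"), (0, "c"),
         (1, "d"), (1, "e"), (1, "f")]
print(load_stats(trace))
# {0: (6, 3), 1: (3, 3)} -> physical load is balanced, logical load is skewed
```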