A Simulation Analysis of Redundancy and Reliability in Primary Storage Deduplication

Venue: TC'18
Category: Deduplication Reliability


1. Summary

Motivation of this paper

Deduplication mitigates the possibility of data loss by reducing storage footprints, but it also amplifies the severity of each data loss event, which may corrupt multiple chunks or files that share the same lost data.

Existing work adds redundancy to post-deduplication data via replication or erasure coding, and proposes quantitative methods to evaluate the reliability of deduplication storage.

Loss variations: different failure types (device failures, latent sector errors) and different storage granularities (chunks, files). Repair strategies: they determine whether important data copies are repaired first, and thus affect reliability in different ways (see the sketch below).
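As a hedged illustration of why repair order matters (hypothetical `Chunk` records and helper name, not the paper's simulator), the sketch below contrasts a physical-address sweep with a most-referenced-first policy; the latter restores more logical data earlier in the rebuild:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    address: int      # physical location of the chunk
    ref_count: int    # number of files/chunks referencing it

def repair_order(lost_chunks, strategy="physical"):
    """Return the order in which lost chunks are repaired.

    'physical'   -- sweep the disk in address order (a plain rebuild).
    'importance' -- repair the most referenced chunks first, so more
                    logical data becomes readable earlier in the rebuild.
    """
    if strategy == "physical":
        return sorted(lost_chunks, key=lambda c: c.address)
    return sorted(lost_chunks, key=lambda c: c.ref_count, reverse=True)

# Toy example: three lost chunks with skewed reference counts.
lost = [Chunk(address=10, ref_count=1),
        Chunk(address=20, ref_count=500),
        Chunk(address=30, ref_count=3)]
print([c.address for c in repair_order(lost, "physical")])    # [10, 20, 30]
print([c.address for c in repair_order(lost, "importance")])  # [20, 30, 10]
```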

Method Name

  1. FSL: pick nine random snapshots, each with a raw size of at least 100GB

Mac OS X server: user011 - user026 (eight users), taken from different users' home directories with various types of files.

  2. MS: collected at Microsoft and published on SNIA. The study focuses on a total of 903 file system snapshots that were collected in a single week

Time: September 18, 2009; average chunk size: 8KB.

Also, the study considers the notion of a deduplication domain (a set of file system snapshots over which deduplication is performed).

The deduplication domain size specifies the number of file system snapshots included in a deduplication domain; the study generates 10 random deduplication domains for each domain size (a minimal sampling sketch follows).
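A minimal sketch of the domain sampling, assuming a flat list of snapshot identifiers and a hypothetical helper name (the paper does not specify its sampling code):

```python
import random

def make_dedup_domains(snapshot_ids, domain_size, num_domains=10, seed=0):
    """Generate `num_domains` random deduplication domains, each a set of
    `domain_size` file system snapshots drawn without replacement."""
    rng = random.Random(seed)
    return [rng.sample(snapshot_ids, domain_size) for _ in range(num_domains)]

# e.g. 10 random domains of 50 snapshots each, drawn from 903 snapshots
domains = make_dedup_domains([f"fs{i:03d}" for i in range(903)], domain_size=50)
```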

  1. Reference counts. Intuition: the importance of a chunk is proportional to its reference count.

The majority of chunks have small reference counts (e.g., they are referenced exactly once), but losing the highly referenced chunks may lead to severe loss of information as well as high deviations in the reliability simulations (see the sketch below).
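A minimal sketch of how the reference-count skew could be measured from a logical chunk-fingerprint stream (a hypothetical trace format, not the FSL/MS tooling):

```python
from collections import Counter

def reference_count_stats(fingerprints):
    """Compute per-chunk reference counts from a logical chunk stream and
    report how skewed the distribution is."""
    refs = Counter(fingerprints)                      # fingerprint -> reference count
    counts = sorted(refs.values(), reverse=True)
    total = len(counts)
    singly_referenced = sum(1 for c in counts if c == 1)
    top1pct = counts[: max(1, total // 100)]          # most referenced chunks
    return {
        "unique_chunks": total,
        "referenced_once_fraction": singly_referenced / total,
        "top1pct_reference_share": sum(top1pct) / sum(counts),
    }
```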

  1. How to determine similar files? Candidate indicators are that two files:
  1. share the same minimum chunk fingerprint (MinHash, Broder's theorem; see the sketch after this list)
  2. share the same maximum chunk fingerprint
  3. have the same extension (an additional indicator of whether two files are similar)
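A minimal sketch of the minimum-fingerprint criterion, assuming a hypothetical `file_chunks` mapping from file path to its list of chunk fingerprints (not the paper's implementation). By Broder's theorem, the probability that two files share the same minimum fingerprint equals the Jaccard similarity of their chunk sets, so grouping by minimum fingerprint clusters likely-similar files:

```python
from collections import defaultdict

def group_similar_files(file_chunks):
    """Group files that share the same minimum chunk fingerprint.

    file_chunks: dict mapping file path -> list of chunk fingerprints.
    Files that land in the same group are likely similar, since the chance
    of a shared minimum fingerprint equals their Jaccard similarity.
    """
    groups = defaultdict(list)
    for path, fingerprints in file_chunks.items():
        if fingerprints:                               # skip empty files
            groups[min(fingerprints)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```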

FSL shows significant fractions of intra-file redundancy. The most common redundancy source is duplicate files, which makes whole-file deduplication effective.

  1. failure patterns
  2. metadata
  3. data layout


Highly referenced chunks only account for a small fraction of physical capacity after deduplication, and the chunk reference counts show a long-tailed distribution. It is therefore possible to allocate a small dedicated physical area for storing extra copies of highly referenced chunks.

Solution:

  1. allocate the first 1% of physical sectors for the highly referenced chunks
  2. sort the chunks by their reference counts, and fill the dedicated sectors with the top 1% most highly referenced chunks.

This process is done offline (no need to change the write/read path) and incurs only moderate storage overhead (see the sketch below).
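A minimal sketch of this offline placement pass, with a hypothetical chunk tuple format `(fingerprint, ref_count, size_in_sectors)`; the paper's actual layout policy may differ:

```python
def assign_dedicated_area(chunks, total_sectors, dedicated_fraction=0.01):
    """Offline pass: choose chunks whose extra copies fill a dedicated area
    covering the first `dedicated_fraction` of physical sectors.

    `chunks` is a list of (fingerprint, ref_count, size_in_sectors) tuples;
    the original copies and the read/write path are left untouched.
    """
    budget = int(total_sectors * dedicated_fraction)
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)  # by ref_count
    placed, used = [], 0
    for fingerprint, ref_count, size in ranked:
        if used + size > budget:
            break
        placed.append(fingerprint)
        used += size
    return placed, used   # fingerprints to replicate, sectors consumed
```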

Implementation and Evaluation

  1. Observation 1: Compared to the case without deduplication, deduplication significantly alters the expected amount of chunks corrupted by USEs (a toy illustration follows this list).
  2. Observation 2: The logical repair progress is affected by the placement of highly referenced chunks and the severity of chunk fragmentation.
  3. Observation 3: If highly referenced chunks are not carefully placed and repaired preferentially, deduplication can lead to more corrupted chunks in the presence of UDFs.
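Observation 1 can be illustrated in spirit with a toy Monte Carlo experiment (not the paper's simulator): sample a fixed number of USEs uniformly over the post-deduplication physical chunks and count how many logical chunks become unreadable. The skew in reference counts is what inflates the expected damage:

```python
import random

def expected_corrupted_chunks(ref_counts, num_errors, trials=1000, seed=0):
    """Toy Monte Carlo: each physical chunk occupies one sector; a USE on a
    physical chunk corrupts all `ref_count` logical chunks that share it.
    Without deduplication each logical chunk has its own sector, so the same
    number of errors would corrupt only `num_errors` logical chunks.
    """
    rng = random.Random(seed)
    n_physical = len(ref_counts)
    total = 0
    for _ in range(trials):
        hit = rng.sample(range(n_physical), num_errors)   # sectors with USEs
        total += sum(ref_counts[i] for i in hit)
    return total / trials

# Skewed reference counts amplify the damage of each error after dedup.
refs = [1] * 990 + [100] * 10
print(expected_corrupted_chunks(refs, num_errors=5))   # clearly above 5
```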

2. Strength (Contributions of the paper)

  1. This paper studies the redundancy characteristics of the file system snapshots from two aspects:

the reference counts of chunks and the redundancy sources of duplicate chunks. Findings: the minimum chunk fingerprint (min-hash) is the better indicator for determining similar files, and losing a chunk does not necessarily imply the corruption of many files.

  2. It proposes a trace-driven, deduplication-aware simulation framework to analyze and compare storage system reliability with and without deduplication.
  3. It applies this simulation framework and reports several key findings from its reliability analysis.

3. Weakness (Limitations of the paper)

4. Future Works

Because of the skewed reference-count distribution, a small dedicated physical area (with only 1% of physical capacity) can be assigned to the most referenced chunks, and repairing that physical area first improves reliability (while incurring only limited storage overhead).

  1. intra-file redundancy
  2. duplicate files
  3. similar files