Sliding Look-Back Window Assisted Data Chunk Rewriting for Improving Deduplication Restore Performance

Venue: FAST'19
Category: Deduplication Restore


1. Summary

Motivation of this paper

- data fragmentation: data chunks are scattered across containers
- read amplification: the size of data being read is larger than the size of data being restored
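As a minimal illustration of the read-amplification metric (the function name and example numbers are my own, not from the paper):

```python
def read_amplification(bytes_read: int, bytes_restored: int) -> float:
    """Ratio of data read from storage to data actually restored.

    With container-granularity reads, restoring data whose chunks are
    scattered over many containers forces reading far more than is needed.
    """
    return bytes_read / bytes_restored

# e.g., restoring 1 MiB whose chunks are spread over four full 4 MiB
# containers reads 16 MiB in total: 16x read amplification.
```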

This paper focuses on reducing data fragmentation and read amplification in container-based deduplication systems.

- improve data chunk locality
- make a better trade-off between the deduplication ratio and the number of required container reads (compared with Capping, FAST'13)
- reducing the number of container reads is a major task for restore performance improvement

- storing individual data chunks directly cannot efficiently utilize the storage bandwidth of systems with low random read/write performance (e.g., HDDs)
- deduplication systems therefore typically accumulate a number of data chunks in a container before writing them out together
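The accumulate-then-flush behavior can be sketched as below; the class name, 4 MiB container size, and flush callback are illustrative assumptions, not details from the paper:

```python
class ContainerWriter:
    """Accumulate chunks in an in-memory container; flush when full.

    Buffering turns many small chunk writes into one large sequential
    container write, which suits HDDs' poor random I/O performance.
    """
    CONTAINER_SIZE = 4 * 1024 * 1024  # 4 MiB, a commonly used container size

    def __init__(self, flush):
        self.flush = flush            # called with the full container's chunks
        self.chunks, self.filled = [], 0

    def add(self, chunk: bytes):
        self.chunks.append(chunk)
        self.filled += len(chunk)
        if self.filled >= self.CONTAINER_SIZE:
            self.flush(self.chunks)   # one large sequential write
            self.chunks, self.filled = [], 0
```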

Sliding Look-back window

Rewriting selected duplicate chunks into the current container avoids the need to read those chunks from scattered old containers during restore.

  1. Main idea: apply varied capping levels for different segments

- further reduce container reads
- achieve even fewer data chunk rewrites

  2. How to decide the capping level for different segments? Instead of using a fixed capping level as the selection threshold, it uses a CNRC value (the number of data chunks in the segment that belong to a given old container).

Rewrite duplicate data chunks from old containers whose CNRCs are lower than the threshold.

The actual capping level is decided by:

- the CNRC threshold
- the distribution of CNRCs across the segments
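A sketch of CNRC-based rewrite selection; the data layout (a segment as a list of `(chunk_id, old_container_id)` pairs) and the function name are my own assumptions:

```python
from collections import Counter

def select_rewrites(segment, threshold):
    """Decide which duplicate chunks of a segment to rewrite, based on CNRC.

    Old containers referenced fewer than `threshold` times contribute little
    restorable data per container read, so their chunks are rewritten; the
    containers that survive define the effective capping level for this
    segment, so the capping level varies with the CNRC distribution.
    """
    cnrc = Counter(c for _, c in segment)            # CNRC per old container
    rewrite = [chunk for chunk, c in segment if cnrc[c] < threshold]
    capping_level = sum(1 for n in cnrc.values() if n >= threshold)
    return rewrite, capping_level
```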

Two bounds for setting the CNRC threshold:

  1. Bound for deduplication ratio reduction: a targeted deduplication-ratio-reduction limit, which bounds the number of rewritten chunks

  2. Bound for container reads: a targeted number of container reads in one segment

Use those two bounds to determine the CNRC threshold.
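One way the two bounds can jointly pick a threshold is sketched below; the scan-from-high strategy and all names are illustrative assumptions, not the paper's exact algorithm:

```python
def choose_threshold(cnrcs, max_rewrite_ratio, max_container_reads):
    """Pick the largest CNRC threshold that respects both bounds.

    `cnrcs` maps old container -> CNRC for the current segment. A higher
    threshold rewrites more chunks (hurting the deduplication ratio) but
    leaves fewer old containers to read, so the two bounds pull in opposite
    directions; we scan candidate thresholds from high to low and keep the
    first one satisfying both.
    """
    total = sum(cnrcs.values())
    for threshold in sorted(set(cnrcs.values()), reverse=True):
        rewritten = sum(n for n in cnrcs.values() if n < threshold)
        reads = sum(1 for n in cnrcs.values() if n >= threshold)
        if rewritten / total <= max_rewrite_ratio and reads <= max_container_reads:
            return threshold
    return 0  # no rewriting satisfies the bounds
```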

Some rewritten chunks are wasted: during restore they would be served by the restore cache anyway, so rewriting them gains nothing.

Rewrite decisions made with statistics from only the current segment are less accurate.

The LBW acts as a recipe cache that maintains the metadata entries of data chunks in the order covered by the LBW in the byte stream.

metadata entry:

  1. chunk metadata
  2. offset in the byte stream
  3. container ID/address
  4. the offset in the container
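The four-field entry above can be captured as a small record; the field names and types here are my own illustrative choices:

```python
from dataclasses import dataclass

@dataclass
class MetadataEntry:
    """One LBW recipe-cache entry per data chunk (field names illustrative)."""
    fingerprint: bytes      # chunk metadata, e.g., its fingerprint
    size: int               # chunk size in bytes
    stream_offset: int      # offset of the chunk in the byte stream
    container_id: int       # ID/address of the container holding the chunk
    container_offset: int   # offset of the chunk within that container
```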

With both past and future information in the LBW, a more accurate decision can be made.

The whole process of rewrite selection:

  1. Step 1 (move LBW): when the LBW moves forward by one container size,

- one container size of data chunks (the added container) is added to the front of the LBW
- one container size of data chunks (the evicted container) is removed from the end of the LBW

  2. Step 2 (process the added container): classify the data chunks into three categories:

- unique chunks
- non-rewrite chunks (duplicate data chunks that will not be rewritten)
- candidate chunks (duplicate data chunks that may be rewritten)

  3. Step 3 (update metadata entries): update the metadata entries of the data chunks in the added container

identify candidate chunks and insert them into the rewrite candidate cache

  4. Step 4 (recalculate CNRCs): recalculate the CNRCs of the old containers that contain data chunks in the rewrite candidate cache

reclassify these data chunks

  5. Step 5 (rewrite): rewrite the remaining candidate chunks, update their metadata entries, and write the recipe persistently
  6. Step 6 (adjust): adjust the CNRC threshold at the end of each cycle to make a better trade-off between deduplication ratio reduction and the number of container reads
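The sliding steps above can be condensed into a toy loop; the data layout (a container as a list of `(fingerprint, old_container)` pairs, with `None` marking unique chunks) and every name below are simplifying assumptions, not the paper's implementation:

```python
from collections import Counter, deque

def lbw_cycle(lbw, added_chunks, threshold, window_containers):
    """One sliding cycle of the LBW (Steps 1-5), heavily simplified.

    `lbw` is a deque of containers supplied by the caller. Counting CNRCs
    over the whole window rather than a single segment is what lets both
    past and buffered future references refine the rewrite decision.
    """
    # Step 1: slide the window forward by one container.
    lbw.append(added_chunks)
    if len(lbw) > window_containers:
        lbw.popleft()                      # evicted container leaves the LBW

    # Step 2: duplicates of the added container are rewrite candidates.
    duplicates = [(fp, c) for fp, c in added_chunks if c is not None]

    # Steps 3-4: (re)calculate CNRCs over everything inside the window.
    cnrc = Counter(c for cont in lbw for _, c in cont if c is not None)

    # Step 5: rewrite candidates whose old containers stay under the threshold.
    return [fp for fp, c in duplicates if cnrc[c] < threshold]
```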

Implementation and Evaluation

Container I/O time dominates the total restore time (on HDDs).

Traces: six types of traces are selected from the FSL dataset

each trace contains 10 full backup snapshots

  1. Baseline methods

For rewrite:

- normal deduplication with no rewrite
- capping scheme (Capping)
- flexible container referenced count based scheme (FCRC)
- sliding look-back window scheme (LBW)

For restore cache:

- forward assembly area (FAA)
- adaptive look-ahead window chunk based caching (ALACC)

  2. Metrics

- deduplication ratio vs. speed factor
- restore performance
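Speed factor, introduced by the capping paper (FAST'13), is the mean amount of data restored per container read (the reciprocal of containers read per MB restored); a one-line helper, with illustrative naming:

```python
def speed_factor(mb_restored: float, containers_read: int) -> float:
    """Mean MB of data restored per container read; higher is better.

    Equivalently 1 / (mean containers read per MB of data restored),
    as defined in the FAST'13 capping paper.
    """
    return mb_restored / containers_read
```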

2. Strength (Contributions of the paper)

3. Weakness (Limitations of the paper)


4. Future Works

  1. To use containers or not?

- HYDRAstor, iDedup, Dmdedup, and ZFS do not use a container-based layout
- container-based systems such as Veritas and Data Domain benefit from high sequential I/O performance and good chunk locality

  2. Adapting to different workload characteristics: one starting point of this paper is that the performance of previous schemes is closely related to workload characteristics.

Insight: this is the motivation for proposing an adaptive method.

  3. Future work

- how to design a scheme that adaptively changes the LBW size
- more intelligent rewrite policies
- how to combine garbage collection with the rewrite design