ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching

Venue: FAST'18
Category: containers, deduplication


1. Summary

Motivation of this paper:

Requesting a unique or duplicate data chunk may trigger a container read if the data chunk is not currently available in memory. Due to severe data fragmentation and the size mismatch between the requested data and the I/O unit, the restore performance is much lower than that of directly reading back data that has not been deduplicated.

Three design questions arise:

  1. How to handle changes in workload locality?
  2. How to use the look-ahead window to improve the cache hit ratio?
  3. How to make a better trade-off between computing overhead and restore performance?
| | Container-based Caching | Chunk-based Caching | Forward Assembly |
|---|---|---|---|
| Advantage | Less operating and management overhead | High cache hit ratio (can be improved by using a look-ahead window) | Highly efficient, low operating and management overhead |
| Disadvantage | Relatively higher cache miss ratio, especially when the caching space is limited | Higher operating and management overhead | Workload sensitive; requires good workload locality |
Two general directions to improve restore performance:

  1. Select and store some duplicated data chunks during the deduplication process.
  2. Design efficient caching policies for the restore process.

Adaptive Look-Ahead Chunk Caching (ALACC)

ALACC = forward assembly + chunk-based caching + a look-ahead window (LAW), all managed within a limited memory space.

  1. Design a hybrid scheme that combines chunk-based caching and forward assembly.
  2. Provide a new way to make better decisions on which data chunks are to be cached or evicted.

Divide the LAW into two portions: the FAA-covered range (same size as the FAA) plus the remaining range, which provides information for the chunk cache (used by the caching policy).

Restore process (a minimal code sketch follows the list):

  1. Check whether the chunk is in the container read buffer.
  2. Check whether the chunk is in the chunk cache.
  3. If the chunk is not in the chunk cache, read the container that holds the chunk from storage.
  4. Check each chunk in this container against the LAW to identify all the locations where it appears in the FAA.
  5. Decide which chunks in this container are to be inserted into the chunk cache according to the caching policy.
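Below is a minimal Python sketch of this lookup order. The in-memory state layout, the `read_container` and `caching_policy` callables, and all variable names are assumptions for illustration, not the paper's implementation (which is an 11K-line C program).

```python
from collections import OrderedDict

# Hypothetical in-memory state (names and layouts are assumptions):
read_buffer = {}             # fingerprint -> chunk data of the most recently read container
chunk_cache = OrderedDict()  # fingerprint -> chunk data, managed LRU-style
faa = {}                     # stream offset -> chunk data for the FAA-covered range
law = {}                     # fingerprint -> list of stream offsets inside the look-ahead window

def restore_chunk(fp, container_id, read_container, caching_policy):
    """Return the data for fingerprint `fp`, following the lookup order above."""
    # 1. Check the container read buffer.
    if fp in read_buffer:
        return read_buffer[fp]
    # 2. Check the chunk cache.
    if fp in chunk_cache:
        chunk_cache.move_to_end(fp)          # keep recently used chunks hot
        return chunk_cache[fp]
    # 3. Miss: read the whole container holding this chunk from storage.
    read_buffer.clear()
    read_buffer.update(read_container(container_id))
    # 4. Match every chunk of the container against the LAW and fill all
    #    positions where it appears in the FAA-covered range.
    for cfp, data in read_buffer.items():
        for offset in law.get(cfp, ()):
            if offset in faa:
                faa[offset] = data
    # 5. Let the caching policy decide which container chunks to keep.
    caching_policy(read_buffer, law, chunk_cache)
    return read_buffer[fp]
```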

Caching policy:

  1. Divide the chunks in the read-in container into three types:
     - U-chunk (Unused chunk): does not appear anywhere in the current LAW.
     - P-chunk (Probably used chunk): appears in the current FAA but does not appear in the second portion of the LAW.
     - F-chunk (Future used chunk): will be used in the second portion of the LAW.
  2. Caching principle: F-chunks should be cached; if cache space remains, some P-chunks may also be cached (priorities among them are also defined; see the sketch after this list).
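A hedged sketch of this classification and the F-before-P insertion order, under assumed data layouts: the LAW is a mapping from fingerprint to stream offsets, `faa_end` marks the boundary between the FAA-covered range and the second portion of the LAW, and the chunk cache is an LRU-style OrderedDict. The exact priority and eviction rules in the paper are more detailed than this.

```python
from collections import OrderedDict

def classify_chunk(fp, law, faa_end):
    """Classify a chunk from a read-in container as 'U', 'P', or 'F'."""
    offsets = law.get(fp, ())
    if not offsets:
        return "U"                                    # not referenced anywhere in the LAW
    if any(off >= faa_end for off in offsets):
        return "F"                                    # needed in the second portion of the LAW
    return "P"                                        # only needed inside the current FAA

def cache_container_chunks(container, law, faa_end, chunk_cache, capacity):
    """Insert F-chunks first, then P-chunks if space remains; skip U-chunks."""
    rank = {"F": 0, "P": 1, "U": 2}
    ranked = sorted(container.items(),
                    key=lambda kv: rank[classify_chunk(kv[0], law, faa_end)])
    for fp, data in ranked:
        kind = classify_chunk(fp, law, faa_end)
        if kind == "U":
            continue
        if kind == "P" and len(chunk_cache) >= capacity:
            break                                     # no room left for lower-priority chunks
        chunk_cache[fp] = data
        while len(chunk_cache) > capacity:
            chunk_cache.popitem(last=False)           # evict in LRU order

# Example (hypothetical): FAA covers offsets 0..49, LAW covers 0..99.
cache = OrderedDict()
law = {"a": [10], "b": [70], "c": []}
cache_container_chunks({"a": b"..", "b": b"..", "c": b".."}, law, faa_end=50,
                       chunk_cache=cache, capacity=2)
# -> "b" (F-chunk) is cached first, "a" (P-chunk) fills the remaining slot, "c" (U-chunk) is skipped.
```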

For example, in a backup trace most data chunks are unique at the beginning; later, different sections of the workload may show various degrees of deduplication and re-use. Three sizes mainly matter: the LAW size, the FAA size, and the chunk cache size (the LAW must be at least as large as the FAA-covered range). The size of the LAW:

Too large: the computing overhead becomes large and the extra information in the LAW is wasted. Too small: the scheme degenerates to forward assembly plus an LRU cache.

The key goal: achieve the best trade-off between cache efficiency and (CPU and memory) overhead.

ALACC dynamically adjusts the memory space ratio between the FAA and the chunk cache, and the size of the LAW, instead of using fixed values as thresholds.
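One way such an adjustment loop could look is sketched below. The counters (`faa_hits`, `cache_hits`, `cache_insertions`), thresholds, and step sizes are illustrative assumptions only; the paper's actual rules for when to shift memory and resize the LAW are more elaborate.

```python
def adjust(state, stats, step=1, law_min=8, law_max=64, law_step=4):
    """Illustrative per-interval adjustment of FAA/cache split and LAW size.

    state: {'faa_size', 'cache_size', 'law_size'}  (in container-sized units)
    stats: {'faa_hits', 'cache_hits', 'cache_insertions'} collected over the
           last monitoring interval (counter names are assumptions).
    """
    # Shift memory toward whichever component produced more hits last interval.
    if stats["cache_hits"] > stats["faa_hits"] and state["faa_size"] > step:
        state["faa_size"] -= step       # weak locality: chunk caching is doing the work
        state["cache_size"] += step
    elif stats["faa_hits"] > stats["cache_hits"] and state["cache_size"] > step:
        state["cache_size"] -= step     # strong locality: forward assembly pays off
        state["faa_size"] += step

    # Grow the LAW when look-ahead information keeps feeding useful cache
    # insertions; shrink it when the extra look-ahead is being wasted.
    if stats["cache_insertions"] > stats["faa_hits"] and state["law_size"] < law_max:
        state["law_size"] += law_step
    elif stats["cache_insertions"] == 0 and state["law_size"] > law_min:
        state["law_size"] -= law_step

    # The LAW must always cover at least the FAA.
    state["law_size"] = max(state["law_size"], state["faa_size"])
    return state
```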

Implementation and Evaluation

The prototype is a deduplication system of about 11K lines of C code; a restore recovery log (RRL) is maintained to ensure reliability. The code is not open-source.

Testbed: a Dell PowerEdge server with a 2.40 GHz Intel Xeon (24 cores) and 32 GB of memory. Four deduplication traces are used: FSL_1, FSL_2, EMC_1, EMC_2. Metrics: speed factor (MB/container-read), computing cost factor (seconds/GB), and restore throughput (MB/second).
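The three metrics can be computed from simple counters; here is a small sketch assuming the usual definitions (speed factor = data restored per container read, computing cost factor = computation time per GB restored, throughput = restored data over elapsed time).

```python
def restore_metrics(restored_bytes, container_reads, compute_seconds, elapsed_seconds):
    """Compute the three evaluation metrics from raw counters (assumed definitions)."""
    mb = restored_bytes / 2**20
    gb = restored_bytes / 2**30
    return {
        "speed_factor_MB_per_cread": mb / container_reads,      # restored data per container read
        "computing_cost_s_per_GB": compute_seconds / gb,        # computation time per GB restored
        "restore_throughput_MB_per_s": mb / elapsed_seconds,    # end-to-end restore speed
    }

# Example: restoring 10 GiB with 2,000 container reads, 12 s of cache/LAW
# computation, in 80 s total (numbers are made up for illustration):
print(restore_metrics(10 * 2**30, 2000, 12.0, 80.0))
```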

2. Strength (Contributions of the paper)

  1. This paper comprehensively analyzes the trade-offs of different restore schemes in terms of performance under various workloads and memory configurations.
  2. It proposes ALACC, which dynamically adjusts the sizes of the FAA and the chunk cache to adapt to changes in chunk locality.
  3. It implements an effective look-ahead window whose size is dynamically adjusted to provide essential information for the FAA and the caching policy.

3. Weakness (Limitations of the paper)

  1. I think ALACC could still be improved with a multi-threaded implementation. In addition, the authors later extended ALACC to handle duplicated data chunk rewriting in a FAST'19 follow-up paper.

4. Future Works

When this issue meets secure deduplication algorithms, will it raise other problems?