Venue | Category
---|---
FAST'18 | containers, deduplication
ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching

1. Summary
   - Motivation of this paper
   - Adaptive Look-Ahead Chunk Caching (ALACC)
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Future Works
Requesting a unique or duplicate data chunk may trigger a container read if the data chunk is not currently available in memory. Due to severe data fragmentation and the size mismatch between the requested data and the I/O unit (a container), restore performance is much lower than that of directly reading data that has not been deduplicated.
- how to handle changes in workload locality
- how to use the look-ahead window to improve the cache hit ratio
- how to make a better trade-off between computing overhead and restore performance
| | Container-based Caching | Chunk-based Caching | Forward Assembly |
|---|---|---|---|
| Advantage | Less operating and management overhead | High cache hit ratio (can be improved by using a look-ahead window) | Highly efficient; low operating and management overhead |
| Disadvantage | Relatively higher cache miss ratio, especially when the caching space is limited | Higher operating and management overhead | Workload sensitive; requires good workload locality |
- selecting and storing some duplicated data chunks during the deduplication process
- designing efficient caching policies for the restore process
ALACC = forward assembly + chunk-based caching + LAW (look-ahead window), managed within a limited memory space.
Divide the LAW into two portions: the FAA-covered range (same size as the FAA) and the information for the chunk cache (used by the caching policy).
Restore process:
- check the chunk in container read buffer
- check the chunk in chunk cache
- if that chunk is not in the chunk cache, read the container that holds the data chunk from the storage
- each chunk in this container is checked with LAW to identify all the locations it appears in the FAA.
- decide which chunks in this container should be inserted into the chunk cache according to a caching policy
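The restore steps above can be sketched as a toy loop. This is a minimal sketch with invented names (`recipe`, `read_container`); it keeps only the chunk cache and the container-read accounting, and omits the FAA and the LAW-based filtering that ALACC adds on top:

```python
from collections import OrderedDict

def restore(recipe, read_container, cache_size):
    """Toy restore loop: serve each chunk from the cache if possible,
    otherwise read its whole container and cache the container's chunks.

    recipe: ordered list of (fingerprint, container_id) pairs to restore.
    read_container: function mapping container_id -> {fingerprint: data}.
    cache_size: maximum number of chunks kept in the chunk cache.
    """
    cache = OrderedDict()   # chunk cache in LRU order (oldest first)
    container_reads = 0
    output = []

    for fp, cid in recipe:
        if fp in cache:                   # cache hit
            cache.move_to_end(fp)
            output.append(cache[fp])
            continue
        container = read_container(cid)   # cache miss: one container read
        container_reads += 1
        output.append(container[fp])
        for f, data in container.items():  # insert the container's chunks
            cache[f] = data
            cache.move_to_end(f)
            while len(cache) > cache_size:
                cache.popitem(last=False)  # evict the LRU chunk
    return output, container_reads
```

With ALACC, the blanket insertion of every chunk in the container would instead be filtered by the LAW-driven caching policy described next.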
Caching policy:
- U-chunk (Unused chunk): does not appear anywhere in the current LAW
- P-chunk (Probably used chunk): appears in the current FAA but does not appear in the second portion of the LAW
- F-chunk (Future used chunk): will be used in the second portion of the LAW
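The three-way classification can be written directly from the definitions above. The set-based inputs here are an assumption for illustration (the real system works over fingerprints recorded in the LAW):

```python
def classify_chunk(fp, faa_range, law_second_portion):
    """Classify a chunk fingerprint per the ALACC chunk categories.

    faa_range: fingerprints appearing in the FAA-covered range of the LAW.
    law_second_portion: fingerprints in the rest of the LAW.
    """
    if fp in law_second_portion:
        return "F-chunk"   # will be used within the look-ahead window
    if fp in faa_range:
        return "P-chunk"   # used in the FAA, may recur soon
    return "U-chunk"       # not referenced anywhere in the current LAW
```

F-chunks are the most valuable to cache (a guaranteed future use), P-chunks come next, and U-chunks are the first to be skipped or evicted.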
For example, in a backup trace most data chunks are unique at the beginning; later, different sections of the workload may show various degrees of deduplication and re-use. Three sizes mainly matter: the LAW size, the FAA size, and the chunk cache size (the LAW size must be at least the FAA size, since the LAW covers the FAA range). The size of the LAW:
- Too large: the computing overhead becomes large and the extra information in the LAW is wasted.
- Too small: the scheme degenerates into forward assembly + an LRU cache.
The key goal: to achieve the best trade-off between cache efficiency and (CPU and memory) overhead.
ALACC dynamically adjusts the memory-space ratio between the FAA and the chunk cache, as well as the size of the LAW, instead of using fixed values as thresholds.
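The adaptive idea can be sketched as follows. The thresholds and update rules here are invented for illustration and are not the paper's actual algorithm; all sizes are in chunks, FAA + cache share a fixed memory budget, and the LAW is adjusted separately:

```python
def adjust_sizes(faa_size, cache_size, law_size, cache_hit_ratio, step=64):
    """Toy adaptive resizing: shift memory toward whichever component is
    paying off, and grow/shrink the LAW with the observed cache hit ratio.
    Thresholds (0.5, 0.2) and step size are illustrative assumptions.
    """
    if cache_hit_ratio > 0.5:
        # the chunk cache is effective: give it memory from the FAA,
        # and enlarge the LAW so the policy sees further ahead
        faa_size, cache_size = faa_size - step, cache_size + step
        law_size += step
    elif cache_hit_ratio < 0.2:
        # poor locality for the cache: favor forward assembly instead
        faa_size, cache_size = faa_size + step, cache_size - step
        law_size -= step
    # design invariant: the LAW must cover at least the FAA range
    law_size = max(law_size, faa_size)
    return faa_size, cache_size, law_size
```

The point of the sketch is the feedback loop: measured behavior of the current window drives the partitioning for the next window, rather than a fixed threshold chosen in advance.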
- a deduplication system: an 11k-LoC C program (not open-source)
- a restore recovery log (RRL) is maintained to ensure reliability
- Hardware: a Dell PowerEdge server with a 2.40GHz Intel Xeon (24 cores) and 32GB memory
- Traces: four deduplication traces: FSL_1, FSL_2, EMC_1, EMC_2
- Metrics: speed factor (MB/container-read), computing cost factor (second/GB), restore throughput (MB/second)
When this restore-performance issue meets secure deduplication algorithms, could it raise other problems?