LIPA: A Learning-based Indexing and Prefetching Approach for Data Deduplication

Venue: MSST'19
Category: Deduplication Index


1. Summary

Motivation of this paper

This paper uses a reinforcement learning framework to build an adaptive indexing structure that addresses the chunk-lookup disk bottleneck problem in large-scale data deduplication.

Current methods:

  1. a full chunk index; drawback: too large to fit in RAM at scale.
  2. a sampled chunk index; drawback: the deduplication ratio directly depends on the sampling ratio.

Problem: how to build an efficient fingerprint index that helps identify duplicate data chunks, i.e., how to overcome the chunk-lookup disk bottleneck problem?

Learning-based Indexing and Prefetching Approach (LIPA)

Goal: propose a simple reinforcement learning method to learn how to prefetch a segment dynamically.

LIPA performs trial-and-error prefetching and then receives a delayed reward as the data stream evolves.

An incoming segment may share the same features with previous neighboring segments.

The paper learns this locality relationship and deduplicates each incoming segment against only a few of its previous segments.

By using a reinforcement learning methodology, LIPA aims to identify the most similar segment (the champion) for an incoming segment (temporal locality) and then prefetch several successive segments (the followers) by exploiting spatial locality.

Among the candidate segments, the one with the highest score is chosen as the champion.

The reward value is the count of fingerprint lookup hits in the chosen segment; scores are refreshed with either a one-time update or an incremental update.
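A minimal sketch of how such feature-indexed champion selection and reward updates could be structured is shown below; the class and function names, the feature sampling scheme, and the 0.5 step size are illustrative assumptions rather than the paper's actual implementation.

```python
from collections import defaultdict

def sample_features(fingerprints, k=4):
    """Sample k representative features from a segment's fingerprints
    (min-hash-style sampling; the exact sampling scheme is an assumption)."""
    return tuple(sorted(fingerprints)[:k])

class FeatureIndex:
    """Hypothetical feature table: each feature maps to candidate segment
    ids with a learned score (expected fingerprint lookup hits)."""

    def __init__(self):
        self.table = defaultdict(dict)  # feature -> {segment_id: score}

    def choose_champion(self, features):
        """Pick the candidate segment with the highest score."""
        best_seg, best_score = None, float("-inf")
        for f in features:
            for seg_id, score in self.table[f].items():
                if score > best_score:
                    best_seg, best_score = seg_id, score
        return best_seg

    def reward(self, features, seg_id, hits):
        """Delayed reward: update the champion's score with the number of
        lookup hits it produced (incremental update; step size is assumed)."""
        for f in features:
            old = self.table[f].get(seg_id, 0.0)
            self.table[f][seg_id] = old + 0.5 * (hits - old)

    def insert(self, features, seg_id):
        """Register a newly stored segment under its sampled features."""
        for f in features:
            self.table[f].setdefault(seg_id, 0.0)
```

In this sketch, an incoming segment's sampled features first select a champion via choose_champion, the champion's (and its followers') fingerprints are prefetched, and after deduplication the observed hit count is fed back through reward.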

Implementation and Evaluation

The implementation follows the standard deduplication workflow: chunking, hashing, indexing, and storing.
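As a rough illustration of the first two stages, here is a minimal chunking and fingerprinting sketch; fixed-size chunking is used for brevity, whereas deduplication systems typically rely on content-defined chunking, and the segment length below is an arbitrary assumption.

```python
import hashlib

def chunk_and_hash(data, chunk_size=4096):
    """Split a byte stream into fixed-size chunks and compute a SHA-1
    fingerprint per chunk (fixed-size chunking is a simplification)."""
    return [hashlib.sha1(data[off:off + chunk_size]).hexdigest()
            for off in range(0, len(data), chunk_size)]

def group_into_segments(fingerprints, segment_len=1024):
    """Group consecutive chunk fingerprints into segments, the unit
    that is indexed and prefetched."""
    return [fingerprints[i:i + segment_len]
            for i in range(0, len(fingerprints), segment_len)]
```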

  1. Datasets:
     1. Linux Kernel archival: 155 versions.
     2. Vmdk: pre-made VM disk images from VMware's Virtual Appliance Marketplace.
     3. Fslhome: data of 9 users over 14 days, from Sep. 16th to 30th, 2011.
     4. FSL MacOS.
  2. Evaluation metrics:
     1. Deduplication ratio.
     2. RAM usage: the memory overhead of the fingerprint index lookup.
     3. Data throughput.

2. Strength (Contributions of the paper)

  1. The main contribution of this work is that it proposes a new indexing and prefetching framework for data deduplication based on reinforcement learning.

3. Weakness (Limitations of the paper)

  1. The rationale behind this work is still vague: it should explain why a segment with a higher score (i.e., more lookup hits) should be cached and prefetched. The paper only shows this via experiments, without theoretical analysis.

4. Future Works

  1. This work points out that existing schemes lack an adaptive feedback mechanism to adjust the mapping relationship and reflect the dynamics of incoming data streams.

As noted in an MSST'16 study: even similar users behave quite differently, which should be accounted for in future deduplication systems.

  2. The key reinforcement learning algorithm is based on the K-armed bandit, which is a very general learning framework for prefetching.

  3. This paper mentions that it is important to find a better balance between exploitation (selecting a segment known to be useful) and exploration (selecting a segment whose usefulness is unknown but which might provide a bigger reward and thus expand the known space); see the epsilon-greedy sketch at the end of this section.

  4. This work mentions that LIPA could be strengthened to capture both regular and irregular patterns by using a number of upper-level access hints and file attributes.
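To make the exploitation/exploration balance in point 3 concrete, here is a minimal epsilon-greedy selection sketch over a hypothetical segment score table; epsilon-greedy is one standard K-armed bandit strategy and is used here only as an assumed stand-in for the paper's actual policy.

```python
import random

def epsilon_greedy_choice(candidates, epsilon=0.1):
    """candidates: dict mapping segment_id -> current score.
    With probability epsilon, explore a random candidate segment;
    otherwise exploit the segment with the highest known score."""
    if not candidates:
        return None
    if random.random() < epsilon:
        return random.choice(list(candidates))   # exploration
    return max(candidates, key=candidates.get)   # exploitation
```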