Venue | Category |
---|---|
MSST'19 | Deduplication Index |
LIPA: A Learning-based Indexing and Prefetching Approach for Data Deduplication1. SummaryMotivation of this paperLearning-based Indexing Prefetching Approach (LIPA)Implementation and Evaluation2. Strength (Contributions of the paper)3. Weakness (Limitations of the paper)4. Future Works
This paper uses the reinforcement learning framework to build an adaptive indexing structure.
to solve the chunk-lookup disk bottleneck problem for large-scale
Current methods:
- a full chunk
- a sampled chunk index drawback: hard to fit in RAM and sampled chunk index directly affects the deduplication ratio dependent on the sampling ratio.
Problem: how to build an efficient fingerprint indexing to help identify duplicate data chunks?
chunk-lookup disk bottleneck problem
Goal: propose a simple reinforcement learning method to learn how to prefetch a segment dynamically.
a trial-and-error prefectching and then gives a delayed reward during the data stream evolution.
An incoming segment may share the same feature with previous neighboring segments
This paper trains the locality relationship and deduplicate each incoming segment against only a few of its previous segments.
By using reinforcement learning mythology,
it aims to identify the most similar segment (champion) for an incoming segment (temporal locality) then prefetch several successive segments (followers) by exploiting spatial locality.
choose a segment with the higher score
a reward value: the count of the lookup hits of the segment. once update or incremental update
chunking, hashing, indexing, storing
- Linux Kernel archival: 155 versions
- Vmdk: pre-made VM disk images for VMware's Virtual Appliance Market place.
- Fslhome: 9 users in 14 days from Sep. 16th to 30th in 2011
- FSL MacOS
segment with higher score higher lookup hits should be cached? it just shows this via experiments instead of theoretical analysis
refer from MSST'16: even similar users behave quite differently, which should be accounted for in future deduplication systems.
a very general learning framework for prefetching
exploitation by selecting a segment known to be useful. exploration by selecting a segment whose usefulness is unknown but which might provide a bigger reward and thus expand the known space.