| Venue | Category |
| --- | --- |
| FAST'22 | post-dedup |
DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression

1. Summary
   Motivation of this paper
   DeepSketch
   Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
Motivation
to maximize storage efficiency, existing post-deduplication systems perform delta compression along with deduplication and lossless compression
existing approaches achieve significantly lower data-reduction ratios than optimal because of their limited accuracy in identifying similar data blocks
super-feature (SF) data sketching (compared with brute-force search, which is optimal)
key question: how to find a good reference block that provides a high data-reduction ratio
main idea: leverage the learning-to-hash method to achieve higher accuracy in reference search for delta compression
improve the data-reduction efficiency
the authors envision that DeepSketch's DNN is pre-trained before building or updating a DeepSketch-enabled system
similar to solving a nearest-neighbor search problem
main observation
high false-negative rate (FNR) of previous approaches
SFSketch is highly optimized to identify only very similar data
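For context on the SF baseline, a minimal illustrative sketch of super-feature computation (max-hash features grouped into super-features) might look like the following; the feature/window counts and the `blake2b`-based hashing are assumptions for illustration, not the paper's implementation:

```python
import hashlib

def features(block: bytes, n_features: int = 12, window: int = 8) -> list[int]:
    """One feature per hash transform: the maximum hash value over all
    sliding windows of the block (a common super-feature formulation)."""
    feats = []
    for i in range(n_features):
        # Mixing the transform index into the input emulates N independent hashes.
        best = 0
        for off in range(len(block) - window + 1):
            h = hashlib.blake2b(bytes([i]) + block[off:off + window],
                                digest_size=8).digest()
            best = max(best, int.from_bytes(h, "big"))
        feats.append(best)
    return feats

def super_features(block: bytes, sf_count: int = 3, feats_per_sf: int = 4) -> list[int]:
    """Group the features and hash each group into one super-feature (SF).
    Two blocks are similarity candidates only if some SF matches exactly."""
    feats = features(block, sf_count * feats_per_sf)
    sfs = []
    for g in range(sf_count):
        group = feats[g * feats_per_sf:(g + 1) * feats_per_sf]
        raw = b"".join(f.to_bytes(8, "big") for f in group)
        sfs.append(int.from_bytes(hashlib.blake2b(raw, digest_size=8).digest(), "big"))
    return sfs
```

Because each SF must match exactly, this style of sketching only finds very similar blocks, which is the limitation noted above.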
two challenges of using a DNN in DeepSketch (unlike typical ML applications)
lack of semantic information
high dimensional space
dynamic K-means clustering
addresses how to find appropriate initial parameters for clustering
dynamically refines the value of K without any hints for the initial parameters
complexity
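One way to cluster without fixing K up front, given as a simplified stand-in for the paper's dynamic K-means (the distance threshold and refinement count are assumed parameters), is leader-style seeding followed by standard Lloyd iterations:

```python
import math

def dynamic_kmeans(points, threshold, iters=5):
    """Illustrative 'dynamic' K-means: K grows on demand during a seeding
    pass, then ordinary k-means refinement polishes the centroids."""
    # Pass 1: spawn a new cluster whenever a point is far from all centroids.
    centroids = []
    for p in points:
        if not centroids or min(math.dist(p, c) for c in centroids) > threshold:
            centroids.append(list(p))
    # Pass 2: a few Lloyd iterations to refine the centroids.
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            buckets[j].append(p)
        for j, b in enumerate(buckets):
            if b:  # keep an empty cluster's centroid unchanged
                centroids[j] = [sum(coord) / len(b) for coord in zip(*b)]
    return centroids
```

With two well-separated clumps of points and a threshold between the clump radius and the gap, the seeding pass discovers K = 2 on its own.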
neural-network training
transfer the learned knowledge of the classification model to a hash network model
addresses the issue that data blocks are not uniformly distributed over clusters (training may be biased)
reference selection
the traditional exact-matching-based search is not effective for the learning-to-hash model
two sketch stores
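Because two learning-to-hash codes for similar blocks are close but rarely identical, reference selection needs a nearest-match (smallest Hamming distance) lookup rather than the exact-match lookup used with super-features. A minimal sketch, where the store layout, the `max_dist` threshold, and the linear scan are all assumptions for illustration:

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")

class SketchStore:
    """Maps a sketch (hash code) to a stored block's address; an
    illustrative stand-in for a sketch store."""
    def __init__(self):
        self.entries = {}  # sketch -> block address

    def insert(self, sketch: int, addr) -> None:
        self.entries[sketch] = addr

    def nearest(self, sketch: int, max_dist: int):
        """Return the address whose sketch is closest in Hamming distance,
        or None if nothing lies within max_dist (linear scan for clarity)."""
        best_addr, best_d = None, max_dist + 1
        for s, addr in self.entries.items():
            d = hamming(s, sketch)
            if d < best_d:
                best_addr, best_d = addr, d
        return best_addr

def select_reference(sketch: int, stores, max_dist: int = 8):
    """Query each sketch store in turn (e.g., a learning-to-hash store,
    then a fallback store) and return the first acceptable reference."""
    for store in stores:
        addr = store.nearest(sketch, max_dist)
        if addr is not None:
            return addr
    return None
```

A production index would use multi-index hashing or another sublinear structure instead of the linear scan shown here.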
Implementation
traces:
small traces: eleven block I/O traces
Evaluation
baseline: Finesse
overall data reduction
reference search pattern analysis
combination with existing techniques (Finesse + DeepSketch)
impact of training data-set quality
overhead analysis
performance overhead: DeepSketch, Finesse, and the combination of DeepSketch + Finesse
memory overhead
the first machine learning-based reference search technique for post-deduplication delta compression
addresses the training issues for an extremely high-dimensional data set
DeepSketch is not evaluated on a large dataset
the improvement of DeepSketch is limited
thus, it must target scenarios where data reduction is paramount (e.g., backup systems)
delta compression background
delta compression can achieve a high data-reduction ratio even for non-duplicate blocks, as long as a similar reference block exists
previous approaches can be collectively viewed as locality-sensitive hashing (LSH)
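As a toy illustration of delta compression against a reference block (using Python's `difflib` as the matcher; real systems use specialized delta encoders such as Xdelta):

```python
import difflib

def make_delta(ref: bytes, target: bytes):
    """Encode target as copy/insert operations against ref (toy delta encoder)."""
    ops = []
    sm = difflib.SequenceMatcher(a=ref, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # copy bytes from the reference
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))   # store literal bytes
    return ops

def apply_delta(ref: bytes, ops) -> bytes:
    """Reconstruct the target block from the reference and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, off, length = op
            out += ref[off:off + length]
        else:
            out += op[1]
    return bytes(out)
```

The more similar the reference, the more of the target is covered by cheap `copy` operations instead of stored literals, which is why reference-search accuracy directly drives the data-reduction ratio.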
uncorrectable bit-error rate (UBER)
learning-to-hash method
trains a neural network (NN) to generate a hash value for a given input data block
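A learning-to-hash model typically produces real-valued scores that are binarized (by sign) into hash bits, so that similar inputs land at small Hamming distance. The sketch below substitutes fixed random projections (SimHash-style) for a trained network; this substitution is an assumption for illustration only:

```python
import random

def hash_bits(vec, n_bits: int = 16, seed: int = 42) -> int:
    """Binarize random projections of a feature vector into an n-bit code.
    The seeded random projections stand in for a trained hash network."""
    rng = random.Random(seed)  # fixed seed -> the same 'network' every call
    code = 0
    for _ in range(n_bits):
        w = [rng.uniform(-1.0, 1.0) for _ in vec]
        score = sum(wi * vi for wi, vi in zip(w, vec))
        code = (code << 1) | (1 if score >= 0 else 0)
    return code
```

Slightly perturbed inputs flip few sign bits, so their codes stay close in Hamming distance; that property is what makes nearest-match reference search work on these codes.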
discussion of the sketch-index memory overhead