| Venue | Category |
|---|---|
| FAST'15 | Data Deduplication |
Design Tradeoffs for Data Deduplication Performance in Backup Workloads

1. Summary
   - Motivation of this paper
   - DeFrame
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
Motivation
To understand the fundamental tradeoffs behind each design choice, the authors map inline deduplication onto a parameter space of design parameters; each point in the space is one combination of parameters, trading off backup/restore performance, memory footprint, and storage cost.
Goal
Present DeFrame, a general-purpose deduplication framework in which existing and potential fingerprint-index designs can be instantiated and compared.
Inline data deduplication space
Fingerprint index
a well-recognized performance bottleneck in large-scale deduplication systems
putting all fingerprints in DRAM is not cost-efficient, so most designs keep the full index on disk and cache only a fraction of the fingerprints in memory
Two submodules: a key-value store and a fingerprint prefetching/caching module (sketched below).
Classification by deduplication accuracy: exact deduplication (ED, every duplicate chunk is eliminated) vs. near-exact deduplication (ND, a few duplicates are tolerated for a lower memory footprint).
Classification by prefetching policy: exploiting physical locality (PL, the on-disk container layout) vs. logical locality (LL, the fingerprint order of the backup stream).
Exact deduplication + prefetching (EDPL/EDLL): prefetching and caching avoid a large fraction of lookup requests to the key-value store.
However, the fragmentation problem reduces the efficiency of fingerprint prefetching and caching as backups age.
Near-exact deduplication + sampling (NDPL/NDLL): only sampled fingerprints are indexed, trading some deduplication ratio for a much smaller memory footprint.
Rewriting algorithms counter fragmentation by selectively rewriting duplicate chunks, trading storage cost for restore performance.
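A minimal sketch of how the two submodules fit together (class and method names are hypothetical; the key-value store is an in-memory dict here, whereas a real EDPL/NDPL index keeps it on disk):

```python
from collections import OrderedDict

class FingerprintIndex:
    """Sketch: a key-value store (fingerprint -> container ID) plus an LRU
    prefetch cache that exploits physical locality."""

    def __init__(self, cache_capacity=4096, sample_rate=1):
        self.kv = {}                        # key-value store submodule
        self.cache = OrderedDict()          # prefetching/caching submodule
        self.cache_capacity = cache_capacity
        self.sample_rate = sample_rate      # >1 => near-exact deduplication

    def lookup(self, fp, container_fps):
        """container_fps(cid) must return all fingerprints in container cid."""
        if fp in self.cache:                # cache hit: no key-value store lookup
            self.cache.move_to_end(fp)
            return self.cache[fp]
        cid = self.kv.get(fp)               # cache miss: consult the key-value store
        if cid is not None:
            for f in container_fps(cid):    # prefetch the whole container's
                self._put(f, cid)           # fingerprints (physical locality)
        return cid

    def insert(self, fp, cid):
        # Exact dedup indexes every fingerprint; near-exact dedup samples,
        # keeping the key-value store (and the memory footprint) small.
        if int.from_bytes(fp[:4], "big") % self.sample_rate == 0:
            self.kv[fp] = cid
        self._put(fp, cid)

    def _put(self, fp, cid):
        self.cache[fp] = cid
        self.cache.move_to_end(fp)
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict least-recently-used entry
```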
DeFrame Architecture
Container store: each container consists of a metadata section (the fingerprints of its chunks) and a data section (the chunks themselves).
Recipe store: a recipe records a backup's fingerprint sequence together with the associated container IDs, so that restore never needs to consult the fingerprint index; indicators of segment boundaries are also added for logical-locality prefetching.
Fingerprint index: the key-value store plus the prefetching/caching module described above.
Backup pipeline: Chunk, Hash, Dedup, Rewrite, Filter, and Append phases (a simplified sketch follows after this list).
Restore pipeline: reads the recipe, fetches the referenced containers, and reconstructs the backup stream.
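An end-to-end sketch of both pipelines under heavy simplification (fixed-size chunking instead of content-defined chunking, one chunk per container, a plain dict as the index, no Rewrite phase; all names are hypothetical):

```python
import hashlib

def backup(stream, chunk_size, index, containers):
    """Backup path: Chunk -> Hash -> Dedup -> Filter -> Append."""
    recipe = []
    for off in range(0, len(stream), chunk_size):   # Chunk phase
        chunk = stream[off:off + chunk_size]
        fp = hashlib.sha1(chunk).digest()           # Hash phase
        cid = index.get(fp)                         # Dedup phase: index lookup
        if cid is None:                             # Filter phase: unique chunks only
            cid = len(containers)
            containers.append({fp: chunk})          # Append phase: write a container
            index[fp] = cid
        recipe.append((fp, cid))                    # recipe keeps fp + container ID
    return recipe

def restore(recipe, containers):
    """Restore path: read the recipe, then fetch each chunk from its
    container without consulting the fingerprint index."""
    return b"".join(containers[cid][fp] for fp, cid in recipe)

# Usage: a second backup of identical data adds no new containers.
index, containers = {}, []
recipe1 = backup(b"abcd" * 4096, 4096, index, containers)
recipe2 = backup(b"abcd" * 4096, 4096, index, containers)
assert restore(recipe2, containers) == b"abcd" * 4096
assert len(containers) == 1
```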
Implementation
Metrics: deduplication ratio, memory footprint, storage cost, lookup/update requests per GB, and restore speed.
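For reference, the first metric is conventionally defined as (standard definition, not spelled out in the note):

$$\text{deduplication ratio} = \frac{\text{logical size of the backup stream}}{\text{physical size of the data actually stored}}$$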
Findings
Fragmentation results in an ever-increasing lookup overhead for EDPL, because the physical locality of older backups degrades with every new backup.
Self-references (duplicates within the same backup) must be considered.
All fingerprints are updated with their new segment IDs in the key-value store, so lookups always follow the most recent logical locality (sketched below).
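A sketch of that update step (assuming the same hypothetical `kv` dict as above, here mapping fingerprint -> segment ID):

```python
def update_segment_ids(kv, segment_id, segment_fps):
    # After deduplicating a segment, remap every fingerprint seen in it
    # (duplicates included) to the new segment ID, so that future lookups
    # prefetch along the most recent backup's logical locality.
    for fp in segment_fps:
        kv[fp] = segment_id
```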
For the lowest storage cost: EDLL is preferred (highest deduplication ratio, sustained high backup performance).
For a low memory footprint: ND is preferred; NDPL for its simplicity, NDLL for its better deduplication ratio.
For sustained high restore performance: EDPL + rewriting.