| Venue | Category |
| --- | --- |
| FAST'12 | Deduplication System |
# iDedup: Latency-aware, Inline Data Deduplication for Primary Storage

Contents:

1. Summary
   - Motivation of this paper
   - iDedup
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)

## 1. Summary

### Motivation of this paper
Deduplication is typically avoided in primary storage due to the associated latency costs. Prior research has not applied deduplication techniques inline on the request path for latency-sensitive primary workloads.
- Inline deduplication: adds work to the write path, increasing write latency.
- Offline deduplication: waits for system idle time to deduplicate, avoiding write-path latency, but has its own disadvantages (e.g., duplicates consume capacity until the background pass runs).
- Reads remain fragmented in both.
The authors observe two forms of locality in current primary workloads:

- spatial locality: duplicated data tends to occur in runs of consecutive blocks.
- temporal locality: duplicated blocks tend to be written close together in time.
Key question: how to trade off capacity savings against deduplication performance?
### iDedup

To exploit spatial locality, iDedup examines blocks at write time and deduplicates only sequences of consecutive duplicate blocks at least as long as a configurable minimum sequence length. This threshold controls the tradeoff between capacity savings and performance.
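A minimal sketch (Python; all names hypothetical) of this spatial-locality heuristic: during a write, consecutive blocks whose fingerprints hit the dedup metadata form a candidate duplicate run, and only runs of at least `threshold` blocks are deduplicated. The real system additionally checks that the matched DBNs are themselves sequential on disk, which this sketch omits.

```python
import hashlib

def fp(block: bytes) -> str:
    """Content fingerprint of a block (a cryptographic hash of its data)."""
    return hashlib.sha256(block).hexdigest()

def dedup_write(blocks, fingerprint_cache, threshold):
    """For each block of a write, decide ('dedup', dbn) or ('write', block).

    Only runs of >= threshold consecutive duplicate blocks are deduplicated;
    shorter runs are written normally so later reads stay sequential.
    """
    actions = [None] * len(blocks)

    def flush(run):
        if len(run) >= threshold:      # long run: share the existing on-disk blocks
            for i in run:
                actions[i] = ('dedup', fingerprint_cache[fp(blocks[i])])
        else:                          # short run: write normally
            for i in run:
                actions[i] = ('write', blocks[i])

    run = []                           # indices of the current duplicate run
    for i, block in enumerate(blocks):
        if fp(block) in fingerprint_cache:
            run.append(i)
        else:
            flush(run)
            run = []
            actions[i] = ('write', block)
    flush(run)
    return actions
```

With `threshold = 3`, a run of three cached blocks is deduplicated; with `threshold = 4`, the same run is written out in full.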
In memory: to exploit temporal locality, the dedup metadata is a completely memory-resident LRU cache. Its size is a tradeoff between performance (hit rate) and capacity savings (a larger dedup-metadata cache detects more duplicates).
The cache maps the fingerprint of a block to the block's disk block number (DBN), managing (fingerprint, DBN) entries with an LRU policy.
- Dedup-metadata cache: a pool of block entries (content-nodes).
- Fingerprint hash table: maps a fingerprint to its DBN.
- DBN hash table: maps a DBN to its content-node.
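A toy sketch (Python; names hypothetical) of the structure above: one pool of content-nodes indexed by both a fingerprint table and a DBN table, with LRU eviction keeping the whole thing memory-resident. The `invalidate_dbn` path is an assumption about how stale mappings would be dropped when a block is overwritten.

```python
from collections import OrderedDict

class DedupMetadataCache:
    """Memory-resident dedup metadata: content-nodes indexed two ways."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.by_fp = OrderedDict()   # fingerprint -> content-node, in LRU order
        self.by_dbn = {}             # DBN -> content-node (same node objects)

    def lookup(self, fingerprint):
        node = self.by_fp.get(fingerprint)
        if node is not None:
            self.by_fp.move_to_end(fingerprint)   # refresh LRU position
        return node

    def insert(self, fingerprint, dbn):
        if fingerprint in self.by_fp:
            self.by_fp.move_to_end(fingerprint)
            return
        if len(self.by_fp) >= self.capacity:      # evict least-recently used
            _, old_node = self.by_fp.popitem(last=False)
            del self.by_dbn[old_node['dbn']]
        node = {'fp': fingerprint, 'dbn': dbn}
        self.by_fp[fingerprint] = node
        self.by_dbn[dbn] = node

    def invalidate_dbn(self, dbn):
        # When a block is overwritten, its old fingerprint mapping is stale.
        node = self.by_dbn.pop(dbn, None)
        if node is not None:
            del self.by_fp[node['fp']]
```

The DBN hash table exists precisely so that an overwrite, which is keyed by DBN rather than by content, can find and remove the matching content-node.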
On disk
Reference count file: maintains reference counts of deduplicated file system blocks in a file.
Refcount updates are often collocated in the same disk blocks, thereby amortizing I/Os to the refcount file.
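The amortization follows from packing many counters into each file-system block, so updates for nearby DBNs land in the same refcount-file block and can share one I/O. A small sketch of the arithmetic; the sizes here are assumptions, not taken from the paper:

```python
BLOCK_SIZE = 4096      # assumed file-system block size, in bytes
REFCOUNT_SIZE = 4      # assumed bytes per reference counter
COUNTERS_PER_BLOCK = BLOCK_SIZE // REFCOUNT_SIZE   # 1024 counters per block

def refcount_block(dbn):
    """Which block of the refcount file holds the counter for this DBN."""
    return dbn // COUNTERS_PER_BLOCK

# A run of 64 sequential duplicate blocks touches a single refcount-file block,
# so all 64 counter updates can be absorbed by one block I/O.
assert {refcount_block(d) for d in range(2048, 2048 + 64)} == {2}
```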
Two tunable parameters:

- the minimum duplicate sequence threshold
- the in-memory dedup-metadata cache size
### Implementation and Evaluation

Two comparisons: against a baseline with deduplication disabled (for latency and CPU overhead), and against exhaustive deduplication with threshold = 1 (for capacity savings).
## 2. Strength (Contributions of the paper)

## 3. Weakness (Limitations of the paper)

Deduplication can convert sequential reads from the application into random reads from storage.
## 4. Some Insights (Future work)

The best threshold depends on workload properties; an open question is how to enable the system to make this tradeoff automatically.