| Venue | Category |
|---|---|
| ToS'18 | Distributed Deduplication |
Cluster and Single-Node Analysis of Long-Term Deduplication Patterns

1. Summary
   - Motivation of this paper
   - Cluster and Single-node analysis
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
Motivation
Most past studies either analyzed a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time.
By understanding such datasets' characteristics, one can design more efficient storage systems.
Single-node deduplication vs. Cluster deduplication
Stateless vs. stateful
How to simulate the incremental backup?
Need to detect newly added and modified files.
By comparing two consecutive snapshots, one can identify whether a file was newly added or modified by checking its mtime (modification time).
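The mtime-based detection above can be sketched as follows. This is a minimal illustration, assuming each snapshot has been summarized as a dict mapping file path to mtime (the function name and data layout are my own, not from the paper):

```python
def detect_changes(prev_snapshot, curr_snapshot):
    """Compare two consecutive snapshots (path -> mtime dicts) and
    classify files as newly added or modified for incremental backup."""
    # A path absent from the previous snapshot is a newly added file.
    added = [p for p in curr_snapshot if p not in prev_snapshot]
    # A path present in both snapshots with a changed mtime was modified.
    modified = [p for p, mtime in curr_snapshot.items()
                if p in prev_snapshot and mtime != prev_snapshot[p]]
    return added, modified
```

For example, comparing `{"/a": 1, "/b": 2}` with `{"/a": 1, "/b": 3, "/c": 4}` classifies `/c` as added and `/b` as modified.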
Following the work of FAST'12, the deduplication ratio adjusted for metadata overhead is L / (P + f·(L + P)), where f is the size of each chunk's metadata divided by the chunk size, L is the size before deduplication, and P is the raw data size after deduplication; f·L approximates the size of the files' recipes, and f·P the size of the hash index.
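The accounting above can be sketched numerically. The default chunk and metadata-entry sizes here are illustrative assumptions, not values from the paper:

```python
def dedup_ratio_with_metadata(logical_bytes, physical_bytes,
                              chunk_size=8192, per_chunk_metadata=30):
    """Deduplication ratio adjusted for metadata overhead:
    L / (P + f*(L + P)), where f = per-chunk metadata / chunk size.
    f*L approximates the file recipes, f*P the hash index."""
    f = per_chunk_metadata / chunk_size
    return logical_bytes / (physical_bytes + f * (logical_bytes + physical_bytes))
```

For instance, with a raw 10:1 ratio (L = 100, P = 10), the metadata terms pull the effective ratio down to roughly 9.6:1.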
The benefit of larger chunk sizes
Larger chunk sizes reduce the number of metadata entries; reducing the amount of metadata also reduces the number of I/Os to the chunk index.
Chunk popularity
The skew in chunk popularity has also been found in primary storage and HPC systems. Identifying such popular chunks would be useful in optimizing performance:
accelerate chunk indexing and improve cache hit ratios
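One way popularity can be exploited is to pin the most-referenced fingerprints in memory so lookups for highly duplicated chunks avoid on-disk index I/O. A minimal sketch, with class name and sizing of my own invention:

```python
from collections import Counter

class PopularChunkCache:
    """Frequency-aware fingerprint cache: keep the most popular chunk
    fingerprints in memory to improve hit ratios for skewed workloads."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.counts = Counter()   # reference count per fingerprint
        self.hot = set()          # in-memory set of popular fingerprints

    def lookup(self, fingerprint):
        """Return True on a cache hit; on a miss, the caller would fall
        through to the on-disk chunk index (not modeled here)."""
        self.counts[fingerprint] += 1
        hit = fingerprint in self.hot
        if not hit:
            # Refresh the hot set with the currently most popular chunks.
            self.hot = {fp for fp, _ in self.counts.most_common(self.capacity)}
        return hit
```

Because popularity is heavily skewed, even a small hot set can absorb a large fraction of lookups.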
Analysis of groups of users
Analysis of cluster deduplication
Two challenges:
different design principles: deduplication ratio, load distribution, and throughput
Chunk-level routing
Chunk-level routing is required to achieve exact deduplication, but it results in more CPU and memory consumption.
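A stateless chunk-level router can be sketched as content-based hashing: because the target node depends only on the chunk's fingerprint, all duplicates of a chunk land on the same node, giving exact deduplication without any inter-node state (the function below is an illustrative sketch, not the paper's algorithm):

```python
import hashlib

def route_chunk(chunk_data, num_nodes):
    """Stateless chunk-level routing: hash the chunk's content and map
    the fingerprint to a storage node. Identical chunks always route to
    the same node, so cross-node duplicates cannot occur."""
    fingerprint = hashlib.sha1(chunk_data).digest()
    # Use a fixed-width prefix of the fingerprint as the routing key.
    return int.from_bytes(fingerprint[:8], "big") % num_nodes
```

The cost is that every chunk must be fingerprinted and routed individually, which is the extra CPU and memory overhead noted above.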
Key Metrics for cluster-deduplication
Load distribution
Physical load distributions: the capacity usage at the nodes
Logical load distributions: I/O performance
Load balance is measured using the coefficient of variation (CV = standard deviation / mean) of the per-node loads; a lower CV means a more even distribution.
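The CV metric is straightforward to compute; a minimal sketch, where `node_loads` could be either bytes stored per node (physical) or bytes served per node (logical):

```python
import statistics

def coefficient_of_variation(node_loads):
    """CV = population standard deviation / mean of per-node loads.
    0.0 means perfect balance; larger values mean more skew."""
    mean = statistics.mean(node_loads)
    return statistics.pstdev(node_loads) / mean
```

For example, a perfectly balanced cluster `[10, 10, 10]` has CV 0, while `[30, 0, 0]` has CV ≈ 1.41.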
Stateless algorithms lead to high data skew in terms of the logical load distribution, which is the opposite of their performance in terms of the physical load distribution.
The stateful approach incurs the most communication overhead, since a client needs to query the similarity index of every storage node. Using a master node may decrease the communication overhead, but the master must then store all the Bloom filters and may become a bottleneck as the system scales up.
Stateless routing causes lower communication overhead, since the client can choose the destination node without exchanging any messages.
The paper provides a detailed comparative analysis.
Future work: further study of the impact of chunk sizes.