Rangoli: Space management in deduplication environments

Venue: SYSTOR'13
Category: Space Management


1. Summary

Motivation of this paper

In a deduplicated environment, it is hard to find an optimal set of files for space reclamation. Space reclamation in a non-deduped environment is simpler: deleting or migrating a file is guaranteed to change the used space of the volume by an amount equal to the logical size of the file. Deduplication breaks this guarantee, because a file's blocks may be shared with files that remain on the volume.

This work proposes Rangoli, a fast and efficient tool that identifies an optimal set of files for space reclamation in a deduped environment. There are two possible dimensions for selecting such files:

  1. Source centric: select groups of files at the source that have a high degree of disk sharing.

Migrating them together to the new destination preserves storage efficiency.

  2. Destination aware: pick files at the source that potentially have maximum duplicate data at some destination volume.

This paper considers only the source-centric dimension; the approach is destination-agnostic.

Rangoli

Rangoli seeks to partition the dataset such that most of the data sharing is between files within the same partition, while files across partitions have little or no data sharing. Four metrics characterize a candidate partition:

  1. Space Reclamation (SR): the reduction in the total used physical space of the source volume (higher is better)
  2. Cost of Migration (CM): the number of blocks transmitted over the network (lower is better)
  3. Migration Utility (MU): the ratio of space reclaimed to migration cost, MU = SR / CM (higher is better)
  4. Physical space bloat (PSB): the ratio of the increase in the physical space consumption of the dataset to its original space consumption (lower is better)
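As an illustration (made-up numbers, not from the paper): if migrating a bin frees 80 GB of physical space at the source (SR = 80 GB) while transmitting 100 GB over the network (CM = 100 GB), then MU = 80/100 = 0.8; a bin whose blocks are entirely exclusive to it would reach MU = 1.

Rangoli proceeds in three steps: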
  1. Step 1: FPDB processing: process the fingerprint database (FPDB) to compute the extent of data sharing across files and represent it as a bipartite graph.

The FPDB stores records of the form <fp, block len, inode>, with one fingerprint record for every logical block of a file; duplicate blocks therefore produce multiple records with the same fp, so Rangoli can compute the sharing graph via a single traversal of the FPDB.
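As a rough illustration of this step, the sketch below groups FPDB records by fingerprint. It is a minimal sketch under assumptions (records available as (fp, block_len, inode) tuples, hypothetical function and variable names), not the paper's implementation.

```python
from collections import defaultdict

# Minimal sketch of Step 1 under assumed names; NOT the paper's code.
def build_sharing_graph(fpdb_records):
    """Traverse FPDB records and derive the file/block sharing structure."""
    fp_to_inodes = defaultdict(set)    # fingerprint -> inodes referencing it
    logical_size = defaultdict(int)    # inode -> total logical bytes
    for fp, block_len, inode in fpdb_records:
        fp_to_inodes[fp].add(inode)    # an edge of the bipartite graph
        logical_size[inode] += block_len
    # Physical blocks shared across files are fingerprints with >1 inode.
    shared_fps = {fp for fp, inodes in fp_to_inodes.items() if len(inodes) > 1}
    return fp_to_inodes, logical_size, shared_fps
```

From fp_to_inodes one can read off, for any pair of files, how many physical blocks they share, which is the kind of edge weight the binning step needs.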

  2. Step 2: Migration binning: partition the graph to obtain migration bins.

The target space reclamation is specified as a fraction of the volume's space, and the graph is partitioned so that each migration bin is approximately equal in size.
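The paper's partitioning algorithm is not reproduced in these notes; the following greedy sketch (hypothetical names, only an assumption about how one could bin, not the paper's method) illustrates the goal: keep sharing inside bins while keeping every bin near the target size.

```python
# A greedy binning sketch, NOT the paper's algorithm: each file goes to the
# bin it shares the most fingerprints with, under a soft per-bin size cap
# of roughly 1/N of the dataset's unique blocks.
def greedy_migration_bins(file_fps, num_bins):
    """file_fps: dict of inode -> set of block fingerprints."""
    total_blocks = sum(len(fps) for fps in file_fps.values())
    cap = 1.1 * total_blocks / num_bins          # ~10% slack per bin
    bins = [{"inodes": set(), "fps": set()} for _ in range(num_bins)]
    # Place larger files first so groups of sharing files cluster early.
    for inode, fps in sorted(file_fps.items(), key=lambda kv: -len(kv[1])):
        open_bins = [b for b in bins if len(b["fps"]) + len(fps) <= cap] or bins
        best = max(open_bins, key=lambda b: len(b["fps"] & fps))
        best["inodes"].add(inode)
        best["fps"] |= fps
    return bins
```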

  3. Step 3: Qualification of migration bins: compute the metrics for each migration bin and choose the best among them.
  1. Logical size of a bin: the total logical size of the files in the bin
  2. Internal sharing of a bin: the extent of data sharing within the bin
  3. Sharing across of a bin: the extent of data sharing of the bin with the remainder of the dataset
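A sketch of how the SR/CM/MU/PSB metrics could be computed from fingerprint sets; the helper below is hypothetical, and it assumes CM counts the bin's unique physical blocks (i.e., a dedup-aware transfer), which may not match the paper's accounting.

```python
# Sketch of Step 3 qualification under the assumptions stated above;
# NOT the paper's exact formulas.
def qualify_bin(bin_fps, rest_fps):
    """bin_fps / rest_fps: fingerprint sets of the bin and the remainder."""
    exclusive = bin_fps - rest_fps        # blocks used only inside the bin
    shared_across = bin_fps & rest_fps    # blocks shared with the remainder
    sr = len(exclusive)                   # freed at the source after migration
    cm = len(bin_fps)                     # unique physical blocks to transmit
    mu = sr / cm if cm else 0.0           # migration utility (higher is better)
    # Shared-across blocks end up stored twice (source and destination),
    # so they are the increase in the dataset's physical space consumption.
    psb = len(shared_across) / max(len(bin_fps | rest_fps), 1)
    return {"SR": sr, "CM": cm, "MU": mu, "PSB": psb}
```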

Implementation and Evaluation

Rangoli is evaluated on four datasets (Debian, HomeDir, VMDK, EngWeb) along four aspects:

  1. flexibility
  2. impact on space consumption
  3. network costs
  4. scalability

2. Strength (Contributions of the paper)

  1. a novel solution for space reclamation in deduped environments

It is fast and scalable, and it is tested on real-world datasets.

  2. a deterministic solution that reports the exact metrics before the actual migration

One can find the exact space reclamation and the associated penalties (e.g., network cost, physical space consumption) before migrating.

  3. an investigation of how to find optimal sets of files for space reclamation

The proposed approach is better than alternatives based on MinHash.

3. Weakness (Limitations of the paper)

  1. From my perspective, this algorithm fits only NetApp's deduplication system, because it depends on the FPDB's specific design.

4. Some Insights (Future work)

  1. We could consider similarity-indicative hashes, such as MinHash (minimum hash), to represent the similarity of the data.
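As a concrete starting point for this idea, here is a minimal MinHash sketch in Python; it is an illustration of the future-work direction, not something from the paper, and all names are hypothetical.

```python
import hashlib

# Minimal MinHash sketch for the future-work idea above; NOT part of Rangoli.
def minhash_signature(fingerprints, k=16):
    """k minimum hash values over a file's (non-empty) block-fingerprint set."""
    signature = []
    for seed in range(k):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{fp}".encode()).digest()[:8], "big")
            for fp in fingerprints
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two files with similar block sets would then agree in most signature positions, so candidate groups for migration could be found without comparing full fingerprint lists.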