Venue | Category |
---|---|
IWQoS'15 | Distributed Deduplication |
Even Data Placement for Load Balance in Reliable Distributed Deduplication Storage Systems1. SummaryMotivation of this paperEDPImplementation and Evaluation2. Strength (Contributions of the paper)3. Weakness (Limitations of the paper)4. Some Insights (Future work)
This paper considers the scenario of distributed deduplication
It targets the problem of load balance in the above scenario
Motivation Conventional data placement cannot simultaneously achieve both read balance and storage balance objectives.
the trade-off between read balance and storage balance.
Solve this problem by formulates a combinatorial optimization problem..
Even Data Placement (EDP) algorithm
file chunks may be clustered in a small number of nodes. poor read performance.
padding: when use variable-size chunking. (only store the non-padded chunks)
with deduplication: inherently leads to uneven data distribution. breaking the connection between read balance and storage balance. (parallel I/O accesses)
minimize the sum of the read balance gaps for all files. has a huge solution space for this optimization problem.
introducing a weight for each node .
Main Algorithm (based on greedy algorithm)
Swap: attempts to swap the chunk positions of different file pairs to see if the summation of the read balance gaps can be further reduced.
Some extensions
System Architecture
Metadata server runs the data placement policy, then responds to the client with the list of unique chunks and how they placed across nodes. Clients then applies erasure coding to the unique chunks, and writes the encoded chunks to different nodes in parallel. Metadata server: maintain metadata in key-value databases and implement them using Kyoto Cabinet library.
Evaluation
Simulation experiments
varying I/O bandwidth across nodes
extend this optimization problem to heterogeneous environments. EDP algorithm, greedy algorithm.
combining deduplication and erasure coding extensive simulation and testbed experiments under real-world workloads.
near-optimal?