Even Data Placement for Load Balance in Reliable Distributed Deduplication Storage Systems

VenueCategory
IWQoS'15Distributed Deduplication

Even Data Placement for Load Balance in Reliable Distributed Deduplication Storage Systems1. SummaryMotivation of this paperEDPImplementation and Evaluation2. Strength (Contributions of the paper)3. Weakness (Limitations of the paper)4. Some Insights (Future work)

1. Summary

Motivation of this paper

the trade-off between read balance and storage balance.

Solve this problem by formulates a combinatorial optimization problem..

Even Data Placement (EDP) algorithm

EDP

file chunks may be clustered in a small number of nodes. poor read performance.

  1. apply deduplication to remove duplicate chunks
  2. divide data into non-overlapping groups of unique chunks (as a stripe), then encode the data chunks to form additional parity chunks.

padding: when use variable-size chunking. (only store the non-padded chunks)

  1. round-robin fashion: the number of storage nodes is equal to the erasure coding stripe size.

with deduplication: inherently leads to uneven data distribution. breaking the connection between read balance and storage balance. (parallel I/O accesses)

image-20191122113917775

  1. Per-batch basis Only consider the how to place the unique chunks to maximally achieve balanced read.

minimize the sum of the read balance gaps for all files. has a huge solution space for this optimization problem.

  1. Heterogeneity Awareness Modern distributed storage systems are often composed of heterogeneous nodes , so the read latencies of data chunks differ across nodes.

introducing a weight for each node .

image-20191123134134016

Metadata server runs the data placement policy, then responds to the client with the list of unique chunks and how they placed across nodes. Clients then applies erasure coding to the unique chunks, and writes the encoded chunks to different nodes in parallel. Metadata server: maintain metadata in key-value databases and implement them using Kyoto Cabinet library.

Implementation and Evaluation

  1. Effectiveness of EDP
  2. Impact of erasure coding
  3. Impact of heterogeneity

varying I/O bandwidth across nodes

  1. the normalized read latency
  2. Impact of chunking schemes: fixed-size and variable-size chunking schemes
  3. Read latency distribution:
  4. Computational overhead: varying the number of files in each batch.

2. Strength (Contributions of the paper)

  1. formulating an optimization problem for read balance and storage balance.

extend this optimization problem to heterogeneous environments. EDP algorithm, greedy algorithm.

  1. implement a distributed storage system prototype

combining deduplication and erasure coding extensive simulation and testbed experiments under real-world workloads.

3. Weakness (Limitations of the paper)

  1. The overhead of this algorithm is high. Although EDP is in polynomial time, it needs to buffer each batch of unique chunks and wait the schedule result. I deem it would harm the performance dramatically.

near-optimal?

4. Some Insights (Future work)

  1. this paper assumes that the network transmission is the bottleneck, so it needs to leverage the I/O parallel to guarantee the performance. If this assumption cannot keep, this problem may because unnecessary.
  2. A trade-off between read balance and storage balance.