Reconsidering Single Failure Recovery in Clustered File Systems

@DSN'16 @Single Failure


Summary

Motivation of this paper:

  1. Existing studies on single failure recovery neglect the bandwidth diversity property of the CFS architecture (intra-rack versus cross-rack).
  2. Many single failure recovery solutions focus on XOR-based erasure codes, which are not commonly used for maintaining fault tolerance in a CFS.
  3. Existing single failure recovery solutions focus on minimizing the amount of repair traffic, but most of them do not consider load balancing during the recovery operation. To this end, this paper aims to reduce and balance the amount of cross-rack repair traffic when recovering from a single failure.

Cross-rack-aware Recovery (CAR) Three key techniques:

Due to the linearity, suppose the first requested chunks are stored in the same rack, so it can specify a node in that rack to perform the linear operations based on the
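The linearity argument can be illustrated with a small sketch. It assumes byte-wise arithmetic in GF(2^8) with the reduction polynomial 0x11d; the coefficients, chunk contents, and rack names are hypothetical and only serve to show that per-rack partial results combine into the same chunk as a centralized decode.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8); 0x11d is an assumed (common) reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result


def combine(pairs):
    """Byte-wise XOR-sum of coefficient * chunk over (coefficient, chunk) pairs."""
    out = bytearray(len(pairs[0][1]))
    for coef, chunk in pairs:
        for i, byte in enumerate(chunk):
            out[i] ^= gf_mul(coef, byte)
    return bytes(out)


# Hypothetical layout: rack -> list of (decoding coefficient, surviving chunk).
racks = {
    "rack1": [(3, b"\x10\x20"), (7, b"\x01\x02")],
    "rack2": [(5, b"\xaa\xbb"), (9, b"\x0f\xf0")],
}

# One node per rack aggregates its local chunks, so each rack sends
# a single partially decoded chunk across racks instead of two raw chunks.
partials = [combine(pairs) for pairs in racks.values()]
recovered = combine([(1, partial) for partial in partials])

# Reference: ship every chunk to the replacement node and decode there.
direct = combine([pair for pairs in racks.values() for pair in pairs])
```

Here `recovered` and `direct` are identical, while the cross-rack traffic drops from four chunks to two.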

The paper defines the load balancing rate of the cross-rack repair traffic as follows:

The ratio of the maximum amount of cross-rack repair traffic on any rack to the average amount of cross-rack repair traffic over the intact racks. It then formulates the recovery as an optimization problem.

Goal: minimize the load balancing rate, subject to the condition that the total amount of cross-rack repair traffic is minimized.
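As a minimal sketch of the load balancing rate defined above, the function below takes the cross-rack repair traffic of each intact rack and returns the max-to-average ratio. The function name and the example traffic values are illustrative assumptions, not taken from the paper.

```python
def load_balancing_rate(traffic_per_rack):
    """Ratio of the maximum to the average cross-rack repair traffic over the intact racks."""
    if not traffic_per_rack:
        raise ValueError("need at least one intact rack")
    average = sum(traffic_per_rack) / len(traffic_per_rack)
    if average == 0:
        raise ValueError("no cross-rack repair traffic to balance")
    return max(traffic_per_rack) / average


# Hypothetical chunk counts sent by three intact racks.
balanced = load_balancing_rate([4, 4, 4])  # every rack sends the same amount
skewed = load_balancing_rate([8, 2, 2])    # one rack carries most of the load
```

A rate of 1 means the traffic is perfectly balanced; larger values mean that one rack is a hotspot even though the total traffic is unchanged.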

The main idea is to replace the currently selected multi-stripe recovery solution with another one that yields a smaller load balancing rate.
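The replacement idea can be sketched as a simple local search. This is a simplified illustration rather than the paper's exact algorithm: it assumes each stripe comes with a list of candidate recovery solutions, each given as a per-rack traffic vector with the same (minimal) total, and all names and numbers are hypothetical.

```python
def rate(loads):
    """Max-to-average ratio of the per-rack cross-rack traffic."""
    return max(loads) / (sum(loads) / len(loads))


def total_loads(selection, num_racks):
    """Sum the per-rack traffic of the solutions chosen for all stripes."""
    loads = [0] * num_racks
    for solution in selection:
        for rack, chunks in enumerate(solution):
            loads[rack] += chunks
    return loads


def balance(candidates, num_racks, max_rounds=100):
    """Swap one stripe's solution at a time whenever the swap lowers the rate."""
    selection = [options[0] for options in candidates]
    best = rate(total_loads(selection, num_racks))
    for _ in range(max_rounds):
        improved = False
        for s, options in enumerate(candidates):
            for option in options:
                trial = selection[:s] + [option] + selection[s + 1:]
                trial_rate = rate(total_loads(trial, num_racks))
                if trial_rate < best:
                    selection, best, improved = trial, trial_rate, True
        if not improved:
            break  # local optimum: no single replacement helps
    return selection, best


# Two stripes, four intact racks; every candidate reads from exactly two racks,
# so the total cross-rack traffic is the same for all of them.
options = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
selection, best = balance([options, options], num_racks=4)
```

In this example the initial choice loads two racks with two chunks each (rate 2), and one replacement spreads the traffic evenly over all four racks (rate 1). A search of this kind can stop at a local optimum, so it gives no guarantee of reaching the global minimum.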


Implementation and Evaluation:

  1. Cross-Rack Repair Traffic: evaluate the amount of cross-rack repair traffic when recovering a single lost chunk.
  2. Load Balancing: measure the load balancing rate.
  3. Computation Time and Transmission Time

Strength (Contributions of the paper)

  1. This paper identifies the open issues that are not addressed by existing studies on single failure recovery.
  2. It proposes CAR, a new cross-rack-aware single failure recovery algorithm for a CFS setting.
  3. It also implements CAR and conducts extensive testbed experiments based on different CFS settings with up to 20 nodes.

Weakness (Limitations of the paper)

  1. First, this paper does not provide the details of how to implement this recovery scheme in a practical system, e.g., how to achieve the partial decoding.
  2. The idea of this paper is not very novel, although it is easy to understand. I think the performance of this scheme depends heavily on the layout.

Future Works

  1. A serious open issue is how to reduce the overhead of partial decoding inside a rack.
  2. For a specific layout, this scheme may lead to a skewed workload on a rack.