Venue | Category |
---|---|
Cluster'20 | Delta Compression |
Exploring the Potential of Fast Delta Encoding: Marching to a Higher Compression Ratio1. SummaryMotivation of this paperGdeltaImplementation and Evaluation2. Strength (Contributions of the paper)3. Weakness (Limitations of the paper)4. Some Insights (Future work)
Motivation
delta compression is costly because of its time-consuming word-matching operations for delta calculation
existing delta encoding approaches
Observation
Design goals
Main idea
Gear-based rolling hash
use Gear hash as a rolling hash for word-matching in delta encoding
Array-based indexing scheme
use the most-significant bits of fingerprint as the index of a hash array by the right-shift operation
traditional hash table's memory cost is about 2-3 times more than the array-based index
Encoding and Decoding
the delta chunk is recorded by the "Copy" and "Insert" instructions in turn
Container-based further compression
put several delta chunks altogether in a container with a size of 2 MiB of 4 MiB
compress the section of "Copy" and "Insert" instructions at the bit-level using FSE and compress the data part in the "Insert" instructions at both the world- and bit-level using zstd
Baseline: Xdelta and Zdelta
Datasets
first apply FastCDC to generate the data chunks and detects similar chunks with a super-feature based resemble detection method (FAST'12)
Main results
Gear hash can achieve nearly the same hashing efficiency as Adler32
Array-based indexing achieves a much higher encoding speed as its simple design
Container-based further compression improves the compression ratio by 10%-90%
Very simple design
Background of lossless data reduction
there are mainly three technology roadmaps for lossless data reduction
delta compression bridges the gap in compression ratio (between data deduplication and general compression)
Delta compression (two steps)
Resemblance detection: generate features and then search similar chunks according to their features
Delta encoding: refers to the process of calculating the delta among the very similar chunks/files