| Venue | Category |
| --- | --- |
| FAST'13 | Compression |
# To Zip or Not to Zip: Effective Resource Usage for Real-Time Compression

1. Summary
   - Motivation of this paper
   - Compression Estimation/Test
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
- Motivation
  - Adding compression on the data path consumes scarce CPU and memory resources on the storage system.
  - The paper targets real-time compression for block and file primary storage systems.
  - It is therefore advisable to avoid compressing what the authors refer to as "incompressible" data.
- Main problem
  - Identifying incompressible data in an efficient manner, allowing systems to make effective use of their limited resources.
- The macro-scale solution
  - Operates on an entire volume or file system of a storage system.
  - The general framework: sample m random locations and estimate the contribution of each sampled byte.
  - Real-life implementations of compression algorithms are subject to locality limits (a chunk can be used to define the locality).
  - The contribution of a byte is defined as the compression ratio of its locality.
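The macro-scale framework can be sketched in a few lines: pick m random locations, compress the chunk (the locality) around each one, and average the per-locality ratios. This is a minimal illustration, not the paper's implementation; the function name, chunk size, and sample count are assumptions.

```python
import random
import zlib

def estimate_volume_ratio(data: bytes, m: int = 100, chunk: int = 4096) -> float:
    """Estimate a volume's compression ratio by sampling m random localities.

    The contribution of a sampled byte is approximated by the compression
    ratio of the chunk around it (its locality); the volume-wide estimate
    is the mean contribution. Defaults here are illustrative only.
    """
    ratios = []
    for _ in range(m):
        # pick a random locality inside the volume
        start = random.randrange(max(1, len(data) - chunk + 1))
        locality = data[start:start + chunk]
        # contribution = compressed size / original size of the locality
        ratios.append(len(zlib.compress(locality)) / len(locality))
    return sum(ratios) / len(ratios)
```

Averaging over sampled localities is much cheaper than compressing the whole volume, at the cost of some estimation error that shrinks as m grows.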
- The micro-scale solution
  - Operates on a single write: 8KB, 16KB, 32KB, or 128KB.
  - Recommends whether to zip or not to zip (the test has to be much faster than actual compression).
  - The heuristic method:
    - Collect a set of basic indicators about the chunk.
    - Sample at most 2KB of data per write buffer.
    - Define several thresholds against which to test the indicators.
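A hedged sketch of such a heuristic test: it samples at most 2KB from the write buffer and checks two cheap indicators (distinct-byte count and byte entropy) against thresholds. The specific indicators and threshold values below are illustrative choices, not the paper's tuned ones.

```python
import math

def should_compress(buf: bytes, sample_size: int = 2048,
                    max_core_set: int = 50, max_entropy: float = 7.0) -> bool:
    """Cheap zip/no-zip recommendation from a small sample of the buffer.

    Indicators and thresholds are illustrative, not the paper's.
    """
    # take an evenly spread sample of at most sample_size bytes
    step = max(1, len(buf) // sample_size)
    sample = buf[::step][:sample_size]

    counts = [0] * 256
    for b in sample:
        counts[b] += 1
    distinct = sum(1 for c in counts if c)
    if distinct <= max_core_set:
        return True          # few distinct symbols: likely compressible

    n = len(sample)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c)
    return entropy <= max_entropy  # near 8 bits/byte: skip compression
```

The whole test touches at most 2KB and does no actual compression, which is what makes it viable on the write path.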
- Implementation
- Evaluation
  - Compression ratio vs. the number of samples.
  - Running time vs. compression ratio trade-off.
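The same trade-off can be seen directly with zlib's compression levels, where higher levels spend more CPU time for smaller output. A small illustrative demo, not the paper's benchmark:

```python
import zlib

# Higher zlib levels trade running time for output size on the same buffer.
data = b"the quick brown fox jumps over the lazy dog " * 5000

for level in (1, 6, 9):
    out = zlib.compress(data, level)
    print(f"level {level}: {len(out)} bytes "
          f"({len(out) / len(data):.3f} of original)")
```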
- A bit about compression techniques
  - The paper focuses on Zlib, a popular compression engine (used in zip/gzip), which combines LZ77 dictionary matching with Huffman coding.
- Existing solutions for estimating compression ratios
  - By file extension.
  - By looking at the actual data:
    - Scan and compress everything.
    - Look at a prefix (of a file or a chunk) and deduce about the rest.
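A prefix-based estimator is simple to sketch. The function below (name and prefix size are assumptions, not the paper's) compresses only the head of a buffer and extrapolates to the rest:

```python
import zlib

def prefix_estimate(buf: bytes, prefix_len: int = 4096) -> float:
    """Estimate a buffer's compression ratio from its prefix only.

    Compresses just the first prefix_len bytes and assumes the rest
    behaves the same -- cheap, but it can be badly wrong when
    compressibility changes past the prefix (e.g. a text header
    followed by an already-compressed payload).
    """
    head = buf[:prefix_len]
    return len(zlib.compress(head)) / len(head)
```

The failure mode is exactly the one the random-sampling framework avoids: a prefix is a single, biased sample of the data.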
- Putting it all together
  - When most of the data is compressible.
  - When a significant percentage is incompressible.
  - When most of the data is incompressible.
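These three regimes suggest a simple policy switch. The rule below is a hypothetical illustration with made-up cut-offs, not the paper's algorithm:

```python
def choose_policy(incompressible_fraction: float) -> str:
    """Map a macro-scale estimate of the incompressible fraction to a
    data-path policy. The 10%/90% cut-offs are illustrative only."""
    if incompressible_fraction < 0.1:
        # most data compresses well: zip everything unconditionally
        return "compress-all"
    if incompressible_fraction < 0.9:
        # mixed workload: run the cheap per-write heuristic first
        return "micro-test-per-write"
    # almost nothing compresses: skip compression entirely
    return "compress-nothing"
```

The macro-scale estimate picks the regime, and only the mixed regime pays the per-write cost of the micro-scale test.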