| Venue | Category |
| --- | --- |
| FAST'13 | Compression |
# To Zip or Not to Zip: Effective Resource Usage for Real-Time Compression

1. Summary
   - Motivation of this paper
   - Compression Estimation/Test
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
- Motivation
  - Adding compression on the data path consumes scarce CPU and memory resources on the storage system.
  - The paper targets real-time compression for block and file primary storage systems.
  - It is therefore advisable to avoid compressing what the authors refer to as "incompressible" data.
- Main problem
  - Identifying incompressible data in an efficient manner, allowing systems to make effective use of their limited resources.
- The macro-scale solution
  - Operates on an entire volume or file system of a storage system.
  - The general framework: sample m random locations and estimate the contribution of each sampled byte.
  - Real-life implementations of compression algorithms are subject to locality limits (a chunk can be used to define the locality).
  - The contribution of a byte is defined as the compression ratio of its locality.
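The macro-scale framework can be sketched in a few lines: pick m random locations, compress the chunk (the locality) around each one, and average the per-locality ratios. This is a minimal illustration, not the paper's implementation; the function name, chunk size, and sample count are assumptions.

```python
import random
import zlib

def estimate_volume_ratio(data: bytes, m: int = 100, chunk: int = 4096) -> float:
    """Estimate a volume's compression ratio by sampling m random localities.

    The contribution of a sampled byte is approximated by the compression
    ratio of the chunk around it (its locality); the volume-wide estimate
    is the mean contribution. Defaults here are illustrative only.
    """
    ratios = []
    for _ in range(m):
        # pick a random locality inside the volume
        start = random.randrange(max(1, len(data) - chunk + 1))
        locality = data[start:start + chunk]
        # contribution = compressed size / original size of the locality
        ratios.append(len(zlib.compress(locality)) / len(locality))
    return sum(ratios) / len(ratios)
```

Averaging over sampled localities is much cheaper than compressing the whole volume, at the cost of some estimation error that shrinks as m grows.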
- The micro-scale solution
  - Operates on a single write: 8KB, 16KB, 32KB, or 128KB.
  - Recommends whether to zip or not to zip (the test has to be much faster than actual compression).
  - The heuristic method:
    - Collect a set of basic indicators about the chunk.
    - Sample at most 2KB of data per write buffer.
    - Define several thresholds against which to test the indicators.
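A hedged sketch of such a heuristic test: it samples at most 2KB from the write buffer and checks two cheap indicators (distinct-byte count and byte entropy) against thresholds. The specific indicators and threshold values below are illustrative choices, not the paper's tuned ones.

```python
import math

def should_compress(buf: bytes, sample_size: int = 2048,
                    max_core_set: int = 50, max_entropy: float = 7.0) -> bool:
    """Cheap zip/no-zip recommendation from a small sample of the buffer.

    Indicators and thresholds are illustrative, not the paper's.
    """
    # take an evenly spread sample of at most sample_size bytes
    step = max(1, len(buf) // sample_size)
    sample = buf[::step][:sample_size]

    counts = [0] * 256
    for b in sample:
        counts[b] += 1
    distinct = sum(1 for c in counts if c)
    if distinct <= max_core_set:
        return True          # few distinct symbols: likely compressible

    n = len(sample)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c)
    return entropy <= max_entropy  # near 8 bits/byte: skip compression
```

The whole test touches at most 2KB and does no actual compression, which is what makes it viable on the write path.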
- Implementation
- Evaluation
  - Compression ratio vs. the number of samples.
  - Running time vs. compression ratio trade-off.
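The same trade-off can be seen directly with zlib's compression levels, where higher levels spend more CPU time for smaller output. A small illustrative demo, not the paper's benchmark:

```python
import zlib

# Higher zlib levels trade running time for output size on the same buffer.
data = b"the quick brown fox jumps over the lazy dog " * 5000

for level in (1, 6, 9):
    out = zlib.compress(data, level)
    print(f"level {level}: {len(out)} bytes "
          f"({len(out) / len(data):.3f} of original)")
```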
- A bit about compression techniques
  - The paper focuses on Zlib, a popular compression engine (used in zip/gzip), which combines LZ77 dictionary matching with Huffman coding.
- Existing solutions for estimating compression ratios
  - By file extension.
  - By looking at the actual data:
    - Scan and compress everything.
    - Look at a prefix (of a file or a chunk) and deduce about the rest.
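A prefix-based estimator is simple to sketch. The function below (name and prefix size are assumptions, not the paper's) compresses only the head of a buffer and extrapolates to the rest:

```python
import zlib

def prefix_estimate(buf: bytes, prefix_len: int = 4096) -> float:
    """Estimate a buffer's compression ratio from its prefix only.

    Compresses just the first prefix_len bytes and assumes the rest
    behaves the same -- cheap, but it can be badly wrong when
    compressibility changes past the prefix (e.g. a text header
    followed by an already-compressed payload).
    """
    head = buf[:prefix_len]
    return len(zlib.compress(head)) / len(head)
```

The failure mode is exactly the one the random-sampling framework avoids: a prefix is a single, biased sample of the data.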
- Putting it all together
  - When most of the data is compressible.
  - When a significant percentage is incompressible.
  - When most of the data is incompressible.
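These three regimes suggest a simple policy switch. The rule below is a hypothetical illustration with made-up cut-offs, not the paper's algorithm:

```python
def choose_policy(incompressible_fraction: float) -> str:
    """Map a macro-scale estimate of the incompressible fraction to a
    data-path policy. The 10%/90% cut-offs are illustrative only."""
    if incompressible_fraction < 0.1:
        # most data compresses well: zip everything unconditionally
        return "compress-all"
    if incompressible_fraction < 0.9:
        # mixed workload: run the cheap per-write heuristic first
        return "micro-test-per-write"
    # almost nothing compresses: skip compression entirely
    return "compress-nothing"
```

The macro-scale estimate picks the regime, and only the mixed regime pays the per-write cost of the micro-scale test.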