| Venue | Category |
| --- | --- |
| FAST'19 | Sketch + Deduplication |
# Sketching Volume Capacities in Deduplicated Storage

1. Summary
   - Motivation of this paper
   - Volume Sketch
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Future Works
## 1. Summary

### Motivation of this paper

This work focuses on technologies and tools for managing storage capacities in storage systems with deduplication.
The goal is to analyze capacities in a deduplicated storage environment, e.g., a volume's reclaimable capacity and its attributed capacity.
The key issue: once deduplication is brought into the equation, the capacity of a volume is no longer a known quantity, for any single volume or any combination of volumes.
This work addresses gaps in the reporting and management of data that has already been deduplicated, which prior works do not address.
### Volume Sketch

This paper borrows techniques from the realm of streaming algorithms (sketches).
The metadata of each volume is sampled using a content-based sampling technique to produce a capacity sketch of the volume. Key property: the sketch is much smaller than the actual metadata, yet contains enough information to evaluate the volume's capacity properties.
For each data chunk, examine its fingerprint (the hash) and include it in the sketch only if the fingerprint contains k leading zeros, for a parameter k. The resulting sample ratio, 1/2^k, is also called the sketch factor. This is the tradeoff: the resources required to handle the sketches vs. the accuracy they provide. Because sampling is based on content, the sampled chunks represent all of the data in the system.
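The content-based sampling rule can be sketched as follows. This is a minimal illustration, not the paper's implementation: the value of k, the use of SHA-256 as the fingerprint, and all function names are assumptions.

```python
import hashlib

K = 13  # assumed sampling parameter: keep a chunk iff its fingerprint
        # starts with K zero bits, i.e. a sample ratio of 1 / 2**K


def fingerprint(chunk: bytes) -> int:
    """Hash a chunk to a 256-bit integer fingerprint (SHA-256 for illustration)."""
    return int.from_bytes(hashlib.sha256(chunk).digest(), "big")


def sampled(fp: int, k: int = K) -> bool:
    """Content-based sampling: include fp iff its top k bits are all zero."""
    return fp >> (256 - k) == 0


def estimate_unique_chunks(sketch: set, k: int = K) -> int:
    """Scale the sampled count back up by the sketch factor 2**k."""
    return len(sketch) * (2 ** k)
```

Because inclusion depends only on the chunk's content, identical chunks are either always or never sampled, so sketches taken from different volumes remain comparable.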
It also collects further parameters in the sketch:
- Reference count: the number of times the data chunk with this fingerprint was written in the data set.
- Physical count: the number of physical copies of the chunk stored during writes to the data set.
Attributed capacity measures how much a volume is involved in deduplication: a per-volume breakdown of space savings from deduplication and compression. For example, if a data chunk's reference count is 3, with 2 references originating from volume A and one from volume B, then its space is split in a 2/3 and 1/3 fashion between volumes A and B, respectively.
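The attribution rule from the example above can be written down directly. The helper below is hypothetical and assumes per-chunk, per-volume reference counts are available from the sketch.

```python
from collections import Counter


def attribute_capacity(chunk_refs, chunk_size):
    """Split each chunk's physical space among volumes in proportion to
    their reference counts (hypothetical helper, per the paper's example).

    chunk_refs: {fingerprint: Counter({volume: reference_count})}
    Returns {volume: attributed_bytes}.
    """
    attributed = Counter()
    for fp, refs in chunk_refs.items():
        total = sum(refs.values())
        for vol, count in refs.items():
            attributed[vol] += chunk_size * count / total
    return dict(attributed)


# A chunk referenced 3 times, twice by volume A and once by volume B,
# is attributed 2/3 of its space to A and 1/3 to B.
refs = {0xAB: Counter({"A": 2, "B": 1})}
```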
This is very similar to the MSST'12 work.
### Implementation and Evaluation

The sketch data is pulled out of the storage system onto an adjacent management server, where it is analyzed. This avoids using CPU and memory resources in the storage system that could otherwise be spent on serving I/O requests.
Each slice has its own sketch, which is maintained by the owning process.
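One plausible way to combine the per-slice sketches into a system-wide view is to sum reference and physical counts per fingerprint. The merge rule and data layout below are assumptions, not taken from the paper.

```python
from collections import defaultdict


def merge_slice_sketches(slice_sketches):
    """Combine per-slice sketches into one system-wide sketch by summing
    reference and physical counts per fingerprint (assumed merge rule).

    Each slice sketch maps fingerprint -> (ref_count, phys_count).
    """
    merged = defaultdict(lambda: [0, 0])
    for sketch in slice_sketches:
        for fp, (refs, phys) in sketch.items():
            merged[fp][0] += refs
            merged[fp][1] += phys
    return {fp: tuple(counts) for fp, counts in merged.items()}
```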
Ingest phase: for each volume, it collects all of its relevant hashes while merging and aggregating multiple appearances of the same hash, using two structures:
- A full-system hash table: all of the hashes seen in the full system
- Volume-level structures: a B-tree per volume which aggregates that volume's hashes (each entry stores only a pointer to the entry in the full table)
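The two ingest structures above can be sketched like this. A plain dict stands in for the per-volume B-tree and all names are hypothetical; the point is that volume-level entries hold pointers into the shared full-system table rather than copies.

```python
class IngestIndex:
    """Ingest-phase structures (assumed layout): one full-system hash table
    of all sampled hashes, plus per-volume indexes whose values point at
    the shared entries (a plain dict stands in for the paper's B-tree)."""

    def __init__(self):
        self.full_table = {}  # fingerprint -> shared entry
        self.volumes = {}     # volume -> {fingerprint -> shared entry}

    def ingest(self, volume, fp):
        entry = self.full_table.setdefault(fp, {"refs": 0})
        entry["refs"] += 1    # aggregate repeated appearances of the hash
        self.volumes.setdefault(volume, {})[fp] = entry  # pointer, not a copy

    def volume_hashes(self, volume):
        # The B-tree keeps hashes ordered; emulate that with a sorted view.
        return sorted(self.volumes.get(volume, {}))
```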
Evaluation covers three data sets:
- Synthetic data: generated using the VDBench benchmarking suite
- UBC data traces
- Production data in the field
## 3. Weakness (Limitations of the paper)

- Deduplication opportunities that were not identified by the system are missed, e.g., when the system chooses to forgo a deduplication opportunity to avoid extensive data fragmentation.
- Storage systems are dynamic, and one cannot expect to freeze them at a specific state; this is a source of inaccuracy, so the time window in which the sketches are extracted needs to be reduced.
## 4. Future Works

- Sketch distribution and concurrency