| Venue | Category |
| --- | --- |
| USENIX ATC'19 | Cloud Deduplication |
# Data Domain Cloud Tier: Backup here, backup there, deduplicated everywhere!

1. Summary
   - Motivation of this paper
   - Cloud Tier
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some insights
Object storage in public and private clouds provides cost-effective, on-demand, always available storage.
- Customers back up their primary data to an on-premises appliance (typically retained for 30-90 days).
- Selected backups are transitioned to cloud storage (retained long term, 1-7 years).
- A file is represented by a Merkle tree, with user data as variable-sized chunks at the bottom level of the tree (referred to as L0 chunks).
- SHA-1 fingerprints of L0 chunks are grouped together at the next higher level of the tree to form chunks (L1 chunks).
- The top of the tree is a single L6 chunk.
- Chunks above L0 are collectively referred to as LP chunks.
- If two files are exactly the same, they have the same L6 fingerprint; if two files only partially overlap in content, some branches of the tree will be identical.
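A minimal sketch of this tree structure, assuming a toy fan-out of 4 fingerprints per parent (the real fan-out is much larger and chunking is content-defined, not fixed):

```python
import hashlib

def fingerprint(data: bytes) -> bytes:
    """SHA-1 fingerprint of a chunk, as in the paper."""
    return hashlib.sha1(data).digest()

def build_tree(l0_chunks):
    """Build Merkle-tree levels bottom-up: each higher-level chunk is the
    concatenation of the fingerprints of the chunks below it."""
    FANOUT = 4  # illustrative; real trees pack many fingerprints per LP chunk
    levels = [[fingerprint(c) for c in l0_chunks]]   # L0 fingerprints
    while len(levels[-1]) > 1:
        fps = levels[-1]
        parents = [fingerprint(b"".join(fps[i:i + FANOUT]))
                   for i in range(0, len(fps), FANOUT)]
        levels.append(parents)
    return levels  # levels[-1][0] is the root (L6) fingerprint

a = build_tree([b"chunk%d" % i for i in range(16)])
b = build_tree([b"chunk%d" % i for i in range(16)])
assert a[-1][0] == b[-1][0]   # identical files -> identical root fingerprint
```

Files that share only a prefix of chunks end up with identical low-level branches but different roots, which is what makes branch-level deduplication possible.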
Metadata and data separation: L0-Containers and LP-Containers. Because data and metadata chunks are segregated, the locality of L0 chunks is preserved, which results in better read performance. Each container has a data section (chunks) and a metadata section (fingerprints of the chunks).
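The container layout above can be modeled as follows; this is a toy sketch with illustrative names, not the actual on-disk format:

```python
from dataclasses import dataclass, field
import hashlib

@dataclass
class Container:
    """Toy container model: a data section holding chunks and a metadata
    section holding their fingerprints (field names are illustrative)."""
    data_section: list = field(default_factory=list)       # raw chunks
    metadata_section: list = field(default_factory=list)   # SHA-1 fingerprints

    def append(self, chunk: bytes):
        self.data_section.append(chunk)
        self.metadata_section.append(hashlib.sha1(chunk).digest())

# An L0-Container holds data (L0) chunks; an LP-Container would hold LP chunks.
l0 = Container()
for c in (b"aaa", b"bbb"):
    l0.append(c)
assert len(l0.metadata_section) == 2
```

Keeping fingerprints in a separate section is what lets the system later copy just the metadata sections into Metadata-Containers without touching the data.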
- use it as extra capacity
- use it for long-term archival of selected data
Metadata-Container: a third type of container, which stores the metadata sections from multiple L0- and LP-Containers.
The metadata sections of containers are read during deduplication and garbage collection and require quick access, so Metadata-Containers are mirrored: stored on local storage as well as in cloud storage.
Main difference: critical cloud-tier metadata is stored on the local storage of the Data Domain system to improve performance and reduce cost.
Larger objects result in less metadata overhead and also lower transaction costs, since cloud storage providers charge per-object transaction fees. The paper uses the terms objects and containers interchangeably.
A perfect hash function is a collision-free mapping that maps each key in a static set to a unique position in a bit vector (a 1:1 mapping). The live-fingerprint structure is a perfect hash function plus a bit vector.
Traversing the Merkle tree: enumeration is done in a breadth-first manner.
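A minimal sketch of breadth-first enumeration, assuming a toy in-memory index `children` that maps an LP fingerprint to the fingerprints it contains (the real system reads these from container metadata, level by level):

```python
from collections import deque

def enumerate_live(roots, children):
    """Breadth-first walk of Merkle trees: starting from the root (L6)
    fingerprints, collect every reachable fingerprint exactly once."""
    live, queue = set(), deque(roots)
    while queue:
        fp = queue.popleft()
        if fp in live:
            continue          # already visited via another branch
        live.add(fp)
        queue.extend(children.get(fp, ()))  # L0 fingerprints have no children
    return live

children = {"L6": ["L1a", "L1b"], "L1a": ["c1", "c2"], "L1b": ["c2", "c3"]}
assert enumerate_live(["L6"], children) == {"L6", "L1a", "L1b", "c1", "c2", "c3"}
```

Note that the shared chunk `c2` is enumerated only once, which is exactly the deduplication-aware behavior a level-order walk gives.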
Reduces the amount of data transferred to the cloud tier by deduplicating against chunks already present in the cloud tier. In this case there is no need to generate a perfect hash vector; the Merkle tree is scanned directly.
Garbage collection uses a mark-and-sweep algorithm.
It requires implementing new APIs at cloud providers: a compression region in a cloud container is deleted only when it is completely unreferenced, rather than deleting individual chunks.
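The sweep phase at compression-region granularity can be sketched as follows (names are illustrative; `live_fps` would come from the mark phase's perfect hash vector):

```python
def sweep_regions(container_regions, live_fps):
    """Region-granularity sweep: a compression region is deletable only
    when *none* of its chunks is still referenced; a region with any
    live chunk is kept whole, which is why some dead space survives."""
    deletable = []
    for region_id, fps in container_regions.items():
        if not any(fp in live_fps for fp in fps):
            deletable.append(region_id)
    return deletable

regions = {"r1": ["a", "b"], "r2": ["c", "d"], "r3": ["e"]}
live = {"a", "e"}
assert sweep_regions(regions, live) == ["r2"]   # only r2 is fully dead
```

This also makes concrete the cleaning-efficiency loss the paper analyzes: the dead chunk `b` in `r1` cannot be reclaimed because its region still holds a live chunk.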
GC analysis: cleaning-efficiency loss due to compression-region-granularity cleaning.
- Freeable space estimation
- File migration and seeding performance
- Garbage collection performance
- File migration and restore from the cloud
- Builds upon a previous technique using perfect hashes and sequential storage scans.
- Transfers only the unique content to the cloud tier, preserving the benefits of deduplication during transfer.
- Handles the latency and financial cost of reading data from the cloud back to the on-premises appliance.
- Needs an algorithm to estimate the amount of space unique to a set of files.
The authors have not found customer demand for convergent encryption, nor for stronger encryption requirements for cloud storage than for on-premises storage.
If multiple encryption keys are selected, customers accept a potential loss of cross-dataset deduplication.
If an active tier is lost, the backup copies migrated to object storage can be recovered.
The perfect hash construction assumes a static, fixed fingerprint set.