| Venue   | Category        |
| ------- | --------------- |
| ToCC'20 | Data Encryption |
# Duplicacy: A New Generation of Cloud Backup Tool Based on Lock-Free Deduplication

1. Summary
   - Motivation of this paper
   - Lock-free deduplication
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
## Motivation

- There is an ever-increasing demand for cross-client deduplication solutions that save network bandwidth, lower storage costs, and improve backup speeds.
- Existing solutions depend on lock-based approaches relying on a centralized chunk database, which makes it hard to delete a chunk in the presence of multiple clients.
## Main idea

- Files are split into chunks with a variable-size chunking algorithm; each chunk is stored individually in a file that uses the hash of the chunk content as the file name.
- Naming chunks by their content hashes makes the file storage itself serve as the chunk database: whether a chunk already exists can be answered with a plain file lookup.
- Chunking is applied after packing together all files, as if it were creating one big tar archive, so the number of chunks is dependent on the total size of the files rather than on the number of files (a minimal chunking sketch follows this list).
- Backup manifest files are also split into chunks using the same variable-size chunking algorithm.
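Below is a minimal sketch of variable-size (content-defined) chunking using a simple polynomial rolling hash. The constants, the hash function, and all names here are illustrative assumptions, not Duplicacy's actual parameters:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Illustrative parameters, not Duplicacy's actual constants.
const (
	window   = 48              // rolling-hash window in bytes
	mask     = (1 << 20) - 1   // breakpoint when hash&mask == 0 (~1 MiB average)
	minChunk = 256 * 1024      // lower bound on chunk size
	maxChunk = 4 * 1024 * 1024 // upper bound on chunk size
	base     = 0x100000001b3   // polynomial base (an arbitrary odd constant)
)

// pow is base^window: the weight of the byte that leaves the window.
var pow = func() uint64 {
	p := uint64(1)
	for i := 0; i < window; i++ {
		p *= base
	}
	return p
}()

// chunkify splits the packed stream at content-defined breakpoints found by
// a polynomial rolling hash, bounded below and above by min/max chunk sizes.
func chunkify(data []byte) [][]byte {
	var chunks [][]byte
	start, hash := 0, uint64(0)
	for i := 0; i < len(data); i++ {
		hash = hash*base + uint64(data[i]) // byte enters the window
		if i-start >= window {
			hash -= uint64(data[i-window]) * pow // byte leaves the window
		}
		size := i - start + 1
		if (size >= minChunk && hash&mask == 0) || size >= maxChunk {
			chunks = append(chunks, data[start:i+1])
			start, hash = i+1, 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 3*1024*1024)
	rand.Read(data) // stand-in for the packed stream of all files
	for _, c := range chunkify(data) {
		sum := sha256.Sum256(c) // the chunk's file name is its content hash
		fmt.Printf("%s...  %d bytes\n", hex.EncodeToString(sum[:8]), len(c))
	}
}
```

Because breakpoints depend only on nearby content, an insertion or deletion disturbs only the chunks around the edit; all other chunks keep their hashes and deduplicate against earlier backups.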
## Key points

- Eliminating the need for a chunk database had not been attempted before by any backup tool.
## Lock-free situations

- Concurrent backup: to decide whether a chunk must be uploaded, a client just performs a file lookup via the file storage API; identical chunks map to the same file name, so no coordination between clients is needed.
- Concurrent deletion: a two-step fossil deletion scheme (sketched after this list).
  - Fossil collection: aggressively identifies unreferenced chunks based only on existing (completed) backups, ignoring backups that may still be in progress. It uses all known backup manifest files to identify unreferenced chunks and, instead of deleting these chunks immediately, performs a renaming operation that turns them into fossils.
  - Fossil deletion: runs only after every client has finished a new backup since the collection step. It lists all chunks referenced by these new backups and checks the fossils against them: a fossil referenced by a new backup is renamed back into a chunk, while the remaining fossils can be safely deleted.
- Concurrent backup and deletion: thanks to the two-step scheme, deletion remains safe even while other clients are running backups.
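Here is a toy sketch of the two-step scheme against an in-memory stand-in for the storage. The type and method names are hypothetical; a real implementation would issue list/rename/delete calls through the backend's file API:

```go
package main

import (
	"fmt"
	"strings"
)

// memStorage is a toy in-memory stand-in for the minimal file API that the
// lock-free design assumes (list/rename/delete); illustrative only.
type memStorage struct{ files map[string]bool }

func (m *memStorage) listChunks() (out []string) {
	for f := range m.files {
		if strings.HasPrefix(f, "chunks/") {
			out = append(out, strings.TrimPrefix(f, "chunks/"))
		}
	}
	return
}
func (m *memStorage) rename(from, to string) { delete(m.files, from); m.files[to] = true }
func (m *memStorage) remove(path string)     { delete(m.files, path) }

// Step 1, fossil collection: based only on manifests of *completed* backups
// (in-progress backups are deliberately ignored), rename each unreferenced
// chunk into a fossil instead of deleting it.
func collectFossils(s *memStorage, referenced map[string]bool) (fossils []string) {
	for _, c := range s.listChunks() {
		if !referenced[c] {
			s.rename("chunks/"+c, "fossils/"+c)
			fossils = append(fossils, c)
		}
	}
	return
}

// Step 2, fossil deletion: run only after every client has finished a new
// backup since step 1. Fossils referenced by those new backups are
// resurrected (renamed back); the rest are permanently deleted.
func deleteFossils(s *memStorage, fossils []string, newRefs map[string]bool) {
	for _, c := range fossils {
		if newRefs[c] {
			s.rename("fossils/"+c, "chunks/"+c) // resurrect
		} else {
			s.remove("fossils/" + c)
		}
	}
}

func main() {
	s := &memStorage{files: map[string]bool{
		"chunks/a": true, "chunks/b": true, "chunks/c": true,
	}}
	fossils := collectFossils(s, map[string]bool{"a": true}) // b, c become fossils
	deleteFossils(s, fossils, map[string]bool{"c": true})    // c resurrected, b deleted
	fmt.Println(s.files)                                     // map[chunks/a:true chunks/c:true]
}
```

Renaming rather than deleting is what keeps the scheme lock-free: a backup that was in progress during collection and still needs a fossil can be detected in step 2, and the fossil is resurrected before any data is lost.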
## Duplicacy (implementation)

- The command-line version is written in the Go language.
- A single backup tool can support all the different storage services, since the design only requires a basic file API from the backend.
- Features: incremental backup, full snapshots, encryption.
- Supports both fixed-size chunking and variable-size chunking.
  - Fixed-size chunking is intended for large backup files that are unlikely to be subject to insertions and deletions, which would otherwise shift every subsequent fixed-size chunk boundary.
- Chunk-based approaches provide native support for incremental backup and full snapshots: every backup is a complete snapshot, yet only chunks not already on the storage are uploaded.
- Cache
  - A file lookup is performed before uploading each chunk.
  - To reduce file-lookup API calls, Duplicacy maintains an in-memory cache storing the hashes of chunks referenced by the last backup from the same client; a chunk whose hash is in the cache is already on the storage and can be skipped outright (see the sketch below).
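A sketch of the decision path before uploading a chunk, assuming a hypothetical `FileStorage` interface (not Duplicacy's actual one):

```go
package backup

// FileStorage is the minimal backend API assumed for this sketch; the
// method names are hypothetical.
type FileStorage interface {
	Exists(path string) (bool, error)
	Upload(path string, data []byte) error
}

// uploadChunk uploads a chunk only if needed. Chunks are named by content
// hash, so the storage itself acts as the chunk database: a plain
// file-existence check replaces any lock or centralized index. The cache
// holds hashes referenced by this client's last backup, so most chunks of
// an incremental backup skip even the lookup.
func uploadChunk(storage FileStorage, cache map[string]bool, hash string, data []byte) error {
	if cache[hash] {
		return nil // referenced by the previous backup: already on the storage
	}
	exists, err := storage.Exists("chunks/" + hash)
	if err != nil {
		return err
	}
	if !exists { // another client may have uploaded it concurrently
		if err := storage.Upload("chunks/"+hash, data); err != nil {
			return err
		}
	}
	cache[hash] = true
	return nil
}
```

Only a miss in both the cache and the storage triggers an upload, so cross-client deduplication happens naturally: a chunk uploaded by any client is found by the lookup.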
- Encryption and compression
  - Each chunk can be individually compressed and encrypted.
  - HMAC-SHA256 is applied to generate the chunk hash, so the chunk file names on the storage do not leak plain content hashes.
  - The chunk content is encrypted by AES-GCM with an encryption key that is the HMAC-SHA256 of the chunk hash.
  - A master key is derived with the PBKDF2 function from the storage password chosen by the user (a sketch of the whole scheme follows this list).
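A runnable sketch of that per-chunk scheme using Go's standard crypto packages plus golang.org/x/crypto/pbkdf2. The salt, the iteration count, and the use of the master key for both HMACs are simplifying assumptions, not Duplicacy's exact key management:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"

	"golang.org/x/crypto/pbkdf2"
)

func main() {
	password := []byte("storage password")
	salt := []byte("per-storage salt") // assumed to be stored with the config
	// Master key derived from the user's storage password via PBKDF2.
	masterKey := pbkdf2.Key(password, salt, 16384, 32, sha256.New)

	chunk := []byte("chunk content (possibly compressed first) ...")

	// Chunk hash: HMAC-SHA256 over the content. Because the hash is keyed,
	// chunk file names reveal nothing about plain content hashes.
	h := hmac.New(sha256.New, masterKey)
	h.Write(chunk)
	chunkHash := h.Sum(nil) // also used as the chunk's file name

	// Per-chunk encryption key: the HMAC-SHA256 of the chunk hash.
	k := hmac.New(sha256.New, masterKey)
	k.Write(chunkHash)
	chunkKey := k.Sum(nil)

	// Encrypt the chunk with AES-256-GCM under its per-chunk key.
	block, err := aes.NewCipher(chunkKey)
	if err != nil {
		panic(err)
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	sealed := gcm.Seal(nonce, nonce, chunk, nil) // nonce prepended

	fmt.Println("chunk file name:", hex.EncodeToString(chunkHash))
	fmt.Println("encrypted bytes:", len(sealed))
}
```

Under this sketch the chunk's file name (its keyed hash) is all that is needed to re-derive the chunk key, so any client holding the master key can fetch and decrypt a chunk by name.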
## Evaluation

- Trace:
- Backup performance:
- Storage efficiency:
## Strengths

- Ubiquitous integration: wide support for cloud storage systems.
- Cross-client deduplication.
## Existing backup systems (open-source)

- Duplicity: http://duplicity.nongnu.org/
- Restic: https://github.com/restic/restic
- Attic (Borg): written in Python
- Obnam: https://obnam.org/
- Bupstash: https://bupstash.io/
## Drawback of rsync

- rsync makes use of a fixed-size chunking algorithm in which the chunk size is determined by the file size.
- It adds a simple rolling hash to detect insertions and deletions at arbitrary byte offsets (a sketch follows this list).
- This differs from CDC chunking, where a lookup to check for duplicates is performed only after a breakpoint has been identified; rsync instead must test for a match at every offset.
- Incremental backups become dependent on previous backups, unlike chunk-based snapshots, which are self-contained.
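For contrast with CDC, here is a simplified sketch of the rsync-style weak rolling checksum (an Adler-32 variant): it slides one byte in O(1), which is what lets rsync test for a duplicate block at every byte offset after insertions or deletions:

```go
package main

import "fmt"

// rsync-style weak rolling checksum: a = sum of bytes, b = sum of running
// sums, both mod 2^16. Simplified for illustration.
const mod = 1 << 16

type rolling struct {
	a, b uint32
	n    int // block size
}

func newRolling(block []byte) rolling {
	r := rolling{n: len(block)}
	for _, c := range block {
		r.a = (r.a + uint32(c)) % mod
		r.b = (r.b + r.a) % mod
	}
	return r
}

// roll removes `out` (the byte leaving the block) and appends `in`,
// updating the checksum in constant time.
func (r *rolling) roll(out, in byte) {
	r.a = (r.a + mod - uint32(out) + uint32(in)) % mod
	r.b = (r.b + mod - uint32(r.n)*uint32(out)%mod + r.a) % mod
}

func (r rolling) sum() uint32 { return r.a | r.b<<16 }

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog")
	n := 8
	r := newRolling(data[:n])
	for i := n; i < len(data); i++ {
		r.roll(data[i-n], data[i])
		// The O(1) slide must agree with recomputing the window from scratch.
		if r.sum() != newRolling(data[i-n+1:i+1]).sum() {
			fmt.Println("mismatch at", i)
			return
		}
	}
	fmt.Println("rolling checksum matches recomputation at every offset")
}
```

In real rsync, weak-checksum hits are confirmed with a strong hash before a block is treated as a duplicate; the point here is only the constant-time slide.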