| Venue | Category |
| --- | --- |
| FAST'12 | Workload Analysis |
# Characteristics of Backup Workloads in Production Systems

1. Summary
   - Motivation of this paper
   - Backup analysis
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
Data backups are used to protect primary data, yet while primary storage workloads have been studied extensively, there has been little in the way of corresponding studies for backup systems. Meanwhile, backup filesystems have had to scale their throughput to meet storage growth.
- autosupport reports (storage usage, compression, file counts and ages, caching statistics, and other metrics): limited in detail but wide in deployment.
- chunk-level metadata (chunk hash identifiers, sizes, and locations): great in detail but limited in deployment.
- sub-chunk information: can be used to investigate deduplication rates at chunk sizes smaller than the default 8KB.
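A minimal sketch of how per-chunk fingerprints let deduplication rates be compared across chunk sizes. This uses simple fixed-size chunking over a synthetic stream; the function name and data are illustrative, not the paper's actual tooling.

```python
import hashlib

def dedup_ratio(data: bytes, chunk_size: int) -> float:
    """Logical bytes divided by post-dedup physical bytes for fixed-size chunks."""
    seen = {}       # fingerprint -> stored chunk size
    logical = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        logical += len(chunk)
        # Only the first occurrence of a fingerprint is stored physically.
        seen.setdefault(hashlib.sha1(chunk).digest(), len(chunk))
    return logical / sum(seen.values())

# A highly repetitive backup-like stream: the same 8KB block written 100 times.
data = (b"x" * 8192) * 100
print(dedup_ratio(data, 8192))  # → 100.0
print(dedup_ratio(data, 4096))  # → 200.0 (sub-chunk view finds more duplicates here)
```

On real traces the gain from smaller chunks is far more modest, which is why the metadata overhead discussed below matters.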
Distributions are reported as both a histogram (probability distribution) and a cumulative distribution function (CDF).
Backup software combines individual files from the primary storage system into "tar-like" collections; these larger files reduce the likelihood of whole-file deduplication but increase stream locality within the system.
short retention periods lead to higher data churn.
The aggregate metadata overhead scales inversely with chunk size. The model assumes a small fixed cost of about 30 bytes (a fingerprint, the chunk length, and a small overhead for other metadata) charged per physical chunk stored in the system and per logical chunk in a file recipe. A factor f reports the reduction in deduplication effectiveness: the metadata size divided by the average chunk size.
With logical size L, post-dedup physical size P, and overhead factor f, deduplication without metadata costs is D = L / P, while the real deduplication including metadata costs is roughly D_meta = L / (P + f(L + P)) = D / (1 + f(1 + D)), since metadata is charged per logical chunk (file recipes) and per physical chunk.
Sometimes the improvement in deduplication at a smaller chunk size sufficiently compensates for the added per-chunk metadata.
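The trade-off can be sketched numerically. The formula below is a reconstruction from the definitions above (roughly 30 bytes of metadata charged per logical and per physical chunk), and the raw deduplication numbers are hypothetical:

```python
def effective_dedup(d: float, metadata_bytes: int, chunk_size: int) -> float:
    """Deduplication ratio after charging per-chunk metadata.

    Assumes metadata_bytes (~30B: fingerprint, chunk length, misc) is paid
    once per physical chunk and once per logical chunk in a file recipe,
    so the overhead factor is f = metadata_bytes / chunk_size and
    D_meta = D / (1 + f * (1 + D)).
    """
    f = metadata_bytes / chunk_size
    return d / (1 + f * (1 + d))

# Hypothetical example: raw dedup improves from 10x at 8KB to 12x at 4KB.
print(effective_dedup(10.0, 30, 8192))  # ≈ 9.61
print(effective_dedup(12.0, 30, 4096))  # ≈ 10.96 — the smaller chunk still wins here
```

If the raw improvement had been smaller (say 10x to 10.3x), the doubled metadata cost at 4KB could erase it.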
- Writes: using stream locality hints achieves good deduplication cache hit rates; stream locality is captured at the container level.
- Reads: caching provides fast restores of data during disaster recovery.
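A toy sketch of exploiting container-level stream locality on the write path. All structures here (the cache class, the in-memory index shape) are hypothetical simplifications, not the production design: the bet is that once one fingerprint from a container is seen, the backup stream will reference its neighbors next.

```python
from collections import OrderedDict

class ContainerHintCache:
    """LRU fingerprint cache that prefetches a whole container on an index hit."""

    def __init__(self, capacity: int):
        self.capacity = capacity    # max cached fingerprints
        self.cache = OrderedDict()  # fingerprint -> container id, in LRU order

    def lookup(self, fp, disk_index, containers) -> bool:
        if fp in self.cache:                  # cache hit: no disk access needed
            self.cache.move_to_end(fp)
            return True
        cid = disk_index.get(fp)              # expensive on-disk index lookup
        if cid is None:
            return False                      # new chunk: must be stored
        for neighbor in containers[cid]:      # prefetch the container's fingerprints
            self.cache[neighbor] = cid
            self.cache.move_to_end(neighbor)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least-recently used
        return True

# A repeated backup stream: one disk lookup per container, then cache hits.
containers = {0: ["a", "b", "c"], 1: ["d", "e"]}
disk_index = {fp: cid for cid, fps in containers.items() for fp in fps}
cache = ContainerHintCache(capacity=100)
hits = [cache.lookup(fp, disk_index, containers) for fp in "abcde"]
print(hits)  # → [True, True, True, True, True]
```

The same container-granularity layout is what makes cached restores fast: sequential reads pull in whole containers whose chunks the stream is about to need.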
The analysis shows that backup workloads tend to have shorter-lived and larger files than primary storage.
Deduplication at larger chunk sizes is extrapolated from traces collected at a single chunk size.
depends on the applications that generate them
Backup: individual files are typically combined into large units ("tar" files).
Incremental backups: contain modified files, with large portions in common with earlier versions.
Full backups: likely to have many of their constituent files completely unmodified.
- Windows 2000: whole-file deduplication
- Venti: fixed-block deduplication
- LBFS: variable-sized chunk deduplication
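To illustrate why variable-sized (content-defined) chunking in the LBFS style resists insertions where fixed-block chunking does not, here is a toy sketch. A summed window stands in for the Rabin fingerprint a real system would use, and all parameters are illustrative:

```python
def variable_chunks(data: bytes, win: int = 16, mask: int = 0x1F) -> list[bytes]:
    """Cut a chunk boundary wherever a rolling value over the last `win`
    bytes matches the mask, so boundaries depend on content, not offset."""
    chunks, start = [], 0
    for i in range(len(data)):
        # Rolling sum of the trailing window (stand-in for a Rabin fingerprint).
        window = sum(data[max(0, i - win + 1):i + 1])
        if (window & mask) == mask:           # content-defined boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Prepending one byte shifts every fixed-block boundary, but content-defined
# boundaries resynchronize once the window passes the edit.
orig = variable_chunks(b"some example payload " * 50)
shifted = variable_chunks(b"X" + b"some example payload " * 50)
print(len(set(orig) & set(shifted)) > 0)  # → True: most chunks are shared
```

With fixed 8KB blocks the same one-byte insertion would change every subsequent block's fingerprint, destroying deduplication against the previous backup.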
New unique chunks are aggregated into "compression regions", which are compressed as a single unit (approximately 128KB before compression).
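A minimal sketch of packing unique chunks into ~128KB compression regions. zlib stands in for the system's actual compressor, and the region size and data are assumptions for illustration:

```python
import zlib

def pack_compression_regions(chunks, region_size=128 * 1024):
    """Aggregate chunks into ~region_size units, each compressed as a whole.

    Compressing a large region as one unit lets the compressor exploit
    redundancy across chunk boundaries, instead of paying per-chunk overhead.
    """
    regions, current = [], b""
    for chunk in chunks:
        current += chunk
        if len(current) >= region_size:       # region full: compress and flush
            regions.append(zlib.compress(current))
            current = b""
    if current:                               # compress any trailing partial region
        regions.append(zlib.compress(current))
    return regions

chunks = [bytes([i % 7]) * 8192 for i in range(64)]  # 64 x 8KB of synthetic chunks
regions = pack_compression_regions(chunks)
restored = b"".join(zlib.decompress(r) for r in regions)
print(len(regions), restored == b"".join(chunks))  # → 4 True
```

The downside is that restoring any single chunk requires decompressing its whole region, which is another reason container/region layout and read caching interact.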