| Venue | Category |
| --- | --- |
| SOSP'19 | Ceph |
# File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution
## 1. Summary

### Motivation of this paper

- developing a zero-overhead transaction mechanism in file systems is challenging
- metadata performance at the local level affects performance at the distributed level
- support for novel, backward-incompatible storage hardware (e.g., SMR drives, ZNS SSDs)
  - these devices expose a backward-incompatible zone interface
### BlueStore

- from FileStore to BlueStore
  - FileStore stored object data as files and ran RocksDB on top of a journaling file system
  - BlueStore instead manages raw disks directly in user space
  - several other design changes came with this move
- architecture: metadata lives in RocksDB (running on BlueFS), object data goes directly to the raw block device
- BlueFS and RocksDB
  - fast metadata operations
  - no consistency overhead for object writes
  - BlueFS implements just the basic "system calls" that RocksDB requires (rough sketch below)
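The point is that a log-structured KV store needs only a handful of file operations, so the supporting file system can stay tiny. Below is a purely illustrative Python sketch under that assumption; the names are hypothetical and this is not the real BlueFS or RocksDB environment interface.

```python
# Illustrative sketch of the small, append-oriented file interface a
# log-structured KV store needs; names are hypothetical, not the real BlueFS API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FsFile:
    name: str
    data: bytearray = field(default_factory=bytearray)


class MiniBlueFS:
    """User-space FS: all metadata in memory, journaled on sync (journal is a stub here)."""

    def __init__(self) -> None:
        self.files: Dict[str, FsFile] = {}
        self.journal: List[str] = []

    def create(self, name: str) -> FsFile:
        self.journal.append(f"create {name}")
        return self.files.setdefault(name, FsFile(name))

    def append(self, name: str, buf: bytes) -> None:
        self.files[name].data += buf              # SSTs and the WAL are written sequentially

    def read(self, name: str, off: int, length: int) -> bytes:
        return bytes(self.files[name].data[off:off + length])

    def fsync(self, name: str) -> None:
        # A real implementation would flush dirty extents and commit the
        # metadata journal to the raw device here.
        self.journal.append(f"fsync {name} size={len(self.files[name].data)}")

    def rename(self, old: str, new: str) -> None:
        self.files[new] = self.files.pop(old)
        self.journal.append(f"rename {old} -> {new}")

    def listdir(self) -> List[str]:
        return sorted(self.files)
```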
- data path and space allocation
  - copy-on-write clone operation
  - no journaling double-writes: large writes go to newly allocated space before the metadata commit, small overwrites are deferred through the RocksDB WAL (sketch below)
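A rough sketch of that write-path split, under stated assumptions: the 64 KiB threshold and the `kv_txn`/`allocate`/`device_write` helpers are invented for illustration, not BlueStore's actual code.

```python
# Large writes: data first goes to space holding no live data, then the
# metadata transaction commits, so object data is never written twice.
# Small overwrites: the bytes are embedded in the metadata transaction
# (they ride the RocksDB WAL) and are applied to their final location later.
MIN_ALLOC_SIZE = 64 * 1024   # hypothetical allocation unit

def write_object(kv_txn, allocate, device_write, obj, offset, data):
    if len(data) >= MIN_ALLOC_SIZE:
        extent = allocate(len(data))                 # fresh space, old data stays intact
        device_write(extent, data)
        kv_txn.put(("extent", obj, offset), extent)  # metadata commit comes last
    else:
        kv_txn.put(("deferred", obj, offset), data)  # deferred write inside the KV WAL
    kv_txn.commit()                                  # single point of atomicity
```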
- cache: since BlueStore bypasses the kernel page cache, it implements its own write-through cache in user space
- checksums: BlueStore computes a checksum for every write and verifies it on every read (crc32c by default); a sketch follows
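A minimal sketch of per-block checksumming; the 4 KiB granularity is an assumption, and `zlib.crc32` stands in for crc32c only to keep the example dependency-free.

```python
import zlib

BLOCK = 4096  # assumed checksum granularity

def block_checksums(data: bytes) -> list[int]:
    """Compute one checksum per block at write time and store them as metadata."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verified_read(data: bytes, stored: list[int]) -> bytes:
    """Recompute on read and fail loudly instead of returning corrupted data."""
    if block_checksums(data) != stored:
        raise IOError("checksum mismatch: data corrupted")
    return data
```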
- overwrite of erasure-coded data
  - performs overwrites in EC pools using two-phase commit (rough sketch below)
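A toy two-phase-commit sketch of the idea: each shard keeps a cheap clone of the old chunk so the update can be rolled back, and the new data becomes visible only after every shard has acknowledged. The shard methods are hypothetical.

```python
# Phase 1: every shard durably stages its new chunk while keeping a clone of
# the old one. Phase 2: once all shards have acknowledged, the primary commits.
def ec_overwrite(shards, new_chunks):
    staged = []
    for shard, chunk in zip(shards, new_chunks):
        shard.clone_old_chunk()      # cheap copy-on-write clone of the current chunk
        shard.stage(chunk)           # durable but not yet visible
        staged.append(shard)
    if all(shard.acked() for shard in staged):
        for shard in staged:
            shard.commit()           # expose the new chunk, drop the clone
    else:
        for shard in staged:
            shard.rollback()         # restore from the clone
```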
- transparent compression: data can be compressed before it is written to disk (sketch below)
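One plausible policy for transparent compression, sketched with zlib; the ratio threshold is invented, and the real backend picks its compressor and granularity from configuration.

```python
import zlib

def maybe_compress(data: bytes, max_ratio: float = 0.875) -> tuple[bytes, bool]:
    """Compress on write; keep the compressed copy only if it actually saves space."""
    compressed = zlib.compress(data)
    if len(compressed) <= len(data) * max_ratio:
        return compressed, True      # stored compressed, decompressed transparently on read
    return data, False               # not worth it: store as-is
```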
- exploring new interfaces
### Implementation and Evaluation

- experiments compare BlueStore against FileStore and measure the effect of the individual design changes
## 2. Strength (Contributions of the paper)

- outlines the main reasons behind Ceph's decision to develop BlueStore, a user-space storage backend deployed directly on raw storage devices:
  - efficient transactions
  - fast metadata operations
  - the ability to adopt SMR drives and ZNS SSDs
- introduces the design of BlueStore, the challenges its design overcomes, and opportunities for future improvements
- presents several experiments that evaluate the improvements brought by the design changes
## 3. Weakness (Limitations of the paper)

- remaining challenges of running on raw storage are discussed but not solved:
  - cache sizing and writeback
  - key-value store efficiency
    - embedding RocksDB in BlueStore is problematic in multiple ways
  - CPU and memory efficiency
## 4. Some Insights

- the storage backend plays a key role in the performance of the overall distributed system
  - storage nodes receive I/O requests over the network and serve them from locally attached storage devices through the storage backend software
- transaction support in the storage backend
  - most file systems implement the POSIX standard, which lacks a transaction concept
  - backends built on file systems therefore end up emulating transactions with inefficient or complex mechanisms (see the example below)
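As a concrete illustration (Linux-only, with a hypothetical object layout): a single logical object update fans out into several POSIX calls, each atomic on its own, with no way to ask the kernel to apply them as one unit.

```python
import os

def update_object_posix(path: str, data: bytes, version: bytes) -> None:
    # Three separate durability points; a crash between any two of them leaves
    # the object half-updated (e.g., new data with a stale version xattr).
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())                      # 1) data durable
    os.setxattr(path, "user.version", version)    # 2) metadata update, not atomic with 1
    dirfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
    try:
        os.fsync(dirfd)                           # 3) directory entry durable
    finally:
        os.close(dirfd)
```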
- Ceph distributed storage system architecture
  - librados: a library that provides a transactional interface for manipulating objects and object collections in RADOS
  - logical partitions (pools): provide redundancy for the contained objects through either replication or erasure coding
  - CRUSH: a hash-based placement algorithm that forms an indirection layer between clients and OSDs
  - a separate Ceph OSD daemon runs per local storage device
    - each OSD processes I/O requests from librados clients and cooperates with peer OSDs to replicate or erasure-code updates
  - ObjectStore interface (sketched after this list)
    - provides abstractions for objects, object collections, a set of primitives to inspect data, and transactions to update data
    - each OSD may use a different backend implementation of the ObjectStore interface
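A toy Python sketch of what the ObjectStore contract looks like from a backend's perspective; the method names are illustrative, not the real C++ interface, but the key idea is that a transaction is a batch of data, attribute, and omap mutations that must apply atomically.

```python
class Transaction:
    """A batch of mutations that a backend must apply atomically, all or nothing."""

    def __init__(self):
        self.ops = []

    def write(self, collection, obj, offset, data):
        self.ops.append(("write", collection, obj, offset, data))

    def setattr(self, collection, obj, key, value):
        self.ops.append(("setattr", collection, obj, key, value))

    def omap_set(self, collection, obj, kv):
        self.ops.append(("omap_set", collection, obj, kv))


class ObjectStore:
    """Contract implemented by backends such as FileStore or BlueStore."""

    def queue_transaction(self, txn: Transaction) -> None:
        raise NotImplementedError   # apply txn.ops atomically and durably
```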
- providing transactions in a storage backend running on top of a file system; three approaches were tried (see the sketch after this list):
  - hooking into a file system's internal (but limited) transaction mechanism
  - implementing a WAL in user space
    - slow read-modify-write: the WAL works in three steps (serialize the transaction and append it to the log, call `fsync` to commit it to disk, then apply the operations to the file system), so every read-modify-write operation incurred the full latency of a WAL commit
    - non-idempotent operations make it unsafe to simply replay the WAL after a crash
    - double writes: data is written twice, once to the WAL and once to its final location; this cost is also why most file systems log only metadata changes, allowing data loss after a crash
  - using a key-value database with transactions as a WAL
    - the metadata was stored in RocksDB, while the object data were still represented as files in a file system
    - this introduces high consistency overhead that stems from running atop a journaling file system
      - similar to the journaling-of-journal problem
      - on a raw device, an `fsync` issues one expensive FLUSH CACHE command to the disk; with a journaling file system, each `fsync` issues two flush commands (one after writing the data, another when the file system commits its own journal)
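A back-of-the-envelope sketch of why this stacking hurts: with a user-space WAL every committed transaction waits for at least one `fsync`, and on a journaling file system each of those `fsync`s costs two device cache flushes instead of one. The counting below is illustrative only.

```python
# Count expensive FLUSH CACHE commands needed to commit a stream of
# transactions through a user-space WAL.
def flushes_for(n_txns: int, flushes_per_fsync: int) -> int:
    flushes = 0
    for _ in range(n_txns):
        # ... serialize the transaction and append it to the WAL ...
        flushes += flushes_per_fsync   # fsync before acknowledging the client
    return flushes

print(flushes_for(1000, flushes_per_fsync=1))  # raw device:    1000 flushes
print(flushes_for(1000, flushes_per_fsync=2))  # journaling FS: 2000 flushes
```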