| Venue | Category |
| --- | --- |
| FAST'22 | Post-deduplication functionality |
# DedupSearch: Two-Phase Deduplication Aware Keyword Search

1. Summary
   - Motivation of this paper
   - DedupSearch
   - Implementation and Evaluation
2. Strength (Contributions of the paper)
3. Weakness (Limitations of the paper)
4. Some Insights (Future work)
## 1. Summary

### Motivation of this paper
- In deduplicated storage, each physical chunk is shared: multiple logical pointers reference it, from different files and even different users.
- This paper aims to address the keyword-search problem in deduplicated storage.
- The main goal
- Why other approaches cannot work
  - Naive approaches
### DedupSearch

Main idea
- Begin with a physical phase that sequentially scans every physical chunk in the storage system for the keywords.
- Then run a logical phase that scans the file system, traversing the chunk pointers (file recipes) that make up each file.
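The two phases above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real system scans on-disk containers, a real string matcher replaces the naive `kw in data` check, and all names here are assumptions.

```python
def physical_phase(chunk_store, keywords):
    """Phase 1: scan every unique physical chunk exactly once.
    Returns a map from chunk fingerprint to the keywords found in it."""
    results = {}
    for fingerprint, data in chunk_store.items():
        hits = [kw for kw in keywords if kw in data]  # stand-in matcher
        if hits:
            results[fingerprint] = hits
    return results

def logical_phase(file_recipes, chunk_results):
    """Phase 2: traverse each file's recipe (its ordered list of chunk
    fingerprints) and collect the matches recorded in phase 1."""
    matches = {}
    for filename, recipe in file_recipes.items():
        found = {kw for fp in recipe for kw in chunk_results.get(fp, [])}
        if found:
            matches[filename] = sorted(found)
    return matches

# Toy example: two files sharing one chunk store.
chunk_store = {"fp1": "hello dedup world", "fp2": "nothing here"}
file_recipes = {"a.txt": ["fp1", "fp2"], "b.txt": ["fp2"]}
hits = logical_phase(file_recipes, physical_phase(chunk_store, ["dedup"]))
```

The key property this preserves is that each physical chunk is matched once, no matter how many files reference it.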
Challenges
- Most deduplication systems do not maintain "back pointers" from chunks to the files that contain them (addressed by the logical phase).
- Keywords might be split between adjacent chunks of a file (addressed by recording partial matches at chunk boundaries).
String-matching algorithm
- Uses the Aho-Corasick string-matching algorithm.
- Constructs a trie for the reversed dictionary to identify keyword suffixes at the beginning of a chunk.
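A minimal Aho-Corasick sketch for the within-chunk matching step (my own compact implementation, not the paper's code; the paper additionally instantiates a trie over reversed keywords, which this class could approximate via `AhoCorasick([kw[::-1] for kw in keywords])`):

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: matches many keywords in one pass."""

    def __init__(self, keywords):
        self.goto = [{}]   # outgoing trie edges per node
        self.out = [[]]    # keywords ending at each node
        self.fail = [0]    # failure links
        for kw in keywords:
            node = 0
            for ch in kw:
                nxt = self.goto[node].get(ch)
                if nxt is None:
                    self.goto.append({})
                    self.out.append([])
                    self.fail.append(0)
                    nxt = len(self.goto) - 1
                    self.goto[node][ch] = nxt
                node = nxt
            self.out[node].append(kw)
        # BFS over the trie to compute failure links.
        queue = deque(self.goto[0].values())
        while queue:
            u = queue.popleft()
            for ch, v in self.goto[u].items():
                queue.append(v)
                f = self.fail[u]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[v] = self.goto[f].get(ch, 0)
                self.out[v] = self.out[v] + self.out[self.fail[v]]

    def scan(self, text):
        """Return (end_index, keyword) pairs for all matches in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for kw in self.out[node]:
                hits.append((i, kw))
        return hits
```

The automaton visits each chunk byte once regardless of dictionary size, which is why it suits a full physical scan.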
Match-result database
- Exact matches
- Tiny substrings
  - Keywords that begin or end with frequent letters of the alphabet might result in the allocation of numerous chunk-result records.
  - Tiny-result records
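One way to picture the per-chunk records and the boundary bytes they must retain (field and function names are my own illustration, not the paper's schema):

```python
from dataclasses import dataclass, field

@dataclass
class ChunkResult:
    """Match record stored per chunk fingerprint (illustrative fields)."""
    exact: list = field(default_factory=list)  # keywords fully inside the chunk
    head: str = ""  # first bytes: may complete a keyword split from the previous chunk
    tail: str = ""  # last bytes: may start a keyword continued in the next chunk

def boundary_snippets(chunk, max_kw_len):
    """Keep only the bytes that could participate in a cross-chunk match:
    up to max_kw_len - 1 characters at each end of the chunk."""
    k = max(max_kw_len - 1, 0)
    return (chunk[:k], chunk[-k:] if k else "")
```

A partial match as short as one frequent letter at a chunk edge would create records for a huge number of chunks; keeping such entries in a compact tiny-result form is what bounds the database size.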
Database organization

Generation of full search results
- For each file in the system, the file recipe is read, and the fingerprints of its chunks are used to look up result records in the database.
- The logical phase can be parallelized to some extent.
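The per-file lookup can be sketched as follows. This is a simplification under my own assumptions: `records` maps a fingerprint to `(exact_keywords, head_snippet, tail_snippet)` as the physical phase might have produced, and boundary-spanning matches are re-checked on the stored snippets.

```python
def stitch_matches(recipe, records, keywords):
    """Logical phase for one file: gather per-chunk exact matches and
    detect keywords that straddle two consecutive chunks in the recipe."""
    found = set()
    prev_tail = ""
    for fp in recipe:
        exact, head, tail = records.get(fp, ([], "", ""))
        found.update(exact)
        joint = prev_tail + head  # bytes around the chunk boundary
        for kw in keywords:
            pos = joint.find(kw)
            while pos != -1:
                # count only matches that actually cross the boundary
                if pos < len(prev_tail) and pos + len(kw) > len(prev_tail):
                    found.add(kw)
                pos = joint.find(kw, pos + 1)
        prev_tail = tail
    return found

# Toy example: "dedup" is split as "...ded" | "up s..." across two chunks.
records = {"fp1": (["alpha"], "", "ded"), "fp2": ([], "up s", "")}
result = stitch_matches(["fp1", "fp2"], records, ["dedup"])
```

Because each file's recipe is processed independently, files can be distributed across workers, which is one way the logical phase parallelizes.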
### Implementation and Evaluation

- Implementation
- Evaluation
  - Traces
  - Experiments
    - DedupSearch performance
    - DedupSearch data structures
    - DedupSearch overheads
## 2. Strength (Contributions of the paper)

- Very strong experiments.
- Addresses the string-search issue from the deduplication aspect (a new direction).
## 3. Weakness (Limitations of the paper)

- The scenario is limited.
- The main idea is very similar to those of DeduplicationGC (FAST'17) and GoSeed (FAST'20).
- Lacks support for wildcards: since the prefix/suffix approach already incurs high overhead, supporting wildcards would be even more challenging.
## 4. Some Insights (Future work)

- The concept of near-storage processing.
- The restore process that it considers.