CRSS publication: Online De-duplication in a Log-Structured File System for Primary Storage

Online De-duplication in a Log-Structured File System for Primary Storage

Published as Storage Systems Research Center Technical Report UCSC-SSRC-11-03.

Abstract

Data de-duplication refers to any algorithm or technique that eliminates duplicate copies of data from a storage system. It is commonly performed on secondary storage systems such as archival and backup storage. De-duplication techniques fall into two major categories based on when they de-duplicate data: offline and online. In an offline de-duplication scenario, file data is written to disk first and de-duplication happens at a later time. In an online de-duplication scenario, duplicate file data is eliminated before being written to disk.

While data de-duplication can maximize storage utilization, the benefit comes at a cost. After data de-duplication is performed, a file written to disk sequentially could appear to have been written randomly. This fragmentation of file data can decrease read performance because more disk seeks are needed to read the file data back. Additional time delays in a storage system’s write path and poor read performance prevent online de-duplication from being commonly applied to primary storage systems.

The goal of this work is to maximize the amount of data read per seek with the smallest possible impact on de-duplication. To achieve this goal, I propose the use of sequences. A sequence is defined as a group of consecutive file data blocks in an incoming file system write request. A sequence is considered a duplicate if its blocks are found in the same consecutive order on disk. By using sequences, de-duplicated file data will not be fragmented over the disk. This will allow a de-duplicated storage system to have disk read performance similar to that of a system without de-duplication. I offer the design and analysis of three algorithms used to perform sequence-based de-duplication. I show through the use of different data sets that it is possible to perform sequence-based de-duplication on archival data, static primary data and dynamic primary data. Finally, I present a full-scale implementation using one of the three algorithms and report the algorithm’s impact on de-duplication and disk seek activity.
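The definition of a duplicate sequence given in the abstract can be illustrated with a minimal sketch. This is not one of the three algorithms from the report, which the abstract does not describe. It is a simple greedy matcher written under several assumptions: the on-disk log is modeled as an in-memory list of block fingerprints, SHA-1 is used as the fingerprint, only the first on-disk copy of each block is indexed, and a run must reach a hypothetical threshold `min_seq_len` to count as a duplicate sequence. The names `SequenceDedupStore`, `fingerprint`, and `min_seq_len` are illustrative only.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-1 as the block fingerprint; the abstract names no hash.
    return hashlib.sha1(block).hexdigest()


class SequenceDedupStore:
    """Toy model of an append-only log with sequence-based de-duplication."""

    def __init__(self, min_seq_len: int = 2):
        self.log = []      # fingerprint stored at each disk block address
        self.index = {}    # fingerprint -> address of its first copy (simplification)
        self.min_seq_len = min_seq_len  # hypothetical minimum run length

    def _append(self, fp: str) -> int:
        # Write a new block at the tail of the log and return its address.
        addr = len(self.log)
        self.log.append(fp)
        self.index.setdefault(fp, addr)
        return addr

    def write(self, blocks):
        """Store one write request; return the disk address used for each block."""
        fps = [fingerprint(b) for b in blocks]
        addrs = []
        i = 0
        while i < len(fps):
            start = self.index.get(fps[i])
            run = 0
            if start is not None:
                # Count how many incoming blocks match the blocks on disk
                # in the same consecutive order, starting at `start`.
                while (i + run < len(fps)
                       and start + run < len(self.log)
                       and self.log[start + run] == fps[i + run]):
                    run += 1
            if run >= self.min_seq_len:
                # Long enough: reference the existing contiguous run.
                addrs.extend(range(start, start + run))
                i += run
            else:
                # Too short: write the block again so the data stays contiguous.
                addrs.append(self._append(fps[i]))
                i += 1
        return addrs
```

In this sketch, writing blocks a, b, c and then b, c, d maps the second request to addresses 1, 2, 3: the run b, c is shared and only d is appended. A later request c, a matches only single blocks, so both are written again at the tail of the log. That trades some de-duplication for contiguous reads, which is the trade-off the abstract describes.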

Publication date:
May 2011

Authors:
Stephanie Jones

Projects:
Deduplication
Deduplication Optimization

Available media

Full paper text: PDF

BibTeX entry

@techreport{jones-ssrctr-11-03,
  author       = {Stephanie Jones},
  title        = {Online De-duplication in a Log-Structured File System for Primary Storage},
  institution  = {University of California, Santa Cruz},
  number       = {UCSC-SSRC-11-03},
  month        = may,
  year         = {2011},
}

Last modified 24 May 2019