Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup

Appeared in Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009).

Abstract

Data deduplication is an essential and critical component of backup systems: essential because it reduces storage space requirements, and critical because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, and existing deduplication techniques require this locality to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for backup requests made up of individual files, with no locality among consecutive files in a given window of time. Because of this lack of locality, existing techniques perform poorly. Extreme Binning exploits file similarity instead of locality and makes only one disk access per file to maintain throughput. The backup system scales gracefully with the data; more backup nodes can be added easily to boost throughput. In such a multi-node backup system, a stateless routing algorithm allocates every file to exactly one node, allowing for maximum parallelization. Each backup node is autonomous, with no dependency across nodes, which makes data management tasks robust and low-overhead.
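
As a rough illustration of the stateless routing idea described in the abstract, the following Python sketch derives a target node from a file's content alone. It is not the paper's implementation. It assumes fixed-size chunking, SHA-1 chunk digests, the minimum chunk digest as the file's representative, and a simple modulo mapping to nodes; the abstract specifies none of these, and the names `CHUNK_SIZE`, `chunk_ids`, `representative_id`, and `route_to_node` are hypothetical.

```python
import hashlib

# Hypothetical fixed chunk size (an assumption; real systems often use
# content-defined chunking instead).
CHUNK_SIZE = 4096


def chunk_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks and return each chunk's SHA-1 digest."""
    if not data:
        # Treat an empty file as a single empty chunk so it still has an ID.
        return [hashlib.sha1(b"").digest()]
    return [
        hashlib.sha1(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def representative_id(data: bytes) -> bytes:
    """Pick the minimum chunk digest as the file's representative (assumed rule)."""
    return min(chunk_ids(data))


def route_to_node(data: bytes, num_nodes: int) -> int:
    """Stateless routing: the node index depends only on the file's content."""
    rep = representative_id(data)
    return int.from_bytes(rep, "big") % num_nodes
```

Under these assumptions, two files that share the chunk with the smallest digest are sent to the same node, so similar files tend to be deduplicated against each other without any node consulting shared state.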

Publication date:
September 2009

Authors:
Deepavali Bhagwat
Kave Eshghi
Darrell D. E. Long
Mark Lillibridge

Projects:
Deduplication

Available media

Full paper text: PDF

BibTeX entry

@inproceedings{bhagwat-mascots09,
  author       = {Deepavali Bhagwat and Kave Eshghi and Darrell D. E. Long and Mark Lillibridge},
  title        = {Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup},
  booktitle    = {Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009)},
  month        = sep,
  year         = {2009},
}
Last modified 28 May 2019