Deduplication Optimization @ SSRC

1. Frequency-based Chunking for Deduplication

Chunking-based data deduplication (dedupe) has become a prevalent technique for supporting many data-driven applications, since it effectively reduces the data footprint on disk and lets the same physical capacity accommodate a much larger volume of data. Content-based chunking, a stateless chunking algorithm, partitions a long byte stream into a sequence of smaller data chunks and removes the duplicates. However, because of its inherent randomness, content-based chunking can suffer from high performance variability and offers no performance guarantee. It also does not consider how frequently data chunks appear when it partitions the stream, even though frequent data chunks have a far-reaching impact on dedupe performance. Intuitively, if a data chunk occurs k times in the byte stream, its degree of redundancy is high and k-1 copies of the chunk can be eliminated. If a data chunk appears only once in the byte stream, no space saving is obtained.
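
The baseline described above can be illustrated with a minimal Python sketch of content-based chunking followed by duplicate removal. It is not the project's implementation: the polynomial rolling hash, the constants BASE and MOD, the window, mask, and size limits, and the function names content_defined_chunks and dedupe_stats are all assumptions chosen for illustration. The second function applies the k-1 intuition directly: a chunk seen k times is stored once and k-1 copies are eliminated.

```python
import hashlib

BASE = 257            # polynomial base for the rolling hash (assumed value)
MOD = (1 << 61) - 1   # large prime modulus (assumed value)


def content_defined_chunks(data, window=16, mask=0x3F, min_size=16, max_size=1024):
    """Cut `data` where the rolling hash of the last `window` bytes has all masked bits zero."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, window - 1, MOD)  # weight of the oldest byte in the window
    for i in range(len(data)):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # slide: drop the oldest byte
        h = (h * BASE + data[i]) % MOD              # slide: add the newest byte
        size = i + 1 - start
        # Boundary: hash pattern matched (content-defined) or chunk hit the size cap.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def dedupe_stats(chunks):
    """Return (total_bytes, stored_bytes, eliminated_copies) for a chunk sequence."""
    seen = {}  # fingerprint -> [chunk size, occurrence count k]
    for c in chunks:
        entry = seen.setdefault(hashlib.sha256(c).digest(), [len(c), 0])
        entry[1] += 1
    total = sum(size * k for size, k in seen.values())
    stored = sum(size for size, _ in seen.values())
    eliminated = sum(k - 1 for _, k in seen.values())  # k-1 copies per chunk
    return total, stored, eliminated
```

Calling dedupe_stats(content_defined_chunks(stream)) on a byte string gives the space saving as total minus stored bytes. Because boundaries depend only on a hash of local content, the chunker never looks at how often a chunk recurs, which is the limitation the frequency-based approach targets.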

2. Data Characterization Effects on Deduplication

Data deduplication is a data-dependent process whose performance metrics are determined by the input data as well as by the algorithms and techniques used. While algorithmic complexity and technical overheads can be quantified, it has so far been impossible to quantify how much the data content itself affects deduplication performance. This study statistically analyzes how different data sets affect deduplication metrics such as compression, read/write throughput, and deletion overhead. Through this analysis we hope to characterize data by its effect on the metrics of interest. Based on these statistics, we hope to provide the data deduplication community with a standardized set of workloads for system evaluation.

Status

1. Frequency-based Chunking

We proposed a novel chunking algorithm called Frequency-based Chunking, which partitions the byte stream so that the resulting chunks have relatively high occurrence frequencies. In extensive experiments comparing our scheme against an existing content-based chunking scheme, our approach achieves significantly better results with respect to space saving, the number of distinct chunks, and the average chunk size. We are continuing to investigate issues such as running-time performance and scalability for various data sets.

2. Data Characterization

So far we have been able to quantify how characteristics of the original file structure, such as file size and text versus binary content, affect the compression and throughput of the data deduplication process. We are currently testing how various backup policies affect this process. We were able to show that the amount of change from backup to backup is not the major characteristic of the data when it comes to throughput or the deletion overhead of the system. Both the locality of the data and the hot/cold characteristics of the data segments must be considered. To this end, we have applied a machine learning technique to see whether some characteristics of the data can be learned and used to predict future patterns. On the single test set we have, it has shown significant improvement over previous approaches that consider only the amount of change.
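
To make the distinction between amount of change and locality concrete, the following Python sketch computes both for a pair of backups represented as ordered lists of chunk fingerprints. These are hypothetical measures chosen for illustration, not the metrics or the machine learning technique used in the study: change_rate and duplicate_runs are assumed names, and using run lengths of already-stored chunks as a locality proxy is an assumption.

```python
def change_rate(prev_fps, curr_fps):
    """Fraction of chunks in the current backup that were absent from the previous one."""
    if not curr_fps:
        return 0.0
    prev = set(prev_fps)
    new = sum(1 for fp in curr_fps if fp not in prev)
    return new / len(curr_fps)


def duplicate_runs(prev_fps, curr_fps):
    """Lengths of consecutive runs of chunks already present in the previous backup.

    Long runs suggest duplicates are clustered (high locality); many short
    runs suggest changes are scattered through the stream. This proxy is an
    assumption made for illustration.
    """
    prev = set(prev_fps)
    runs, run = [], 0
    for fp in curr_fps:
        if fp in prev:
            run += 1
        elif run:
            runs.append(run)  # a new chunk ends the current duplicate run
            run = 0
    if run:
        runs.append(run)
    return runs
```

Two backups can share the same change rate yet differ sharply in run lengths, for example one with all new chunks appended at the end and one with new chunks interleaved throughout, which is why a change rate alone may say little about throughput.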

Publications

Date		Publication
Dec 1, 2012		Zhike Zhang, Deepavali Bhagwat, Witold Litwin, Darrell D. E. Long, Thomas Schwarz, Improved Deduplication through Parallel Binning, Proceedings of the 31st IEEE International Performance, Computing and Communications Conference (IPCCC '12), December 2012.
May 11, 2011		Stephanie Jones, Online De-duplication in a Log-Structured File System for Primary Storage, Technical Report UCSC-SSRC-11-03, May 2011.
Sep 1, 2010		Deepavali Bhagwat, Deduplication for Large Scale Backup and Archival Storage, Technical Report UCSC-SSRC-, September 2010.
Oct 27, 2009		Guanlin Lu, David Du, Chunking Based Deduplication Algorithm with Consideration of Data Chunk Frequencies, October 2009.
Oct 1, 1996		Darrell D. E. Long, Jehan-François Pâris, A Leaner, More Efficient, Available Copy Protocol, Proceedings of the Symposium on Parallel and Distributed Processing, October 1996.
Oct 1, 1988		Darrell D. E. Long, Jehan-François Pâris, A Realistic Evaluation of Optimistic Dynamic Voting, Proceedings of the Symposium on Reliable Distributed Systems, October 1988.

Last modified 23 May 2019