2. Data Characterization Effects on Deduplication

Data deduplication is a data-dependent process: its performance metrics are determined by the input data as well as by the algorithms and techniques used. While algorithmic complexity and technical overheads can be quantified, it has so far not been possible to quantify how much the data content itself affects deduplication performance. This study statistically analyzes how different data sets affect deduplication metrics such as compression, read/write throughput, and deletion overhead. In this way we hope to characterize data quantitatively by its effect on the metrics of interest. Based on these statistics, we hope to provide the data deduplication community with a standardized set of workloads that can be used for system evaluation.
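To make the compression metric concrete, the following is a minimal sketch of how a deduplication ratio could be measured for a given input. It assumes fixed-size chunking and SHA-256 fingerprints; the function name `dedup_ratio` and the 4 KiB chunk size are illustrative choices, not the method or parameters used in this study.

```python
import hashlib


def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Return logical size divided by the size of the unique chunks.

    Assumption: fixed-size chunking with SHA-256 fingerprints. Real systems
    often use content-defined chunking instead.
    """
    seen = set()
    unique_bytes = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in seen:
            # First time this content appears: it must be stored.
            seen.add(digest)
            unique_bytes += len(chunk)
    # Empty input stores nothing; report a neutral ratio of 1.0.
    return len(data) / unique_bytes if unique_bytes else 1.0


# Two inputs of identical size but very different content.
repetitive = b"A" * 4096 * 8                             # eight identical chunks
varied = b"".join(bytes([i]) * 4096 for i in range(8))   # eight distinct chunks

print(dedup_ratio(repetitive))  # 8.0
print(dedup_ratio(varied))      # 1.0
```

The two inputs have the same size and are processed by the same algorithm, yet their ratios differ by a factor of eight, which is the kind of data dependence the study aims to quantify.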
So far, we have quantified how characteristics of the original file structure, such as file size and text versus binary content, affect the compression and throughput of the deduplication process. We are currently testing how various backup policies affect this process as well. We have shown that the amount of change from backup to backup is not the main characteristic of the data when it comes to throughput or deletion overhead; both the locality of the data and the hot/cold characteristics of the data segments must also be considered. To this end, we have applied a machine learning technique to see whether some characteristics of the data can be learned and used to predict future patterns. On the single test set we have, it has shown significant improvement over previous approaches that consider only the amount of change.
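As an illustration of why the amount of change alone may be an incomplete descriptor, the sketch below compares two hypothetical backups that modify the same fraction of chunks but with different locality. The helper names (`fingerprints`, `change_profile`, `with_changes`) and the use of contiguous changed runs as a locality proxy are assumptions made for this example; they are not the features or the machine learning technique used in this study.

```python
import hashlib


def fingerprints(data: bytes, chunk_size: int = 4096) -> list:
    """Fingerprint each fixed-size chunk (assumed chunking scheme)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def change_profile(old: bytes, new: bytes, chunk_size: int = 4096):
    """Return (change_rate, changed_runs) for `new` relative to `old`.

    change_rate:  fraction of chunks in `new` not present in `old`.
    changed_runs: number of contiguous runs of changed chunks; fewer runs
                  for the same change rate means higher locality.
    """
    old_set = set(fingerprints(old, chunk_size))
    changed = [fp not in old_set for fp in fingerprints(new, chunk_size)]
    rate = sum(changed) / len(changed) if changed else 0.0
    runs = sum(
        1 for i, c in enumerate(changed) if c and (i == 0 or not changed[i - 1])
    )
    return rate, runs


def with_changes(chunks: list, positions: set) -> bytes:
    """Build a backup image, replacing the chunks at `positions`."""
    return b"".join(
        bytes([200 + p]) * 4096 if p in positions else c
        for p, c in enumerate(chunks)
    )


base_chunks = [bytes([i]) * 4096 for i in range(8)]
base = b"".join(base_chunks)
clustered = with_changes(base_chunks, {2, 3})  # two adjacent chunks modified
scattered = with_changes(base_chunks, {1, 6})  # two distant chunks modified

print(change_profile(base, clustered))  # (0.25, 1)
print(change_profile(base, scattered))  # (0.25, 2)
```

Both backups change 25% of the chunks, but the scattered one splits the new data into twice as many runs. A model that sees only the change rate cannot tell them apart.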