Reliability and Power-Efficiency in Erasure-Coded Storage Systems

Published as Storage Systems Research Center Technical Report UCSC-SSRC-09-08.

Abstract

Data reliability is paramount in modern storage systems. Such reliability is generally provided using erasure codes across storage devices. Until recently, most systems employed mirroring and single parity to tolerate device failures. Recent studies suggest that these techniques are not sufficient going forward. Recent advances in the theory of erasure codes have resulted in an abundance of codes that induce interesting tradeoffs in reliability, space efficiency and performance. In the three parts of this thesis, we study the structural properties of erasure codes and their effects on modern storage systems. We are particularly interested in linear codes with irregular fault tolerance. While such codes offer many benefits over traditional coding techniques, reasoning about the structure of these codes is non-trivial. In the first part of this thesis, we describe our study of the reliability of erasure codes. We have developed a generalized framework for evaluating the reliability of an arbitrary erasure code over a system configuration. In the process of studying the reliability of erasure codes, we found that many traditional modeling techniques do not extend well to multi-disk fault-tolerant systems, irregular codes, latent sector faults and time-dependent event rates. Our framework overcomes these obstacles and allows efficient, apples-to-apples comparison across any class of linear erasure codes. In the second part of this work, we extend the simulation framework to study the reliability of erasure-coded fragment placement in a system with heterogeneous devices. In doing so, we design a metric that quickly orders fragment placements by reliability. The metric is used in conjunction with a brute-force algorithm and a simulated annealing algorithm to efficiently find near-optimal placements. An exploratory study shows the effects of fragment placement on system reliability.
Finally, we study a property we call reconstructability to evaluate the potential power savings in an erasure-coded storage system. Storage accounts for a non-trivial share of the ever-increasing power budget of data centers. Given the various environmental and monetary consequences of power-hungry data centers, energy consumption has joined performance and reliability as a principal metric in large-scale storage systems. Here we define a novel technique in power-aware systems called power-aware coding, which exploits the structure of an erasure code, generally used to provide data reliability, to save power in a storage system. We define a minimal device activation policy for a power-aware storage system and define the properties of optimal codes under this policy. A suite of metrics is derived and used to compare the relative expected power savings of arbitrary linear erasure codes. The metrics and the reliability simulation framework are used to perform a rudimentary exploration of the power-space-reliability tradeoff in a system that employs power-aware coding.

Publication date:
December 2009

Authors:
Kevin Greenan

Projects:
Reliable Storage

BibTeX entry

@techreport{ssrctr-09-08,
  author       = {Kevin Greenan},
  title        = {Reliability and Power-Efficiency in Erasure-Coded Storage Systems},
  institution  = {University of California, Santa Cruz},
  number       = {UCSC-SSRC-09-08},
  month        = dec,
  year         = {2009},
}
Last modified 8 Dec 2009