SSRC talk: Finding fault-tolerant XOR-based erasure codes for storage
Jay Wylie (HP Labs)
XOR-based erasure codes have had a tremendous impact on networked systems in the recent past, but their impact on clustered storage systems has not yet been felt: replication and RAID continue to dominate there. We believe that a clear understanding of the XOR-based erasure codes suited to clustered storage systems, rather than to networked systems, will facilitate their adoption.
To this end, we have identified a new fault tolerance metric for XOR-based erasure codes: the minimal erasures list (MEL). The MEL completely describes the fault tolerance of an XOR-based erasure code at and beyond its Hamming distance, which makes it a useful metric for comparing the fault tolerance of such codes. We have also developed the ME algorithm, which efficiently determines the MEL of an erasure code. Using the ME algorithm (with some extensions), we have found the most fault-tolerant XOR-based erasure codes with up to seven data symbols and seven parity symbols. These codes are directly applicable in clustered storage systems today.
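To make the MEL concrete, here is a minimal brute-force sketch in Python. It is not the ME algorithm from the talk: it simply enumerates erasure patterns, which is only practical for very small codes. It assumes that a minimal erasure is a set of erased symbols that causes data loss and has no proper subset that also causes data loss. Each symbol is represented as a bitmask over the data symbols it XORs together. The function names (gf2_rank, minimal_erasures) and the example code are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def gf2_rank(vectors):
    """Rank over GF(2) of a list of integer bitmasks (Gaussian elimination)."""
    pivots = {}  # leading bit position -> basis vector with that leading bit
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]  # cancel the leading bit and keep reducing
    return len(pivots)


def minimal_erasures(k, parities):
    """Brute-force MEL of a systematic XOR code (illustrative, exponential time).

    k        -- number of data symbols; data symbol i has bitmask 1 << i
    parities -- one bitmask per parity symbol, naming the data symbols it XORs
    Symbols are indexed 0..k-1 (data) followed by the parities in order.
    """
    symbols = [1 << i for i in range(k)] + list(parities)
    n = len(symbols)

    def loses_data(erased):
        # Data is recoverable only if the surviving symbols still span all
        # k data symbols, i.e. their rank over GF(2) is k.
        survivors = [symbols[i] for i in range(n) if i not in erased]
        return gf2_rank(survivors) < k

    mel = []
    # Visiting patterns by increasing size means any lossy pattern that does
    # not contain an already recorded minimal erasure is itself minimal.
    for size in range(1, n + 1):
        for combo in combinations(range(n), size):
            erased = set(combo)
            if any(m <= erased for m in mel):
                continue
            if loses_data(erased):
                mel.append(frozenset(erased))
    return mel


if __name__ == "__main__":
    # Hypothetical toy code: 3 data symbols, parities d0^d1 and d1^d2.
    for erasure in minimal_erasures(3, [0b011, 0b110]):
        print(sorted(erasure))
```

In this sketch the size of the smallest entry in the returned list corresponds to the Hamming distance of the code, and the larger entries describe the fault tolerance beyond it. The exhaustive search grows exponentially with the number of symbols, which is why an efficient method such as the ME algorithm matters for codes of practical size.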