HECURA: Scalable Data Management

Modern high-end computing (HEC) systems must manage petabytes of data stored in billions of files, yet the techniques used to name and manage files were developed 40 years ago for collections of thousands of files and do not scale to systems of this size. We propose a revolution in the way files are identified and managed, one that leverages provenance (a record of the data and processes that contributed to a file's creation), content, and other attributes to provide a scalable, searchable file namespace that tracks data as it moves through the scientific workflow. By including semantic concepts alongside traditional file system metadata, and by considering file provenance and per-user personalization, our approach allows scientists to better find and use the data they need, using both content and data history to identify and manage stored information.

Imagine a world where it is as easy to locate and browse data sets as it is to search the web. The strengths and weaknesses of the web provide several useful lessons that we leverage: 1) Although the web implements a hierarchical namespace, search has become the dominant navigation tool; 2) While finding some information is easy, finding the right, high-quality information is not; 3) The easier it is for people to contribute information to a repository, the more critical it becomes to determine the veracity of that information; 4) The links that relate documents provide valuable insight into the importance of those documents.

By applying these lessons and addressing the unmet challenges the web introduced, we will design a file system that leverages file attributes, both explicit and implicit relationships among files, and user-assigned names to facilitate management of petabytes of data. Rather than providing static “directories,” we generate dynamic directories in response to search queries that are simple to express and quick to execute. Building upon our web analogy, in which the links among documents reveal their importance, we can achieve a similar breakthrough for file search by incorporating provenance relationships and the underlying dynamic graph that such relationships represent. Furthermore, provenance provides a tool with which we can address the challenge of data quality: users can make intelligent choices about data sets based on their origin and exclude data from suspect or unknown sources.
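The idea of a dynamic directory filtered by provenance can be illustrated with a minimal sketch. It is not the project's design: the `FileRecord` type, the `ancestors` and `dynamic_directory` functions, and the sample file names are all hypothetical, and the in-memory dictionary stands in for whatever metadata index a real file system would use.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical per-file metadata: attributes plus direct provenance."""
    name: str
    attrs: dict
    inputs: list = field(default_factory=list)  # names of files this one was derived from


def ancestors(catalog, name):
    """Return every file that `name` was derived from, directly or indirectly."""
    seen = set()
    stack = list(catalog[name].inputs)
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            if parent in catalog:
                stack.extend(catalog[parent].inputs)
    return seen


def dynamic_directory(catalog, required_attrs, untrusted_sources=frozenset()):
    """Build a directory listing on demand from an attribute query.

    A file is listed when all required attributes match and neither the file
    nor anything in its provenance is an untrusted source.
    """
    untrusted = set(untrusted_sources)
    listing = []
    for name, rec in catalog.items():
        if any(rec.attrs.get(k) != v for k, v in required_attrs.items()):
            continue
        if name in untrusted or ancestors(catalog, name) & untrusted:
            continue
        listing.append(name)
    return sorted(listing)


# Illustrative catalog (assumed data, not from the project).
catalog = {
    "raw_a.dat": FileRecord("raw_a.dat", {"kind": "raw"}),
    "raw_b.dat": FileRecord("raw_b.dat", {"kind": "raw"}),
    "sim_1.out": FileRecord("sim_1.out", {"kind": "result"}, ["raw_a.dat"]),
    "sim_2.out": FileRecord("sim_2.out", {"kind": "result"}, ["raw_b.dat"]),
    "plot.png": FileRecord("plot.png", {"kind": "result"}, ["sim_2.out"]),
}

# Query: all results, excluding anything derived from the suspect raw_b.dat.
print(dynamic_directory(catalog, {"kind": "result"}, untrusted_sources={"raw_b.dat"}))
```

In this sketch the directory is never stored; it is recomputed from attributes and the provenance graph each time it is requested, which is why an efficient index over both becomes essential at scale.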

Realizing this vision requires that we design mechanisms that gather, maintain, and index the large volume of metadata and provenance information generated by HEC file system applications and users. We will leverage our experience in gathering and using provenance data and in building partitioned indexes for simple metadata to construct partitioned indexes that contain provenance information and all types of metadata. We will explore both the use of storage class memories and the caching of index partitions to ensure that performance remains high even on multi-petabyte file systems. This will allow our metadata index to replace “traditional” file system metadata structures and to support both a flexible namespace and mechanisms that use provenance to improve search quality and confidence in stored data. By doing so, the research proposed by this project will enable HEC users and, more broadly, all users of computer storage to find, manage, and share their data more effectively, increasing the utility of the vast quantity of digitally stored information.
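As a rough illustration of caching index partitions, the sketch below keeps only a few partitions of a metadata index resident at a time and evicts the least recently used one. Everything here is an assumption made for illustration: the `PartitionedIndex` class, partitioning by attribute name, the toy hash, and the in-memory `_store` dictionary that stands in for slower backing storage. The project's actual partitioning scheme is not described in this summary.

```python
from collections import OrderedDict


class PartitionedIndex:
    """Hypothetical attribute index split into partitions with an LRU cache."""

    def __init__(self, num_partitions=4, cache_size=2):
        self.num_partitions = num_partitions
        self.cache_size = cache_size
        # Stand-in for partitions kept on slower storage.
        self._store = {p: {} for p in range(num_partitions)}
        self._cache = OrderedDict()  # partition id -> resident partition
        self.loads = 0               # number of simulated loads from storage

    def _partition_of(self, attr):
        # Toy deterministic hash on the attribute name (an assumption).
        return sum(attr.encode()) % self.num_partitions

    def _load(self, pid):
        part = self._store[pid]
        if pid in self._cache:
            self._cache.move_to_end(pid)         # mark as most recently used
        else:
            self.loads += 1
            self._cache[pid] = part
            if len(self._cache) > self.cache_size:
                self._cache.popitem(last=False)  # evict least recently used
        return part

    def add(self, attr, value, filename):
        part = self._load(self._partition_of(attr))
        part.setdefault((attr, value), set()).add(filename)

    def lookup(self, attr, value):
        part = self._load(self._partition_of(attr))
        return sorted(part.get((attr, value), ()))


index = PartitionedIndex(num_partitions=4, cache_size=2)
index.add("owner", "alice", "sim_1.out")
index.add("owner", "alice", "sim_2.out")
index.add("kind", "result", "sim_1.out")
print(index.lookup("owner", "alice"))
print("partition loads:", index.loads)
```

A query touches only the partition that holds the requested attribute, so the working set that must stay in fast memory can be far smaller than the full index.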

Status

This project was funded in Fall 2009.

Last modified 28 Jun 2010