HECURA: Scalable Data Management @ SSRC

This project is no longer active. Information is still available below.

Modern high-end computing (HEC) systems must manage petabytes of data stored in billions of files, yet techniques for naming and managing files were developed 40 years ago for collections of thousands of files and do not scale well to such large systems. We propose a revolution in the way files are identified and managed that leverages provenance (a record of the data and processes that contributed to a file's creation), content, and other attributes to provide a scalable and searchable file namespace that tracks data as it moves through the scientific workflow. By including semantic concepts alongside traditional file system metadata and considering file provenance and per-user personalization, our approach allows scientists to better find and utilize the data they need, using both content and data history to identify and manage stored information.

Imagine a world where it is as easy to locate and browse data sets as it is to search the web. The strengths and weaknesses of the web provide several useful lessons that we leverage: 1) Although the web implements a hierarchical namespace, search has become the dominant navigation tool; 2) While finding some information is easy, finding the right or good information is not; 3) The easier it is for people to contribute information to a repository, the more critical it becomes to determine the veracity of that data; 4) The links that relate documents provide valuable insight into the importance of documents.

By applying these lessons and addressing the unmet challenges the web introduced, we will design a file system that leverages file attributes, both explicit and implicit relationships among files, and user-assigned names to facilitate management of petabytes of data. Rather than providing static “directories,” we generate dynamic directories in response to search queries that are simple to express and quick to execute. Building upon our web analogy, in which the links among documents reveal their importance, we can achieve a similar breakthrough for file search by incorporating provenance relationships and the underlying dynamic graph that such relationships represent. Furthermore, provenance provides a tool with which we can address the challenge of data quality, allowing users to make intelligent choices about data sets based on their origin, excluding data from suspect or unknown sources.
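The idea of a dynamic directory filtered by provenance can be illustrated with a small sketch. It is not the project's implementation: the FileRecord class, its field names, and the dynamic_directory and ancestors helpers are all hypothetical, and the in-memory dictionary stands in for a real metadata index. The sketch treats a directory as the result of an attribute query and drops any file whose lineage includes a source the user has marked as suspect.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    name: str
    attrs: dict
    derived_from: list = field(default_factory=list)  # provenance: names of input files


def ancestors(catalog, name):
    """Return every file that `name` transitively derives from."""
    seen = set()
    stack = list(catalog[name].derived_from)
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if parent in catalog:  # inputs missing from the catalog are recorded but not expanded
            stack.extend(catalog[parent].derived_from)
    return seen


def dynamic_directory(catalog, query, suspect=frozenset()):
    """List files whose attributes match `query` and whose lineage avoids `suspect`."""
    listing = []
    for rec in catalog.values():
        if any(rec.attrs.get(k) != v for k, v in query.items()):
            continue
        lineage = ancestors(catalog, rec.name) | {rec.name}
        if lineage & set(suspect):
            continue
        listing.append(rec.name)
    return sorted(listing)


# Made-up example data: two raw inputs and three derived files.
catalog = {
    r.name: r
    for r in [
        FileRecord("raw_a.dat", {"kind": "raw", "source": "sensor-1"}),
        FileRecord("raw_b.dat", {"kind": "raw", "source": "unknown"}),
        FileRecord("clean_a.h5", {"kind": "derived", "project": "climate"}, ["raw_a.dat"]),
        FileRecord("clean_b.h5", {"kind": "derived", "project": "climate"}, ["raw_b.dat"]),
        FileRecord("merged.h5", {"kind": "derived", "project": "climate"},
                   ["clean_a.h5", "clean_b.h5"]),
    ]
}

everything = dynamic_directory(catalog, {"project": "climate"})
trusted = dynamic_directory(catalog, {"project": "climate"}, suspect={"raw_b.dat"})
print(everything)  # ['clean_a.h5', 'clean_b.h5', 'merged.h5']
print(trusted)     # ['clean_a.h5']
```

The same query yields a different "directory" once raw_b.dat is marked suspect, because both clean_b.h5 and merged.h5 descend from it. A linear scan is used here only for brevity; a real system would answer the query from an index.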

Realizing this vision requires that we design mechanisms that gather, maintain, and index the large volume of metadata and provenance information generated by HEC file system applications and users. We will leverage our experience in gathering and using provenance data and building partitioned indexes for simple metadata to construct partitioned indexes that contain provenance information and all types of metadata. We will explore both the use of storage class memories and caching of index partitions to ensure that performance remains high even on multi-petabyte file systems, allowing our metadata index to replace “traditional” file system metadata structures and support both a flexible namespace and mechanisms to use provenance to improve search quality and confidence in stored data. By doing so, the research proposed by this project will enable HEC users and, more broadly, all users of computer storage to find, manage, and share their data more effectively, increasing the utility of the vast quantity of digitally stored information.
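As a rough illustration of a partitioned index with cached partitions, the sketch below keeps only a few partitions in memory and evicts the least recently used one when the cache is full. Everything in it is an assumption made for illustration: the PartitionedIndex class, partitioning by owner, the LRU policy, and the dictionary that stands in for slow storage are not taken from the project's design.

```python
from collections import OrderedDict


class PartitionedIndex:
    """Toy attribute index split into partitions, with an LRU cache of loaded partitions."""

    def __init__(self, partitions, cache_size=2):
        # `partitions` stands in for slow storage: partition id -> {file name: attrs}.
        self._store = partitions
        self._cache = OrderedDict()
        self._cache_size = cache_size
        self.loads = 0  # counts simulated reads from slow storage

    def _load(self, pid):
        if pid in self._cache:
            self._cache.move_to_end(pid)  # mark as most recently used
            return self._cache[pid]
        self.loads += 1
        data = self._store[pid]
        self._cache[pid] = data
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used partition
        return data

    def search(self, query, partition_ids=None):
        """Return sorted names of files matching every key/value pair in `query`."""
        pids = list(self._store) if partition_ids is None else partition_ids
        hits = []
        for pid in pids:
            for name, attrs in self._load(pid).items():
                if all(attrs.get(k) == v for k, v in query.items()):
                    hits.append(name)
        return sorted(hits)


# Made-up example data, partitioned by owner (an assumption of this sketch).
index = PartitionedIndex(
    {
        "alice": {"run1.h5": {"owner": "alice", "type": "hdf5"}},
        "bob": {
            "notes.txt": {"owner": "bob", "type": "text"},
            "run2.h5": {"owner": "bob", "type": "hdf5"},
        },
        "carol": {"run3.h5": {"owner": "carol", "type": "hdf5"}},
    },
    cache_size=2,
)

first = index.search({"type": "hdf5"}, partition_ids=["alice", "bob"])
loads_after_first = index.loads
second = index.search({"type": "hdf5"}, partition_ids=["alice", "bob"])
loads_after_second = index.loads
print(first, loads_after_first, loads_after_second)
```

Restricting a query to the relevant partitions limits how much of the index must be touched, and the repeated query is served entirely from the cache, so the load counter does not increase on the second search.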

Status

This project was funded in Fall 2009.

Faculty

Students

Associates

Margo Seltzer

Alumni

Kiran-Kumar Muniswamy-Reddy

Publications

Date		Publication
Mar 2, 2012		Aleatha Parker-Wood, Darrell D. E. Long, Ethan L. Miller, Margo Seltzer, Daniel Tunkelang, Making Sense of File Systems Through Provenance and Rich Metadata, Technical Report UCSC-SSRC-12-01, March 2012. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [HECURA: Scalable Data Management]
Sep 12, 2011		Christina Strong, Stephanie Jones, Aleatha Parker-Wood, Alexandra Holloway, Darrell D. E. Long, Los Alamos National Laboratory Interviews, Technical Report UCSC-SSRC-11-06, September 2011. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage]
May 27, 2011		Yulai Xie, Kiran-Kumar Muniswamy-Reddy, Dan Feng, Darrell D. E. Long, Yangwook Kang, Zhongying Niu, Zhipeng Tan, Design and Evaluation of Oasis: An Active Storage Framework based on T10 OSD Standard, Proceedings of the 27th IEEE Symposium on Massive Storage Systems and Technologies (MSST 2011), May 2011. [HECURA: Scalable Data Management] [Ultra-Large Scale Storage]
May 6, 2010		Aleatha Parker-Wood, Christina Strong, Ethan L. Miller, Darrell D. E. Long, Security Aware Partitioning for Efficient File System Search, 26th IEEE Symposium on Massive Storage Systems and Technologies: Research Track (MSST 2010), May 2010. [Scalable File System Indexing] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage] [Prediction and Grouping]
Dec 10, 2009		Andrew Leung, Organizing, Indexing, and Searching Large-Scale File Systems, Technical Report UCSC-SSRC-09-09, December 2009. [Scalable File System Indexing] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage]
Nov 13, 2009		Andrew Leung, Ian Adams, Ethan L. Miller, Magellan: A Searchable Metadata Architecture for Large-Scale File Systems, Technical Report UCSC-SSRC-09-07, November 2009. [Scalable File System Indexing] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage]
Jul 1, 1991		Richard Golding, Darrell D. E. Long, Accessing Replicated Data in a Large-Scale Distributed System, International Journal in Computer Simulation 1(4), July 1991, pages 347-372. [Scalable File System Indexing] [HECURA: Scalable Data Management]

Last modified 23 May 2019