Dynamic Non-Hierarchical File Systems @ SSRC

This project is no longer active. Information is still available below.

Modern high-end computing (HEC) systems must manage petabytes (going on exabytes) of data stored in billions of files, yet current techniques for naming and managing files were developed 40 years ago for collections of thousands of files. HEC users are therefore forced to adapt their usage to fit an outdated file system model and interface, unsuitable for exascale systems. We had the opportunity to meet with scientists at several of the national laboratories. We talked with them about the science they do and how they use the supercomputers.

From these discussions we have learned several lessons: 1) Hierarchical namespaces have become a hindrance rather than a help; 2) Currently it is easier, and faster, for scientists to manage their own metadata than to search for data they have stored; 3) While finding some data can be easy, finding the right or good data is not. From these observations we can see that simply modifying existing high-performance file systems to store the additional semantic metadata required would be woefully inadequate.

We propose to develop a radically different file system structure that addresses these problems directly, and which will leverage provenance (a record of the data and processes that contributed to a file's creation), file content, and rich semantic metadata to provide a scalable and searchable file name space. Such a name space would allow the tracking of data as it moves through the scientific workflow. This allows scientists to better find and utilize the data they need, using both content and data history to identify and manage stored information. We take advantage of the familiar search-based metaphor to provide an initial easy-to-use interface that enables users to find the files they need and evaluate the authenticity and quality of those files. Realizing this vision requires research success in dynamic, nonhierarchical file system design and implementation, large-scale metadata management, efficient scalable indexing, and automatic provenance capture.

Status

We propose a dynamic nonhierarchical file system which includes automatically collected information flow provenance in addition to traditional metadata. Information flow provenance will automatically create and track relationships among files, allowing a visualization file to be related to the input deck used to create it as well as the calculation that was run. This dynamic and automatic addition of relationships will not only allow the user to be presented with a personalized view of related data, but also potentially allow the user to make connections they were otherwise unable to see.
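As an illustration only, the relationship between a visualization file, the calculation that produced it, and the original input deck can be modeled as a small directed graph. The sketch below is a toy in-memory model, not the project's capture mechanism; the class `ProvenanceGraph`, its methods, and the file and process names are all hypothetical.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy information-flow provenance store (hypothetical, for illustration).

    Edges point from an artifact to the things it was derived from.
    """

    def __init__(self):
        self.inputs = defaultdict(set)  # artifact -> artifacts it depends on

    def record(self, process, reads, writes):
        """Record one process run: the process depends on what it read,
        and each file it wrote depends on the process."""
        self.inputs[process].update(reads)
        for out in writes:
            self.inputs[out].add(process)

    def ancestors(self, artifact):
        """Return everything that contributed, directly or indirectly."""
        seen, stack = set(), [artifact]
        while stack:
            for parent in self.inputs.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = ProvenanceGraph()
graph.record("run:simulate", reads=["input.deck"], writes=["state.h5"])
graph.record("run:render", reads=["state.h5"], writes=["plot.png"])
lineage = graph.ancestors("plot.png")
```

Here `lineage` contains both process runs, the intermediate `state.h5`, and `input.deck`, so the plot can be traced back to the input deck without the user having recorded that link by hand.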

We are exploring the benefits to be gained by expanding on the functionality provided by file system indexes, providing features not typically available in current file systems and search indexes. We are currently working on creating a unified search space over traditional metadata, content-based metadata, and provenance that will help find relevant files regardless of where they are stored. We have examined scientific metadata from a variety of disciplines in order to better understand its properties. Most metadata studies have focused on POSIX metadata, which is homogeneous, low-dimensional, predominantly numeric, and has no missing values. However, we have discovered that scientific metadata is heterogeneous, high-dimensional, a mixture of numeric, textual, and categorical values, and very sparse (even within a single discipline and object type). We are using data from this study to inform choices in designing a new type of on-demand scientific data index.
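One way to picture why sparse, heterogeneous metadata suggests building index structures on demand is the sketch below. It is an assumption about what "on-demand" could mean, not the project's design: each attribute gets its own inverted column, built only the first time a query mentions it, and files that lack the attribute simply do not appear in that column. The class `OnDemandIndex` and the sample attributes are hypothetical.

```python
class OnDemandIndex:
    """Per-attribute inverted index built lazily (hypothetical sketch)."""

    def __init__(self, records):
        self.records = records  # path -> dict of (sparse) attributes
        self.columns = {}       # attribute -> {value: set of paths}

    def _column(self, attr):
        # Build the column for this attribute only when first queried.
        if attr not in self.columns:
            column = {}
            for path, metadata in self.records.items():
                if attr in metadata:  # missing values are just skipped
                    column.setdefault(metadata[attr], set()).add(path)
            self.columns[attr] = column
        return self.columns[attr]

    def lookup(self, **equals):
        """Return paths whose attributes equal all the given values."""
        result = None
        for attr, value in equals.items():
            paths = self._column(attr).get(value, set())
            result = set(paths) if result is None else result & paths
        return sorted(result or [])


records = {
    "/data/run1.h5": {"instrument": "CTD", "depth_m": 120, "pi": "lee"},
    "/data/run2.h5": {"instrument": "CTD", "pi": "ortiz"},
    "/data/scan.fits": {"telescope": "Keck", "exposure_s": 300.0},
}
index = OnDemandIndex(records)
ctd_by_lee = index.lookup(instrument="CTD", pi="lee")
```

After this query only the `instrument` and `pi` columns exist; attributes nobody has searched on, such as `telescope`, cost nothing to index.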

Additionally, search must enforce file security; however, doing so efficiently is not straightforward. Our techniques allow security information to be used during index partitioning and embedded within each partition. Doing so allows us to eliminate partitions with improper permissions from the search space, improving performance and potentially altering the ordering of returned results.
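The pruning idea can be sketched as follows, under simplifying assumptions that are not from the project itself: files are grouped into partitions by their exact set of permitted readers, and a query skips any partition whose reader set shares nothing with the searcher's principals. The names `Partition` and `SecureIndex` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    readers: frozenset                         # principals allowed to read
    files: dict = field(default_factory=dict)  # path -> metadata dict


class SecureIndex:
    """Index partitioned by permission set (hypothetical sketch)."""

    def __init__(self):
        self.partitions = {}  # frozenset of readers -> Partition

    def add(self, path, metadata, readers):
        key = frozenset(readers)
        partition = self.partitions.setdefault(key, Partition(key))
        partition.files[path] = metadata

    def search(self, principals, predicate):
        principals = set(principals)
        hits = []
        for partition in self.partitions.values():
            if partition.readers.isdisjoint(principals):
                continue  # whole partition pruned; no entry is examined
            hits.extend(p for p, md in partition.files.items() if predicate(md))
        return sorted(hits)


index = SecureIndex()
index.add("/proj/a.h5", {"type": "hdf5"}, readers={"alice", "physics"})
index.add("/proj/b.h5", {"type": "hdf5"}, readers={"bob"})
visible = index.search({"alice"}, lambda md: md["type"] == "hdf5")
```

Because the permission check happens once per partition rather than once per file, a user with access to little of the system searches only a small part of the index.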

File system metadata should be treated as an aid to managing and accessing data and not a rigid and limited structure to which the user must conform. To this end we propose to enhance metadata management to provide seamless support for a search-based dynamic interface to the files. File system search provides a clean, powerful abstraction from the file system. It is often easier to specify what one wants using file metadata and extended attributes rather than specifying where to find it. Searchable metadata allows users and administrators to ask complex, ad hoc questions about the properties of the files being stored, helping them to locate, manage, and analyze their data.

Web users are familiar with the problem of “information overload” in response to a search query; we will reduce this problem in our system by adding importance ranking and by facilitating searches that are restricted to a local region of the provenance and relationship graph. This combination of file relationship information and per-file metadata has strong promise to greatly improve the quality of searches, so we will explore approaches that allow queries to include this information.
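A search restricted to a local region of the graph might look like the sketch below, which assumes the region is defined as "everything within a fixed number of hops of a starting file" and that relationships can be followed in either direction. Both assumptions, and the function names, are illustrative rather than taken from the project.

```python
from collections import deque


def neighborhood(edges, start, max_hops):
    """Return {node: hop distance} for nodes within max_hops of start,
    treating relationship edges as undirected."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the region boundary
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def local_search(edges, metadata, start, max_hops, predicate):
    """Match per-file metadata only inside the region, nearest files first."""
    region = neighborhood(edges, start, max_hops)
    matches = [n for n in region if predicate(metadata.get(n, {}))]
    return sorted(matches, key=lambda n: (region[n], n))


edges = [("plot.png", "state.h5"), ("state.h5", "input.deck"),
         ("input.deck", "mesh.h5"), ("other.png", "unrelated.h5")]
metadata = {"state.h5": {"type": "hdf5"}, "mesh.h5": {"type": "hdf5"},
            "unrelated.h5": {"type": "hdf5"}}
nearby = local_search(edges, metadata, "plot.png", 2,
                      lambda md: md.get("type") == "hdf5")
```

With a two-hop limit the query sees `state.h5` but neither the more distant `mesh.h5` nor the unconnected `unrelated.h5`, which is how combining relationships with metadata can cut down the result list.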

In order to do ranking, we are exploring eigenvector analysis on the provenance graph, similar to Google's PageRank. As with a web graph, provenance allows us to examine what files scientists think are useful and worth deriving from. However, naively applying PageRank to a provenance graph simply results in ranking frequently used roots (such as gcc) as most important. Instead, by modifying the PageRank transition function, we can favor newer, less ubiquitous, but still frequently used files.
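The text does not say how the transition function is modified, so the following is only one plausible sketch: a power-iteration PageRank in which a derived file passes its rank to its inputs in proportion to each input's freshness, with freshness decaying exponentially by age. The function name, the half-life parameter, and the sample graph are assumptions; passing `half_life=None` gives the unmodified, uniform transition for comparison.

```python
def rank_files(derived_from, age_days, half_life=30.0, damping=0.85,
               iterations=50):
    """PageRank over a provenance graph with a recency-weighted transition.

    derived_from maps each file to the list of files it was derived from.
    This weighting is an illustrative assumption, not the project's method.
    """
    nodes = sorted(set(derived_from)
                   | {p for parents in derived_from.values() for p in parents})
    if half_life is None:
        freshness = {n: 1.0 for n in nodes}  # plain PageRank
    else:
        freshness = {n: 0.5 ** (age_days.get(n, 0.0) / half_life)
                     for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            parents = derived_from.get(node, [])
            total = sum(freshness[p] for p in parents)
            if total == 0.0:
                # Root with no inputs: spread its rank uniformly.
                for n in nodes:
                    updated[n] += damping * rank[node] / len(nodes)
            else:
                for p in parents:
                    updated[p] += damping * rank[node] * freshness[p] / total
        rank = updated
    return rank


derived_from = {"a": ["gcc", "mesh_tool"], "b": ["gcc", "mesh_tool"],
                "c": ["gcc", "mesh_tool"], "d": ["gcc"], "e": ["gcc"]}
age_days = {"gcc": 3000.0, "mesh_tool": 5.0}
naive = rank_files(derived_from, age_days, half_life=None)
tuned = rank_files(derived_from, age_days)
```

In this toy graph the uniform transition ranks `gcc` above `mesh_tool` because every file derives from it, while the recency-weighted version ranks the newer `mesh_tool` higher. A fuller treatment would also need to discount ubiquity directly, which this sketch does not attempt.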

As we increase the amount of information we store and require access to, optimizing the computing system becomes increasingly important. Such a system must be fast enough to respond to the user, while maintaining an equilibrium between saving energy and using the system to its full potential. This must be accomplished without noticeable degradation in the reliability and security of the system.

We are developing tools to help achieve a balanced, reliable, and secure system of any scale. Horus, a keyed hash tree, encrypts data and supports a much finer-grained approach to security than can currently be achieved. We are also developing a data allocation algorithm that places data on devices by optimizing for multiple objectives, including energy, performance, and reliability.
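To give a sense of how a keyed hash tree can support fine-grained access, the sketch below derives each node's key from its parent's key with a keyed hash, so that holding the key for one subtree is enough to derive keys for every block range beneath it but not for sibling ranges. This is a generic construction written for illustration; the use of HMAC-SHA256, the label format, and the function names are assumptions and should not be read as Horus's actual design.

```python
import hashlib
import hmac


def child_key(parent_key: bytes, level: int, index: int) -> bytes:
    """Derive a child's key from its parent's key (assumed HMAC-SHA256)."""
    label = f"{level}:{index}".encode()
    return hmac.new(parent_key, label, hashlib.sha256).digest()


def derive(key, level, index, target_level, target_index, branching):
    """Derive the key of a descendant node from an ancestor's key.

    Nodes are addressed as (level, index); the root is (0, 0) and node i at
    one level is the parent of nodes i*branching .. i*branching+branching-1.
    """
    if target_level < level:
        raise ValueError("target is above the starting node")
    path = []
    lvl, idx = target_level, target_index
    while lvl > level:
        path.append((lvl, idx))
        lvl, idx = lvl - 1, idx // branching
    if idx != index:
        raise ValueError("target is not in the subtree of the starting node")
    for lvl, idx in reversed(path):
        key = child_key(key, lvl, idx)
    return key


root_key = bytes(32)  # placeholder; a real root key must be random and secret
block_key = derive(root_key, 0, 0, 3, 37, branching=4)
subtree_key = derive(root_key, 0, 0, 1, 2, branching=4)
same_block_key = derive(subtree_key, 1, 2, 3, 37, branching=4)
```

Here `same_block_key` equals `block_key`, so a client given only `subtree_key` can reach block 37, whereas a client holding the key for a different level-1 node cannot, because `derive` has no path from that node to the block.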

Faculty

Alumni

Thomas Kroeger

Publications

Date		Publication
Apr 5, 2022		Devashish Purandare, Daniel Bittman, Ethan L. Miller, Analysis and Workload Characterization of the CERN EOS Storage System, Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS '22), April 2022. [Archival Storage] [Designing systems for QLC flash] [Dynamic Non-Hierarchical File Systems]
Jan 7, 2014		Christina Strong, Ahmed Amer, Darrell D. E. Long, Building JACK: Developing Metrics for Use in Multi-Objective Optimal Data Allocation Strategies, Technical Report UCSC-SSRC-14-01, January 2014. [Dynamic Non-Hierarchical File Systems]
Jun 30, 2013		Aleatha Parker-Wood, Brian Madden, Michael McThrow, Darrell D. E. Long, Ian Adams, Avani Wildani, Examining Extended and Scientific Metadata for Scalable Index Designs, Proceedings of the 6th International Systems and Storage Conference (SYSTOR 2013), June 2013. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems]
Jan 28, 2013		Thomas Schwarz, Ignacio Corderi, Darrell D. E. Long, Jehan-François Pâris, Simple, Exact Placement of Data in Containers, Proceedings of the International Conference on Computing, Networking and Communications (ICNC), January 2013. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems]
Dec 14, 2012		Aleatha Parker-Wood, Brian Madden, Michael McThrow, Darrell D. E. Long, Examining Extended and Scientific Metadata for Scalable Index Designs, Technical Report UCSC-SSRC-12-07, December 2012. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [Ultra-Large Scale Storage]
Oct 29, 2012		Yulai Xie, Kiran-Kumar Muniswamy-Reddy, Dan Feng, Yan Li, Darrell D. E. Long, Zhipeng Tan, Lei Chen, A Hybrid Approach for Efficient Provenance Storage, The 21st ACM Conference on Information and Knowledge Management (CIKM), October 2012. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [Ultra-Large Scale Storage]
Mar 2, 2012		Aleatha Parker-Wood, Darrell D. E. Long, Ethan L. Miller, Margo Seltzer, Daniel Tunkelang, Making Sense of File Systems Through Provenance and Rich Metadata, Technical Report UCSC-SSRC-12-01, March 2012. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [HECURA: Scalable Data Management]
Nov 6, 2011		Stephanie Jones, Christina Strong, Aleatha Parker-Wood, Alexandra Holloway, Darrell D. E. Long, Easing the Burdens of HPC File Management, Proceedings of the 6th Parallel Data Storage Workshop (PDSW '11), November 2011. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems]
Sep 12, 2011		Christina Strong, Stephanie Jones, Aleatha Parker-Wood, Alexandra Holloway, Darrell D. E. Long, Los Alamos National Laboratory Interviews, Technical Report UCSC-SSRC-11-06, September 2011. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage]
Jun 20, 2011		Yulai Xie, Kiran-Kumar Muniswamy-Reddy, Darrell D. E. Long, Ahmed Amer, Dan Feng, Zhipeng Tan, Compressing Provenance Graphs, 3rd USENIX Workshop on the Theory and Practice of Provenance, June 2011. [Archival Storage] [Dynamic Non-Hierarchical File Systems] [Ultra-Large Scale Storage]
Jun 20, 2011		Stephanie Jones, Christina Strong, Darrell D. E. Long, Ethan L. Miller, Tracking Emigrant Data via Transient Provenance, Proceedings of the 3rd USENIX Workshop on the Theory and Practice of Provenance (TaPP '11), June 2011. [Secure File and Storage Systems] [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems]
Oct 1, 2001		Randal Burns, Robert M. Rees, Larry Stockmeyer, Darrell D. E. Long, Scalable Session Locking for a Distributed File System, Cluster Computing Journal 4, October 2001, pages 295-306. [Dynamic Non-Hierarchical File Systems]

Last modified 19 Oct 2020