Scalable File System Indexing
Description
As the number and variety of files stored and accessed by users increase dramatically, existing file system structures have begun to fail as a mechanism for managing all of the information contained in those files. Many applications, such as email clients, multimedia management applications, and desktop search engines, have been forced to develop their own richer metadata infrastructures. While effective, these solutions are generally non-standard, non-portable, and potentially non-scalable. These issues suggest that search, indexing, and information retrieval are becoming increasingly important areas for file and storage systems. In conjunction with faculty and students specializing in information retrieval at the UC Santa Cruz Department for Information Systems and Technology Management, we are developing system architectures that address these issues and scale to billions of files.
Status
Our current areas of focus are scalable indexing architectures for storage systems, improved file system namespaces, and incorporating concepts from databases and information retrieval, such as ranked search and more intelligent indexes, into file systems. We particularly emphasize queries over extended and user-supplied metadata, such as scientific metadata and document metadata.
We are exploring new file system designs where search is first-class functionality rather than an afterthought. The current approach of using a search index in addition to the file system's index requires two large and separate index structures to be maintained. This separation forces users and applications to access and update two structures when using their data. Our file system designs take a new approach to internal file system structures, layouts, and logging, optimizing each for search. The new design can improve search performance, allow data layouts based on how files are queried, and improve efficiency by reducing the number of index structures that must be maintained. We are in the process of implementing these concepts in the Ceph distributed file system.
In addition, we are actively researching effective ways of partitioning metadata into indexes. A partitioned metadata index can rule out irrelevant files and quickly focus on files that are more likely to match the search criteria. By integrating partitioning with security criteria, we have developed a highly scalable design for metadata search. This integration allows us to eliminate files that the querier cannot view without ever loading the corresponding indexes from storage. We have implemented and tested this system. Our results are available in the proceedings of the 26th IEEE Symposium on Massive Storage Systems and Technologies.
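The idea of combining partitioning with security criteria can be illustrated with a minimal Python sketch. It is not the implemented system: the `Partition` class, the `search` function, the choice to partition by reader set, and the file-extension summary are all assumptions made for illustration. Each partition records who may read its files and which extensions it contains, so a query can skip a partition without scanning its entries.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical index partition: files sharing one set of readers."""
    readers: frozenset                              # users who may see these files
    extensions: set = field(default_factory=set)    # summary of extensions present
    files: list = field(default_factory=list)       # (path, metadata) pairs

    def add(self, path, metadata):
        self.files.append((path, metadata))
        self.extensions.add(metadata["ext"])


def search(partitions, user, ext):
    """Return (sorted matching paths, number of partitions scanned)."""
    hits, scanned = [], 0
    for p in partitions:
        if user not in p.readers:
            continue            # security pruning: partition is never scanned
        if ext not in p.extensions:
            continue            # summary pruning: no entry here can match
        scanned += 1
        hits.extend(path for path, md in p.files if md["ext"] == ext)
    return sorted(hits), scanned


# Illustrative data only.
alice_part = Partition(readers=frozenset({"alice"}))
alice_part.add("/home/alice/paper.pdf", {"ext": "pdf"})
shared_part = Partition(readers=frozenset({"alice", "bob"}))
shared_part.add("/proj/data.h5", {"ext": "h5"})
shared_part.add("/proj/notes.pdf", {"ext": "pdf"})
PARTITIONS = [alice_part, shared_part]

if __name__ == "__main__":
    print(search(PARTITIONS, "bob", "pdf"))
```

In this sketch a query from `bob` scans only the shared partition, and a user with no access scans none. A real design would keep the partitions on storage and load only those that survive both checks.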
Our previous work in this area includes a collaboration with NetApp. Our metadata search index, Spyglass, exploits characteristics unique to storage systems, such as data distributions and hierarchical namespaces, in new search and indexing algorithms. Its search performance can exceed that of basic DBMS-based solutions by up to four orders of magnitude; it also allows time-traveling queries over versioned metadata and can efficiently re-crawl very large file systems.
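To make "time-traveling queries over versioned metadata" concrete, here is a small Python sketch. It does not reproduce Spyglass's index structures. The `VersionedMetadata` class and its method names are hypothetical. The store keeps every recorded version of a file's metadata and answers a query as of a chosen timestamp.

```python
import bisect
from collections import defaultdict


class VersionedMetadata:
    """Hypothetical store that keeps all metadata versions per path."""

    def __init__(self):
        self._times = defaultdict(list)    # path -> increasing timestamps
        self._values = defaultdict(list)   # path -> metadata at each timestamp

    def record(self, path, timestamp, metadata):
        """Append a new version; timestamps for a path must increase."""
        times = self._times[path]
        if times and timestamp <= times[-1]:
            raise ValueError("versions must be recorded in time order")
        times.append(timestamp)
        self._values[path].append(dict(metadata))

    def as_of(self, path, timestamp):
        """Return the metadata in effect at `timestamp`, or None."""
        times = self._times.get(path, [])
        i = bisect.bisect_right(times, timestamp)
        return self._values[path][i - 1] if i else None

    def query(self, timestamp, predicate):
        """Paths whose metadata at `timestamp` satisfies `predicate`."""
        hits = []
        for path in self._times:
            md = self.as_of(path, timestamp)
            if md is not None and predicate(md):
                hits.append(path)
        return sorted(hits)


# Illustrative data only.
store = VersionedMetadata()
store.record("/a", 10, {"size": 100})
store.record("/a", 20, {"size": 5000})
store.record("/b", 15, {"size": 9000})


def is_large(md):
    return md["size"] > 1000


if __name__ == "__main__":
    print(store.query(17, is_large))
    print(store.query(25, is_large))
```

The same predicate gives different answers at different timestamps because each file's metadata is resolved to the version in effect at that time.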
We have also worked on a file system query language, QUASAR, that gives users powerful semantic access to stored data. QUASAR allows semantic file system views and directories to be created, which provide more meaningful data representations. Inter-file relationships, such as provenance, can be expressed and searched through links. To aid browsing, we are investigating applying faceted search to QUASAR. Faceted search uses rich key-value metadata to allow users to interactively navigate the search space and can allow interfaces to be automatically personalized for each user.
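Faceted navigation over key-value metadata can be sketched in a few lines of Python. This is not QUASAR syntax or its implementation. The functions `facet_counts` and `refine` and the sample metadata are assumptions for illustration. The interface shows, for the current result set, how many files carry each value of each key, and narrows the set when the user picks one.

```python
from collections import Counter, defaultdict


def facet_counts(files):
    """For each metadata key, count how many files carry each value."""
    counts = defaultdict(Counter)
    for metadata in files.values():
        for key, value in metadata.items():
            counts[key][value] += 1
    return counts


def refine(files, key, value):
    """Narrow the result set to files whose `key` equals `value`."""
    return {path: md for path, md in files.items() if md.get(key) == value}


# Illustrative data only.
FILES = {
    "/data/run1.h5": {"owner": "alice", "type": "hdf5", "project": "climate"},
    "/data/run2.h5": {"owner": "bob", "type": "hdf5", "project": "climate"},
    "/docs/plan.pdf": {"owner": "alice", "type": "pdf", "project": "climate"},
}

if __name__ == "__main__":
    print(dict(facet_counts(FILES)["type"]))
    narrowed = refine(FILES, "owner", "alice")
    print(sorted(narrowed))
    print(dict(facet_counts(narrowed)["type"]))
```

Recomputing the counts after each refinement is what lets a user navigate interactively: every step shows which choices remain and how many files each would leave.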
Publications
2013
-
Aleatha Parker-Wood,
Brian Madden,
Michael McThrow,
Darrell D. E. Long,
Ian Adams,
Avani Wildani,
Examining Extended and Scientific Metadata for Scalable Index Designs,
Systor 2013,
June 2013.
-
Thomas Schwarz,
Ignacio Corderi,
Darrell D. E. Long,
Jehan-François Pâris,
Simple, Exact Placement of Data in Containers,
Proceedings of the International Conference on Computing, Networking and Communications (ICNC),
January 2013.
2012
-
Aleatha Parker-Wood,
Brian Madden,
Michael McThrow,
Darrell D. E. Long,
Examining Extended and Scientific Metadata for Scalable Index Designs,
Technical Report UCSC-SSRC-12-07,
December 2012.
-
Yulai Xie,
Kiran-Kumar Muniswamy-Reddy,
Dan Feng,
Yan Li,
Darrell D. E. Long,
Zhipeng Tan,
Lei Chen,
A Hybrid Approach for Efficient Provenance Storage,
The 21st ACM Conference on Information and Knowledge Management (CIKM),
October 2012.
-
Aleatha Parker-Wood,
Darrell D. E. Long,
Ethan L. Miller,
Margo Seltzer,
Daniel Tunkelang,
Making Sense of File Systems Through Provenance and Rich Metadata,
Technical Report UCSC-SSRC-12-01,
March 2012.
2011
-
Stephanie Jones,
Christina Strong,
Aleatha Parker-Wood,
Alexandra Holloway,
Darrell D. E. Long,
Easing the Burdens of HPC File Management,
Proceedings of the 6th Parallel Data Storage Workshop (PDSW '11),
November 2011.
-
Christina Strong,
Stephanie Jones,
Aleatha Parker-Wood,
Alexandra Holloway,
Darrell D. E. Long,
Los Alamos National Laboratory Interviews,
Technical Report UCSC-SSRC-11-06,
September 2011.
-
Stephanie Jones,
Christina Strong,
Darrell D. E. Long,
Ethan L. Miller,
Tracking Emigrant Data via Transient Provenance,
Proceedings of the 3rd USENIX Workshop on the Theory and Practice of Provenance (TaPP '11),
June 2011.
2010
2009
-
Andrew Leung,
Organizing, Indexing, and Searching Large-Scale File Systems,
Technical Report UCSC-SSRC-09-09,
December 2009.
-
Andrew Leung,
Ian Adams,
Ethan L. Miller,
Magellan: A Searchable Metadata Architecture for Large-Scale File Systems,
Technical Report UCSC-SSRC-09-07,
November 2009.
-
Andrew Leung,
Aleatha Parker-Wood,
Ethan L. Miller,
Copernicus: A Scalable, High-Performance Semantic File System,
Technical Report UCSC-SSRC-09-06,
October 2009.
-
Andrew Leung,
Minglong Shao,
Timothy Bisson,
Shankar Pasupathy,
Ethan L. Miller,
Spyglass: Metadata Search for Large-Scale Storage Systems,
;login: — The USENIX Magazine 34(3),
June 2009.
-
Andrew Leung,
Minglong Shao,
Timothy Bisson,
Shankar Pasupathy,
Ethan L. Miller,
Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems,
Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST '09),
February 2009.
2008
-
Andrew Leung,
Ethan L. Miller,
Scalable Full-Text Search for Petascale File Systems,
Proceedings of the 2008 Petascale Data Storage Workshop (PDSW 08),
November 2008.
-
Sasha Ames,
Carlos Maltzahn,
Ethan L. Miller,
Quasar: A Scalable Naming Language for Very Large File Collections,
Technical Report UCSC-SSRC-08-04,
October 2008.
-
Sasha Ames,
Carlos Maltzahn,
Ethan L. Miller,
QUASAR: Interaction with File Systems Using a Query and Naming Language,
Technical Report UCSC-SSRC-08-03,
September 2008.
-
Andrew Leung,
Minglong Shao,
Timothy Bisson,
Shankar Pasupathy,
Ethan L. Miller,
High-Performance Metadata Indexing and Search in Petascale Data Storage Systems,
Proceedings of the SciDAC 2008 Conference,
July 2008.
-
Andrew Leung,
Shankar Pasupathy,
Garth Goodson,
Ethan L. Miller,
Measurement and Analysis of Large-Scale Network File System Workloads,
Proceedings of the 2008 USENIX Technical Conference,
June 2008.
-
Andrew Leung,
Minglong Shao,
Timothy Bisson,
Shankar Pasupathy,
Ethan L. Miller,
Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems,
Technical Report UCSC-SSRC-08-01,
May 2008.
-
Jonathan Koren,
Yi Zhang,
Xue Liu,
Personalized Interactive Faceted Search,
Proceedings of the 17th International Conference on the World Wide Web (WWW 2008),
April 2008.
2007
-
Jonathan Koren,
Yi Zhang,
Sasha Ames,
Andrew Leung,
Carlos Maltzahn,
Ethan L. Miller,
Searching and Navigating Petabyte Scale File Systems Based on Facets,
Proceedings of the 2007 ACM Petascale Data Storage Workshop (PDSW 07),
November 2007.
-
Deepavali Bhagwat,
Kave Eshghi,
Pankaj Mehra,
Content-based Document Routing and Index Partitioning for Scalable Similarity-based Searches in a Large Corpus,
Proceedings of the 13th ACM SIGKDD international conference on Knowledge Discovery and Data Mining (KDD '07),
August 2007, pages 105-112.
-
Yi Zhang,
Jonathan Koren,
Efficient Bayesian Hierarchical User Modeling for Recommendation Systems,
Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07),
July 2007, pages 47-54.
-
Carlos Maltzahn,
Nikhil Bobb,
Mark W. Storer,
Damian Eads,
Scott A. Brandt,
Ethan L. Miller,
Graffiti: A Framework for Testing Collaborative Distributed Metadata,
Proceedings in Informatics 21,
March 2007, pages 97–111.
-
Mark W. Storer,
Graffiti Server - Design and Implementation,
Technical Report UCSC-SSRC-07-02,
January 2007.
Last modified 24 Oct 2012