Storage QoS @ SSRC

Storage QoS

This project is no longer active. Information is still available below.

Storage systems for large, distributed clusters of compute servers are themselves large and distributed. Their complexity and scale make these systems hard to manage and, in particular, make it hard to ensure that the applications using them get good, predictable performance. At the same time, shared access by multiple applications and users, together with competition from internal system activities, increases the need for predictable performance.

This project investigates ways to improve performance in large distributed storage systems through mechanisms that integrate the performance aspects of the whole path that I/O operations take through the system: from the application interface on the compute server, through the network, to the storage servers. We focus on five parts of the I/O path in a distributed storage system: I/O scheduling at the storage server, storage server cache management, client-to-server network flow control, client-to-server connection management, and client cache management.

Much of the existing work on QoS for storage considers management of the individual elements of the path that I/Os take through a storage system, but little of it considers end-to-end management of the whole path. The problem with naively chaining multiple management algorithms along the path (e.g., one algorithm for the network and another for the storage server) is that the emergent behaviors that arise from such chaining can reduce the overall performance of the system. Also, much of the existing work is specific to continuous media and other applications with periodic real-time I/O workloads, as opposed to applications with general workloads.

The unifying idea in this project is that the storage server should control data movement between clients and the server. Only the storage server has knowledge of the I/O demands across all its clients. The server is also more likely to contain a bottleneck resource than any individual client is. Accordingly, the server can make I/O scheduling decisions to balance client usage, can manage cache space taking into account the workload from all clients contending for the cache, and can manage the network flow.

The techniques build on our current I/O scheduling work, which allows applications to specify quality of service for I/O sessions. The QoS includes a reserved (minimum) performance, a limit on performance, and fair sharing of extra performance among sessions. The project extends the scheduling work to improve disk utilization, and then uses QoS and utilization information to guide cache management and network flow control decisions. We also investigate how we can use machine learning techniques to predict near-future resource demand in order to handle those clients that are connected over long-latency links.
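
To make the three QoS parameters concrete, the following is a minimal sketch of how a server might divide a fixed capacity among sessions that each carry a reservation, a limit, and a current demand. It is an illustration only, not the project's scheduler: the Session class, the allocate function, the use of equal shares for "fair sharing", and the numbers in the usage example are all assumptions introduced here. The sketch also assumes that each reservation is no larger than its limit and that the reservations together fit within the capacity.

```python
from dataclasses import dataclass

@dataclass
class Session:
    # Hypothetical per-session QoS record; all rates share one unit (e.g. IOPS).
    name: str
    reserve: float  # reserved (minimum) performance
    limit: float    # upper bound on performance (assumed >= reserve)
    demand: float   # load the session currently offers

def allocate(sessions, capacity):
    """Split capacity: reservations first, then equal shares of the spare."""
    # Step 1: every session receives its reservation, or less if it asks for less.
    alloc = {s.name: min(s.demand, s.reserve) for s in sessions}
    spare = capacity - sum(alloc.values())
    if spare < 0:
        raise ValueError("reservations exceed capacity")

    # Step 2: hand out spare capacity in equal shares (an assumed notion of
    # fairness), never exceeding a session's limit or its demand.
    want = {s.name: min(s.demand, s.limit) - alloc[s.name] for s in sessions}
    active = [name for name, w in want.items() if w > 0]
    eps = 1e-9
    while spare > eps and active:
        share = spare / len(active)
        for name in list(active):
            give = min(share, want[name])
            alloc[name] += give
            want[name] -= give
            spare -= give
            if want[name] <= eps:
                active.remove(name)  # satisfied or capped; leftover is re-shared
    return alloc

if __name__ == "__main__":
    sessions = [
        Session("simulation", reserve=100, limit=300, demand=500),
        Session("visualization", reserve=50, limit=1000, demand=120),
        Session("maintenance", reserve=200, limit=400, demand=100),
    ]
    for name, rate in allocate(sessions, capacity=600).items():
        print(f"{name}: {rate:.0f}")
```

In this example the simulation session is capped by its limit, the visualization session receives its full demand, and the maintenance session uses less than its reservation, so part of the capacity stays unallocated. A real scheduler would make these decisions continuously and per request rather than in one batch.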

These techniques should help storage systems to scale to support the compute clusters currently being planned. Large scale means sharing, both within one application and between applications. Performance management ensures that each application or client gets good performance. For example, when many nodes are computing simulation data and other nodes are visualizing that data, the two can proceed without interference. Large scale also means there will always be system maintenance going on to handle failure and replacement. Performance management ensures that maintenance can proceed without interfering with applications.

Status

We are in the process of building a new Linux device driver, Fahrrad, to implement disk I/O scheduling. This driver builds on the RBED CPU scheduler and the RAD real-time model, and on our experience with the Zygaria I/O scheduler driver.

Faculty

Associates

Theodore Wong

Publications

Date		Publication
Oct 1, 2020		Oceane Bel, Kenneth Chang, Nathan Tallent, Dirk Duellman, Ethan L. Miller, Faisal Nawab, Darrell D. E. Long, Geomancy: Automated Performance Enhancement through Data Layout Optimization, Proceedings of the Conference on Mass Storage Systems and Technologies (MSST '20), October 2020. [Scalable High-Performance QoS] [Prediction and Grouping] [Storage QoS]
Sep 19, 2016		Yan Li, Yash Gupta, Ethan L. Miller, Darrell D. E. Long, Pilot: A Framework that Understands How to Do Performance Benchmarks The Right Way, Proceedings of the 24th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2016), September 2016. [Tracing and Benchmarking] [Ultra-Large Scale Storage] [Storage QoS]
Jun 2, 2015		Yan Li, Xiaoyuan Lu, Ethan L. Miller, Darrell D. E. Long, ASCAR: Automating Contention Management for High-Performance Storage Systems, 31st International Conference on Massive Storage Systems and Technologies (MSST2015), June 2015. [Scalable High-Performance QoS] [Ultra-Large Scale Storage] [Storage QoS]
Aug 1, 2008		Brad Smith, J.J. Garcia-Luna-Aceves, Best Effort Quality-of-Service, 17th International Conference on Computer Communications and Networks (ICCCN '08), August 2008. [Storage QoS]
Sep 24, 2007		Joel Wu, Scott A. Brandt, Providing Quality of Service Support in Object-Based File System, Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), September 2007. [Ultra-Large Scale Storage] [Storage QoS]
May 16, 2006		Joel Wu, Scott A. Brandt, The Design and Implementation of AQuA: an Adaptive Quality of Service Aware Object-Based Storage Device, Proceedings of the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage Systems and Technologies, May 2006, pages 209-218. [Ultra-Large Scale Storage] [Storage QoS]
Jun 14, 2005		Joel Wu, Scott A. Brandt, Hierarchical Disk Sharing for Multimedia Systems and Servers, Proceedings of the 15th ACM International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 2005), June 2005, pages 189-194. [Ultra-Large Scale Storage] [Storage QoS]
Jan 20, 2005		Joel Wu, Scott Banachowski, Scott A. Brandt, Automated QoS Support for Multimedia Disk Access, Proceedings of the SPIE, Multimedia Computing and Networking (MMCN 2005), January 2005, pages 103-107. [Storage QoS]
May 26, 2004		Joel Wu, Scott A. Brandt, Storage Access Support for Soft Real-Time Applications, Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2004), May 2004. [Ultra-Large Scale Storage] [Storage QoS]

Last modified 23 May 2019
