SSRC talk: Dealing with Asymmetry in Large-Scale Data Analytics Clusters (Daniel Faria, Aster Data)
Daniel Faria, Aster Data Systems
High levels of asymmetry can have a significant impact on the performance of queries in large-scale data analytics clusters. As in any other distributed system performing scatter-gather computations, the running times of such queries are dictated by the performance of the slowest node. With many factors contributing to the overall system asymmetry, the performance gap between the fastest and slowest nodes can be substantial, leading to poor resource utilization.
In the first part of this talk, we will demonstrate that planned load balancing will never work in the real world. First, we will show that there are far too many factors introducing performance asymmetry, from standard factors such as heterogeneous hardware and data skew to dynamic factors such as workload-dependent imbalances and masked failures. These factors are also distributed across many layers in the system, from disk subsystems and network protocols to database applications, making any sort of performance prediction impractical. We will also show that masked failures can create considerable imbalance and that authoritative information about the failed components will not always be available.
We will then argue that building introspective systems is likely the only viable solution to deal with asymmetry dynamically in clusters with hundreds or thousands of nodes. We believe that large-scale systems will always exhibit non-trivial levels of asymmetry, and that systems need to monitor their components and robustly identify performance anomalies so that these levels can be reduced at runtime. We will present several challenges related to building reliable introspective systems and our current ideas on how to address them.
Daniel Faria is a Member of Technical Staff at Aster Data Systems, having joined the company in October 2006. He graduated from Stanford University in 2006 with a PhD in Computer Science, where he worked with Professor David Cheriton on using location-based services to improve security in wireless LANs. Daniel also received an M.S. in Computer Science from Stanford, and M.S. and B.S. degrees in Computer Science from the Federal University of Minas Gerais in Brazil.
Wednesday, December 5, 2007 at 12:00 PM
Miller, Ethan L.