A team of researchers from Carnegie Mellon University (CMU), University of California Berkeley (UCB), Penn State University (PSU) and Oak Ridge National Laboratory (ORNL) developed a novel algorithm for resilient and communication-efficient parallel matrix multiplication in high-performance computing systems, called 3D Coded SUMMA. The algorithm combines storage-optimal matrix-multiplication (MatDot) codes with the 3D scalable universal matrix multiplication algorithm (SUMMA). It performs the communication-efficient parallel matrix multiplication and is able to recover from compute node failures using redundancy through coded computation. For tolerating two compute node failures, 3D Coded SUMMA requires 50% less redundancy than traditional replication and has an execution time overhead of only 5-10%.
Significance and Impact
Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is a key challenge in extreme-scale supercomputers. Current state-of-practice solutions, such as checkpoint/restart, algorithm-based fault tolerance and redundant computing, may not provide the necessary failure tolerance at a reasonable cost in high-performance computing systems that experience high failure rates, such as due to aging or unexpected reliability issues. The developed algorithm offers a new capability for such scenarios by providing the failure tolerance of traditional redundant computing at significantly lower cost. This work also applies the latest advances in coding theory to failure tolerant computing, opening up an entire new area of research.
- Developed the 3D Coded SUMMA novel resilient and communication-efficient parallel matrix multiplication algorithm.
- Implemented a proof-of-concept prototype of 3D Coded SUMMA as a Message Passing Interface (MPI) application.
- Evaluated the performance and failure resilience of the 3D Coded SUMMA proof-of-concept prototype on a 960-core ORNL Linux cluster.
A novel fault-tolerant parallel matrix multiplication algorithm, called 3D Coded SUMMA, has been developed that achieves higher failure tolerance than replication-based schemes for the same amount of redundancy. This performed research bridges the gap between recent developments in coded computing and fault tolerance in high-performance computing. The fundamental concept of coded computing is the same as traditional algorithm-based fault-tolerance, which is weaving redundancy in the computation using error-correcting codes. This integrates MatDot codes, an innovative code construction for parallel matrix multiplications, into the three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm) in a communication-avoiding manner. To tolerate any two node failures, the 3D Coded SUMMA requires 50% less redundancy than replication, while the overhead in execution time is only about 5-10%.
Last Updated: January 14, 2021 - 7:44 pm