A team of researchers from Oak Ridge National Laboratory (ORNL) designed, implemented and evaluated a high-performance computing (HPC) runtime system that uses the design pattern concept to orchestrate resilience capabilities for efficient protection against faults, errors and failures. Resilience design patterns are used as an abstraction to interface with and coordinate different parts of the supercomputing software stack and to enable tuning of the cost-benefit trade-offs between the performance overhead introduced by resilience solutions and the protection coverage they provide. This new runtime environment eliminates unnecessary overhead caused by inefficient resilience solutions or double protection, as well as coverage gaps caused by a lack of coordination between different parts of the system. The developed prototype was evaluated using a parallel linear solver application on 1024 processors by injecting permanent process failures and transient data corruptions.
Significance and Impact
Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is a key challenge in extreme-scale supercomputers. As high-performance computing (HPC) systems become increasingly complex, they require intricate solutions for detecting and mitigating the various modes of faults and errors that occur in these large-scale systems, as well as solutions for failure recovery. The developed pattern-oriented resilient runtime solution offers a novel way to orchestrate efficient resilience strategies based on HPC system and application reliability properties, resilience capabilities and resilience needs. It permits system designers and users to actively balance the cost-benefit trade-offs between performance overhead and protection coverage of different resilience solutions. The result is a resilient supercomputing software stack that is able to adapt to emerging reliability threats with efficient responses, delivering science through advanced computing with high productivity and correctness.
- Designed the architecture of the PLEXUS runtime system, which implements pattern instances to provide a resilient environment for HPC applications.
- Developed strategies for the resilience patterns to be instantiated, modified and destroyed by the runtime based on static and dynamic policies to meet the resiliency needs of HPC applications.
- Implemented a prototype and evaluated the costs and benefits of these runtime techniques by instantiating failure detection and recovery patterns for a large-scale parallel application.
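The pattern lifecycle described in these points (instantiate, modify, destroy under static and dynamic policies) can be illustrated with a minimal sketch. This is not the PLEXUS implementation or its interface: the class names, the pattern names, the parameters and the failure-rate thresholds below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PatternInstance:
    """One active resilience pattern (hypothetical representation)."""
    name: str
    params: dict = field(default_factory=dict)


class ToyRuntime:
    """Illustrative only: tracks which pattern instances are active."""

    def __init__(self, static_policy):
        # Static policy: patterns requested before the application starts.
        self.active = {}
        for name, params in static_policy.items():
            self.instantiate(name, **params)

    def instantiate(self, name, **params):
        self.active[name] = PatternInstance(name, dict(params))

    def modify(self, name, **params):
        self.active[name].params.update(params)

    def destroy(self, name):
        self.active.pop(name, None)

    def apply_dynamic_policy(self, failures_per_hour):
        """Dynamic policy: adapt instances to the observed failure rate.

        The thresholds and intervals are made-up values, not measurements.
        """
        name = "checkpoint_recovery"
        if failures_per_hour == 0:
            # Nothing is failing: remove the recovery pattern and its overhead.
            self.destroy(name)
            return
        # Checkpoint more often when failures are frequent.
        interval_s = 120 if failures_per_hour > 1.0 else 600
        if name in self.active:
            self.modify(name, interval_s=interval_s)
        else:
            self.instantiate(name, interval_s=interval_s)


runtime = ToyRuntime({"failure_detection": {"heartbeat_s": 5}})
runtime.apply_dynamic_policy(0.5)   # instantiates checkpoint_recovery
runtime.apply_dynamic_policy(2.0)   # modifies its checkpoint interval
runtime.apply_dynamic_policy(0)     # destroys it again
```

In this sketch the statically configured detection pattern stays active throughout, while the recovery pattern is created, tuned and removed as the observed failure rate changes.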
Citation and DOI
Saurabh Hukerikar and Christian Engelmann. PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems. In Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020, Perth, Australia, December 1-4, 2020.
For high-performance computing (HPC) system designers and users, meeting the myriad challenges of next-generation exascale supercomputing systems requires rethinking their approach to application and system software design. Among these challenges, providing resiliency and stability to scientific applications in the presence of high fault rates requires new approaches to software architecture and design. As HPC systems become increasingly complex, they require intricate solutions for detecting and mitigating the various modes of faults and errors that occur in these large-scale systems, as well as solutions for failure recovery. These resiliency solutions often interact with and affect other system properties, including application scalability, power and energy efficiency. Therefore, resilience solutions for HPC systems must be thoughtfully engineered and deployed.
In previous work, we developed the concept of resilience design patterns, which consist of templated solutions based on well-established techniques for detection, mitigation and recovery. In this work, we use these patterns as the foundation to propose new approaches to designing runtime systems for HPC systems. The instantiation of these patterns within a runtime system enables flexible and adaptable end-to-end resiliency solutions for HPC environments. We describe the architecture of the runtime system, named PLEXUS, and the strategies for dynamically composing and adapting pattern instances under runtime control. This runtime-based approach enables actively balancing the cost-benefit trade-off between performance overhead and protection coverage of the resilience solutions. Based on a prototype implementation of PLEXUS, we demonstrate the resiliency and performance gains achieved by the pattern-based runtime system for a parallel linear solver application on 1024 processors. For this evaluation, we inject permanent Message Passing Interface (MPI) process failures and transient corruptions in application data structures.
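To make the cost-benefit trade-off concrete, the sketch below picks the cheapest combination of patterns that covers a required set of fault classes, which avoids both coverage gaps and paying for double protection. It is a toy model, not the selection method used in PLEXUS: the candidate patterns, their overhead percentages and the fault class names are invented for illustration.

```python
from itertools import combinations

# Hypothetical candidates: (overhead in percent, fault classes covered).
# The numbers are illustrative and not taken from any measurement.
CANDIDATES = {
    "checkpoint_restart": (10, {"process_failure"}),
    "checksum_detection": (3, {"data_corruption"}),
    "full_replication": (100, {"process_failure", "data_corruption"}),
}


def cheapest_cover(required, candidates=CANDIDATES):
    """Return (total_overhead, pattern_names) for the cheapest combination
    that covers every fault class in `required`, or None if none exists."""
    best = None
    for size in range(0, len(candidates) + 1):
        for combo in combinations(sorted(candidates), size):
            covered = set().union(*(candidates[n][1] for n in combo))
            if not required <= covered:
                continue  # coverage gap: skip this combination
            cost = sum(candidates[n][0] for n in combo)
            if best is None or cost < best[0]:
                best = (cost, combo)
    return best


print(cheapest_cover({"process_failure", "data_corruption"}))
```

With these made-up numbers, combining checkpointing with checksums (13 percent total) is preferred over full replication (100 percent), even though either choice covers both fault classes. The exhaustive search is only practical for a handful of candidates.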
Last Updated: January 17, 2021 - 3:39 pm