Project Status: Inactive
Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is one of the key challenges in extreme-scale supercomputing. Extreme heterogeneity, i.e., using multiple, and potentially configurable, types of processors, accelerators, and memory/storage in a single computing platform, adds significant complexity to the supercomputer hardware/software ecosystem. Errors and failures reported by such heterogeneous hardware will need to be handled by the appropriate software component to enable efficient masking, recovery, and avoidance with little burden on the user.
This project takes a first step toward resilience in leadership-class supercomputers with extreme heterogeneity. It performs research to enable fine-grain resilience for graphics processing unit (GPU) accelerated systems, such as ORNL’s Summit, that is more efficient than traditional application-level checkpoint/restart. The approach centers on a novel concept for Quality of Service (QoS) and corresponding extensions for the OpenMP parallel programming model. This project develops (1) error and failure models, (2) software resilience strategies and protection domains, (3) OpenMP QoS language extensions for resilience, (4) OpenMP QoS runtime extensions and policies for resilience, and (5) a proof-of-concept prototype demonstrating these capabilities on Summit.
The ultimate goal is to make fault resilience an integral part of the supercomputer hardware/software ecosystem, such that the burden for providing it is on the system by design and not on the user as an afterthought.