Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Project Status: Active

Extreme-scale, high-performance computing (HPC) significantly advances discovery in fundamental scientific processes by enabling multiscale simulations that range from the very small, on quantum and atomic scales, to the very large, on planetary and cosmological scales. Computing at scales of hundreds of petaflops, exaflops (quintillions, or billion billions, of operations per second), and beyond will also lend a competitive advantage to the US energy and industrial sectors by providing the computing power for rapid design and prototyping and for big data analysis.

The US Department of Energy (DOE) cites several key challenges to building and effectively operating extreme-scale HPC systems. One of them is resilience: efficient and correct operation despite the occurrence of faults or defects in system components that can cause errors. These innovative systems require equally innovative components designed to communicate and compute at unprecedented rates, scales, and levels of complexity, increasing the probability of hardware and software faults.

This DOE Early Career research project offers a structured hardware and software design approach for improving resilience in extreme-scale HPC systems so that scientific applications running on these systems generate accurate solutions in a timely and efficient manner. Frequently used in computer engineering, design patterns identify problems and provide generalized solutions through reusable templates.

Using a novel resilience design pattern concept, this project identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout hardware and software components in HPC systems. This effort will create comprehensive methods and metrics by which system vendors and computing centers can establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components and optimize the cost-benefit trade-offs among performance, resilience, and power consumption. Reusable programming templates of these patterns will offer resilience portability across different HPC system architectures and permit design space exploration and adaptation to different design trade-offs.
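
To make the idea of a reusable programming template more concrete, the following is a minimal sketch of one familiar resilience technique, checkpoint and rollback, written as a generic Python function. It is an illustration only and is not taken from the project: the names `TransientFault`, `run_with_rollback`, and `make_flaky_step` are hypothetical, the state is assumed to be a small in-memory object that can be deep-copied, and a fault is assumed to be reported as an exception.

```python
import copy


class TransientFault(Exception):
    """Hypothetical error type standing in for a detected fault."""


def run_with_rollback(state, step, n_steps, max_retries=3):
    """Apply `step` n_steps times, restoring the last checkpoint on a fault.

    `step(state, i)` may mutate `state` and must return the new state.
    """
    for i in range(n_steps):
        # Save a known-good copy of the state before attempting the step.
        checkpoint = copy.deepcopy(state)
        for _attempt in range(max_retries + 1):
            try:
                state = step(state, i)
                break  # step succeeded, move on to the next one
            except TransientFault:
                # Discard any partial update and retry from the checkpoint.
                state = copy.deepcopy(checkpoint)
        else:
            # Every attempt raised a fault: give up on this step.
            raise RuntimeError(f"step {i} failed after {max_retries} retries")
    return state


def make_flaky_step():
    """Build a step that corrupts the state and faults once, at i == 2."""
    already_failed = {"value": False}

    def step(state, i):
        state["total"] += i  # partial update happens before the fault
        if i == 2 and not already_failed["value"]:
            already_failed["value"] = True
            raise TransientFault("injected fault")
        return state

    return step


result = run_with_rollback({"total": 0}, make_flaky_step(), n_steps=4)
print(result)
```

The point of the sketch is that the recovery logic lives in `run_with_rollback` and is independent of what `step` computes, so the same template could in principle wrap different computations. A real HPC implementation would have to address checkpoint storage, coordination across processes, and fault detection, none of which are modeled here.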

Visit the Project Website

About the DOE Early Career Program

The program, started in 2009, supports the development of individual research programs of outstanding scientists early in their careers and stimulates research careers in the disciplines supported by the DOE Office of Science.

Under the program, university-based researchers receive at least $150,000 per year to cover summer salary and research expenses. For researchers based at DOE national laboratories, where DOE typically covers full salary and expenses of laboratory employees, grants will be at least $500,000 per year to cover year-round salary plus research expenses. The research grants are planned for five years.

To be eligible for the DOE award, a researcher must be an untenured, tenure-track assistant or associate professor at a US academic institution or a full-time employee at a DOE national laboratory, and must have received a Ph.D. within the past 10 years. Research topics are required to fall within one of the six major program offices of the DOE Office of Science: Advanced Scientific Computing Research (ASCR); Biological and Environmental Research (BER); Basic Energy Sciences (BES); Fusion Energy Sciences (FES); High Energy Physics (HEP); and Nuclear Physics (NP).

Visit the Early Career Research Program Website