Highlight

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

Pattern composition for detection, containment and mitigation for soft and process failure resilience in GMRES solver.

Achievement

Demonstrated the benefit of using design patterns to evaluate and refine multiple multiresilience solutions. Explored the relationship between soft and hard error resilience approaches using pattern-based performance modeling. Developed and evaluated multiresilience solutions for a linear solver application based on the iterative generalized minimal residual (GMRES) method.

Significance and Impact

The pattern-based design approach advances the ability of scientific applications executed on high-performance computing (HPC) systems to reach accurate solutions in a timely and efficient manner. The use of resilience design patterns facilitates exploration of the resilience-performance trade-off design space by coordinating solutions across multiple layers of the HPC system stack.

Research Details

  • Demonstrated the use of design patterns to systematically explore techniques with diverse performance and reliability characteristics, and to design comprehensive multiresilience solutions through composition of patterns.
  • Designed a cross-layer multiresilience solution, from conception to implementation, for the GMRES solver by instantiating algorithmic patterns for soft error resilience and process failure recovery, together with patterns specific to the Message Passing Interface (MPI) layer for process failure detection and containment.
  • Evaluated pattern-based multiresilience solutions for the GMRES solver to assess, characterize and minimize the interdependencies between patterns for soft and hard error resilience.
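The composition of detection, containment and mitigation described in the list above can be pictured with a small sketch. It is only an illustration under stated assumptions: it uses a scalar Richardson iteration instead of GMRES, a single process instead of MPI, and an injected one-time fault. The names `run_with_patterns`, `make_step`, `detect` and `residual` are hypothetical and do not come from the work described here.

```python
import copy

def run_with_patterns(step, detect, state, n_iters, ckpt_interval):
    """Iterate `step`, composing three hypothetical resilience patterns."""
    checkpoint, ckpt_iter = copy.deepcopy(state), 0  # mitigation: saved state
    rollbacks = 0
    i = 0
    while i < n_iters:
        candidate = step(state, i)
        if detect(state, candidate):          # detection pattern
            # Containment: the suspect candidate is never committed.
            # Mitigation: roll back to the last checkpoint and replay.
            state, i = copy.deepcopy(checkpoint), ckpt_iter
            rollbacks += 1
            continue
        state = candidate
        i += 1
        if i % ckpt_interval == 0:
            checkpoint, ckpt_iter = copy.deepcopy(state), i
    return state, rollbacks

# Toy problem (assumed for illustration): solve A * x = B with A = 2, B = 6.
A, B, OMEGA = 2.0, 6.0, 0.25

def residual(x):
    return abs(B - A * x)

def make_step(fault_at=None):
    """Return a Richardson step; optionally corrupt one iteration once."""
    pending = {"fault": fault_at is not None}
    def step(x, i):
        x_new = x + OMEGA * (B - A * x)
        if pending["fault"] and i == fault_at:
            pending["fault"] = False          # transient: not repeated on replay
            x_new += 1000.0                   # injected soft error
        return x_new
    return step

def detect(x_old, x_new):
    """Algorithm-specific check: the residual of this iteration never grows."""
    return residual(x_new) > residual(x_old)
```

Calling `run_with_patterns(make_step(fault_at=5), detect, 0.0, 20, 2)` replays the two iterations since the last checkpoint after the corrupted update is flagged. The sketch assumes transient faults; a fault that recurs on every replay would need a retry limit.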

Overview

High-performance computing (HPC) applications are affected by multiple types of errors occurring on HPC systems, which hinder both their ability to make forward progress and their correctness. The errors are broadly categorized into soft errors, which cause silent data corruption (SDC), and hard errors, which cause process failures or a fatal application crash. Because prior work provides resilience to only a single type of error, this work develops efficient multiresilience solutions that target both soft and hard errors in HPC applications. This requires systematic integration of multiple independent techniques that encapsulate detection, containment, and mitigation of each error type with minimal impact on application performance. To this end, we use resilience design patterns instantiated across multiple layers of the system stack. Each design pattern provides a generalizable solution to a recurring problem. The overhead of a multiresilience implementation is reduced by formally defining the interfaces among patterns and their coordination mechanisms, as opposed to an implementation based on naïve stacking of patterns. We navigate the performance-resilience trade-off space by evaluating the overheads of multiple multiresilience solutions in a linear solver application. The most efficient solution is formed by instantiating algorithmic patterns to work in concert with patterns incorporated from the communication layer. The pattern-based approach enables a globally optimal multiresilience solution and avoids over-protection. For example, the process failure recovery provided by the checkpoint-restart pattern can interact with the soft error detection pattern as follows: a low-overhead detector increases the chance that the solver converges slowly (an increase in time-to-solution), so more checkpoints are taken than in a soft-error-free case.
Conversely, using a high-overhead soft error detector to lower the chance of slow convergence was not found to be worthwhile in our experiments, because combining an algorithm-specific detector with checkpointing of dynamic state yields a lower overall overhead. Overall, the work demonstrates a structured, performance-oriented design approach for identifying alternative patterns for detection, containment and mitigation of specific types of errors, in which efficient multiresilience solutions are architected through iterative refinement of the relationships among patterns to optimize end-to-end application performance.
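The interaction between detector cost, convergence and checkpoint count can be made concrete with a back-of-the-envelope model. The sketch below is a simplification for illustration, not the performance model used in this work: the function `time_to_solution`, its parameters, and every numeric value are hypothetical placeholders chosen only to make the arithmetic visible, not measured results.

```python
def time_to_solution(n_iters, t_iter, t_detect, ckpt_interval, t_ckpt):
    """Toy cost model: per-iteration work plus detection, plus checkpoints.

    All parameters are in the same arbitrary time unit. One checkpoint is
    assumed to be written every `ckpt_interval` iterations.
    """
    n_ckpts = n_iters // ckpt_interval
    return n_iters * (t_iter + t_detect) + n_ckpts * t_ckpt

# Placeholder scenario A: a cheap detector misses some errors, so the solver
# is assumed to need 10% more iterations, which also adds one checkpoint.
low_overhead = time_to_solution(
    n_iters=1100, t_iter=1.0, t_detect=0.01, ckpt_interval=100, t_ckpt=5.0
)

# Placeholder scenario B: an expensive detector keeps the iteration count at
# its assumed error-free value but pays more on every iteration.
high_overhead = time_to_solution(
    n_iters=1000, t_iter=1.0, t_detect=0.30, ckpt_interval=100, t_ckpt=5.0
)
```

With these placeholder values the cheap detector still gives the lower total, even after paying for the extra iterations and the extra checkpoint. The comparison flips once the assumed slowdown in convergence or the checkpoint cost becomes large enough, which is why the trade-off has to be evaluated per application.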

Last Updated: May 28, 2020 - 4:04 pm