Highlight

Characterization of the Impact of Soft Errors on Iterative Methods

Achievement

Presents the first comprehensive characterization of the impact of soft errors on the convergence characteristics of two iterative methods using application-level fault injection.

Significance and Impact

This characterization can aid the design of fault injection campaigns that ensure systematic coverage. In addition, it helps identify types of errors that should be targeted by an effective resilience scheme.

Research Details

  • This study systematically characterizes the behavior of two iterative methods, conjugate gradient (CG) and biconjugate gradient stabilized (BiCGSTAB), in the presence of soft errors. These methods are exemplars of an important class of methods used to solve systems of equations and constitute the core kernel in many large-scale scientific applications.
  • It employs a deterministic error-injection strategy to systematically explore the space of possible error behaviors. It considers 1-, 2-, and 4-bit error injections under uniform and beta distributions of the bit positions affected by the error.
  • The main outcome of this work is that a large fraction (> 50%) of soft errors are masked by the evaluated iterative methods, quantitatively demonstrating that iterative methods are naturally resilient to soft errors.
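
The bit-level injections described above can be illustrated with a minimal sketch. It assumes 64-bit IEEE 754 doubles and uses a seeded generator so that a run can be repeated; the function names, the beta shape parameters, and the mapping from a beta sample to a bit position are illustrative assumptions, not the study's actual injection tool or settings.

```python
import random
import struct


def flip_bits(value, positions):
    """Return `value` with the given bit positions inverted.

    Position 0 is the least significant mantissa bit and position 63
    is the sign bit of a 64-bit IEEE 754 double.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    for p in positions:
        bits ^= 1 << p
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits))
    return flipped


def sample_positions(n_bits, rng, dist="uniform", alpha=2.0, beta=5.0):
    """Draw `n_bits` distinct bit positions in the range 0..63.

    `alpha` and `beta` are hypothetical shape parameters; the beta
    sample in [0, 1] is scaled onto the 64 bit positions.
    """
    chosen = set()
    while len(chosen) < n_bits:
        if dist == "uniform":
            p = rng.randrange(64)
        elif dist == "beta":
            p = min(63, int(rng.betavariate(alpha, beta) * 64))
        else:
            raise ValueError("dist must be 'uniform' or 'beta'")
        chosen.add(p)
    return sorted(chosen)


if __name__ == "__main__":
    rng = random.Random(2020)  # fixed seed: the same faults on every run
    for dist in ("uniform", "beta"):
        for n_bits in (1, 2, 4):
            positions = sample_positions(n_bits, rng, dist)
            print(dist, n_bits, positions, flip_bits(1.0, positions))
```

Fixing the seed, or enumerating positions instead of sampling them, keeps an injection campaign reproducible, which is one way to approximate the systematic coverage the study aims for.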

Overview

A broad array of techniques has been designed to understand application behavior under soft errors and to detect, isolate, and correct soft-error-impacted application state. The first step toward tolerating soft errors is understanding an application's behavior under them, which in turn clarifies whether error detection/correction techniques are needed. An ideal error detection/correction strategy identifies all and only the errors that can materially impact application behavior. Detecting and recovering from errors that would eventually be masked by the application can unnecessarily increase the cost of soft error resilience. Different portions of the application state might be impacted differently by a soft error, enabling optimizations and data-structure-specific resilience techniques.

This work considers the use of iterative methods to incrementally solve a linear system of equations, which constitutes the core kernel in many scientific applications. It analyzes the impact of soft errors in terms of the type of error (single- vs multi-bit), the distribution and location of bits affected, the data structure and the statement impacted, and variation with time. This study observes the following:

  • The error-induced behavior of solvers varies widely.
  • The comparative behavior of the solvers varies widely with the data sets chosen. Therefore, a large number of data sets should be chosen for meaningful analysis.
  • Not all vectors are equally impacted by soft errors. In many cases, the solution vector x behaves noticeably differently.
  • The studied solvers are more resilient to errors in the mantissa than to errors in other portions of the floating-point number (the sign and exponent bits).
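
The kind of experiment behind these observations can be sketched as follows: a plain CG solver on a small test system, with a single bit flipped in one element of a chosen vector at a chosen iteration. Everything here is an assumption made for illustration (the 1D Laplacian test matrix, the tolerance, the `fault` tuple layout, and the function names); it is not the study's code or data sets.

```python
import math
import struct


def flip_bit(value, position):
    """Invert one bit of a 64-bit double (0 = lowest mantissa bit, 63 = sign)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << position)))
    return flipped


def laplacian_matvec(v):
    """Multiply by tridiag(-1, 2, -1), a small symmetric positive definite matrix."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_with_fault(b, fault=None, tol=1e-10, max_iter=200):
    """Run CG from x = 0; `fault` is (iteration, vector_name, index, bit) or None."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rr = dot(r, r)
    b_norm = math.sqrt(dot(b, b))
    iterations = 0
    for k in range(max_iter):
        if math.sqrt(rr) <= tol * b_norm:
            break  # converged according to the recurrence residual
        if fault is not None and fault[0] == k:
            _, name, idx, bit = fault
            target = {"x": x, "r": r, "p": p}[name]
            target[idx] = flip_bit(target[idx], bit)
        Ap = laplacian_matvec(p)
        pAp = dot(p, Ap)
        if not math.isfinite(pAp) or pAp == 0.0:
            break  # breakdown, e.g. after a flip that produced inf or nan
        alpha = rr / pAp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iterations = k + 1
    # Recompute the residual from x so that corruption of x is visible.
    true_r = [bi - axi for bi, axi in zip(b, laplacian_matvec(x))]
    true_rel = math.sqrt(dot(true_r, true_r)) / b_norm
    return {"iterations": iterations, "true_rel_residual": true_rel}


if __name__ == "__main__":
    b = [1.0] * 20
    print("fault-free     ", cg_with_fault(b))
    print("p, mantissa bit", cg_with_fault(b, fault=(3, "p", 10, 0)))
    print("p, exponent bit", cg_with_fault(b, fault=(3, "p", 10, 61)))
    print("x, sign bit    ", cg_with_fault(b, fault=(3, "x", 10, 63)))
```

In this particular implementation the residual is updated by recurrence, so x never feeds back into r or p: a flip in x leaves the iteration count unchanged and shows up only in the recomputed residual. That is a property of the sketch, offered as one plausible reason a solution vector can behave differently from the other vectors, not as a result reported by the study.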

 

Last Updated: May 28, 2020 - 4:04 pm