Highlight

Comparative Analysis of Soft-Error Detection Strategies: A Case Study with Iterative Methods

Achievement

Presents the first comprehensive evaluation of four online soft error detection techniques for detecting the adverse impact of soft errors on iterative methods. To understand the potential for improved detection, the study also evaluates a machine-learning-based detector whose input features are the runtime features that the individual detectors observe to arrive at their conclusions.

Significance and Impact

This extensive evaluation demonstrates the need for designing error detectors to handle the evolutionary behavior exhibited by iterative solvers.

Research Details

  • Presents a comparative evaluation of four state-of-the-art online soft error detectors in the context of iterative methods through extensive single-bit and multi-bit fault injection experiments.
  • Tracks the evolution of the residual vector across fault-free method executions, fault injection experiments, and detector characterization experiments, totaling several million runs.
  • Designs an online detector based on an offline machine learning methodology. This work uses several supervised learning algorithms to create one model per iterative solver. Supervised learning requires a training set in which each sample carries a label, either innocuous or erroneous. Each label represents the ground truth, determined by comparison with the fault-free execution. The training sets are drawn from the fault-injection experiments performed while evaluating the selected state-of-the-art online soft error detectors.
  • Performs extensive fault injection experiments involving 28 data sets, five iterative methods, single- and multi-bit errors, and uniform and normal error distributions, totaling over 1.4 million fault-injection experiments. All detectors were evaluated using an identical fault-injection configuration, enabling a direct and unbiased comparison of their behavior.
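
As an illustration of the single-bit fault injection and ground-truth labeling described above, the following is a minimal Python sketch, not the study's actual tooling. The function names `flip_bit` and `label_run`, the choice of IEEE-754 double precision, and the `tolerance` threshold are all assumptions made for this example; the study states only that labels are determined by comparison with the fault-free execution.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 double encoding flipped.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the double as a 64-bit unsigned integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Toggle the chosen bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def label_run(faulty, fault_free, tolerance=1e-8):
    """Label a fault-injected result against the fault-free result.

    Assumed criterion (hypothetical): a run is "erroneous" if any entry is
    non-finite or differs from the fault-free entry by more than `tolerance`.
    """
    for a, b in zip(faulty, fault_free):
        if not math.isfinite(a) or abs(a - b) > tolerance:
            return "erroneous"
    return "innocuous"
```

In this sketch, flipping a low-order mantissa bit typically changes a value by a negligible amount, while flipping an exponent or sign bit changes it drastically, which is one reason the impact of a soft error ranges from innocuous to severe.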

Overview

Architectural trends such as technology scaling and near-threshold voltage operation are expected to make soft error resilience an important consideration in performance-oriented and power-constrained environments. Soft errors, transient bit flips that corrupt application state, can lead to application crashes, slowdown in execution, or silent data corruption. These challenges motivate techniques that detect the adverse impact of soft errors efficiently, accurately, and in a timely fashion, and that recover from them. Existing detectors employ a variety of techniques (curve fitting, machine learning, algorithm analysis, etc.) to flag observed behavior that deviates from predicted correct behavior as a potential error. Because these detectors have been developed and evaluated in diverse contexts, a comparative analysis of their effectiveness is difficult.

This study presents a comprehensive evaluation of the behavior of soft error detectors. It considers five iterative methods, 28 data sets, and multiple fault-injection scenarios. It evaluates flagging an error based on detector behavior either at a single iteration or over a sliding window of iterations. While each detector considered has been shown to be effective in a distinct context, extensive analysis of the evaluated configurations demonstrates that, in the context of iterative methods, none of them achieves perfect detection accuracy. Finally, the study introduces a machine-learning-based detector that uses the runtime features observed by the individual detectors. Although it improves on them, the machine-learning-based detector is still far from perfect in terms of accuracy. The study concludes that, in addition to new methods, additional features need to be incorporated to improve detection accuracy.
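
The difference between flagging at a single iteration and flagging over a sliding window can be sketched as follows. This is a hedged illustration only: the function name `first_flagged_iteration`, the parameters `window` and `min_alarms`, and the "at least k alarms within the last w iterations" rule are assumptions for this example, since the summary does not specify how the study combines per-iteration detector outputs.

```python
from collections import deque


def first_flagged_iteration(alarms, window=1, min_alarms=1):
    """Return the first iteration index at which an error is flagged.

    `alarms` holds one boolean per solver iteration (a detector's raw output).
    An error is flagged once at least `min_alarms` of the most recent `window`
    iterations raised an alarm. With window=1 and min_alarms=1 this reduces to
    single-iteration flagging. Returns None if no error is ever flagged.
    """
    if window < 1 or not 1 <= min_alarms <= window:
        raise ValueError("require window >= 1 and 1 <= min_alarms <= window")
    recent = deque(maxlen=window)  # keeps only the last `window` alarms
    for iteration, alarm in enumerate(alarms):
        recent.append(bool(alarm))
        if sum(recent) >= min_alarms:
            return iteration
    return None


# Hypothetical per-iteration detector output for one solver run.
alarms = [False, True, False, False, True, True, False]
single = first_flagged_iteration(alarms)                           # single iteration
windowed = first_flagged_iteration(alarms, window=3, min_alarms=2)  # sliding window
```

With this hypothetical input, the single-iteration rule fires on the isolated alarm at iteration 1, whereas the 2-of-3 window rule waits until iteration 5, when two alarms fall inside the same window. A window trades detection latency for robustness to isolated alarms.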

 

Last Updated: May 28, 2020 - 4:04 pm