Highlight

Understanding scale-dependent soft-error behavior of scientific applications

Achievement

Proposed a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. Employed machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale.

Significance and Impact

Demonstrated that this methodology drastically reduces both the number and the scale of the fault injection experiments that must be performed, and provides a validated way to study application fault behavior at large scale.

Research Details

  • Proposed a novel, automated methodology to analyze application fault behavior at large scale based on analysis performed at small scale. Sought answers to questions such as “If the fault behavior of an application running on 32 cores is known, can the behavior of the same application running on 4,096 cores be inferred?”, “Is it possible to model the fault behavior of an application at a scale that is not available at this time?”, and “Will strong-scaling applications be more vulnerable than weak-scaling applications at a certain scale?”
  • Employed machine learning techniques to build application fault behavior models that can be used to understand the resilience characteristics and vulnerability of scientific applications at large scale, once their behavior at small scale is known.
  • Demonstrated the effectiveness of the methodology and the precision of the fault-behavior models with several strong- and weak-scaling applications taken from the DOE proxy applications (LULESH and AMG) and the DOE Office of Science (LAMMPS).

Overview

When studying large-scale systems, researchers often face an additional complication: the scarcity of resources. Performing tens of thousands of fault injection experiments on a large-scale system is a complex and time-consuming task. Even assuming that the system is available for the entire duration of the fault injection campaign, the time required to perform all the experiments might be prohibitive. Additionally, researchers may be interested in analyzing application resilience to faults at a scale that is not yet available. For example, a pre-production system could be installed to provide a head start before the final, larger production system is installed (e.g., SummitDev and Summit).

This work develops a framework that automatically implements our methodology. The framework divides all the experiments into two disjoint sets: training and testing. The training set is used to build the application fault behavior models by running each machine learning algorithm on it. The framework then computes the accuracy of each model against the testing set. This methodology provides three main advantages: 1) it allows researchers to perform large-scale fault behavior analysis without allocating the full system; 2) it speeds up the fault injection campaign by reducing the number of experiments and by running small-scale experiments in parallel, as opposed to running full-size fault injection experiments sequentially; 3) it provides a validated way to perform fault behavior analysis on systems that are larger than the available ones.
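The training/testing workflow described above can be sketched roughly as follows. This is a minimal illustration, not the framework itself: the source does not name the machine learning algorithms or the features used, so the two learners here (a mean baseline and a least-squares line in log2 of the core count), the record layout (core count, observed failure rate), the helper names, and the synthetic data are all assumptions made for the example.

```python
import math
import random


def split_experiments(experiments, test_fraction=0.25, seed=0):
    """Shuffle the experiments and split them into two disjoint sets."""
    rng = random.Random(seed)
    shuffled = list(experiments)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training, testing)


def fit_mean(train):
    """Baseline learner (assumption): always predict the mean training rate."""
    mean = sum(rate for _, rate in train) / len(train)
    return lambda cores: mean


def fit_log_linear(train):
    """Assumed learner: least-squares line rate = a + b * log2(cores)."""
    xs = [math.log2(cores) for cores, _ in train]
    ys = [rate for _, rate in train]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda cores: a + b * math.log2(cores)


def mean_abs_error(model, test):
    """Accuracy metric (assumption): mean absolute error on the testing set."""
    return sum(abs(model(cores) - rate) for cores, rate in test) / len(test)


def synthetic_experiments(seed=1):
    """Made-up (cores, failure rate) records; NOT data from the study."""
    rng = random.Random(seed)
    data = []
    for cores in (1, 2, 4, 8, 16, 32):
        for _ in range(8):
            rate = 0.10 + 0.02 * math.log2(cores) + rng.gauss(0, 0.005)
            data.append((cores, rate))
    return data


LEARNERS = {"mean": fit_mean, "log_linear": fit_log_linear}

experiments = synthetic_experiments()
train, test = split_experiments(experiments)

# Build one model per learner on the training set, then score it on the testing set.
errors = {name: mean_abs_error(fit(train), test) for name, fit in LEARNERS.items()}
best = min(errors, key=errors.get)
```

Keeping the two sets disjoint is what makes the reported accuracy meaningful: each model is scored only on experiments it never saw during training.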

This study shows that the proposed machine learning-based models can precisely predict the fault behavior of scientific applications at large scale based on experiments at very small scale. In some cases (e.g., LAMMPS weak scaling), the model can predict the resilience of an application running on 4,096 cores based on experiments conducted on a single core.
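Predicting behavior at 4,096 cores from much smaller runs is an extrapolation far outside the training range, so one reasonable way to check it is to hold out the largest scale entirely instead of sampling test points at random. The sketch below illustrates that idea under loud assumptions: the numbers are invented for the example, the model is a simple line in log2 of the core count, and it is trained on several small scales rather than a single core, because a one-feature line cannot be fitted from one scale. The models in the study presumably use richer inputs than core count alone.

```python
import math


def fit_log_linear(points):
    """Least-squares fit of rate = a + b * log2(cores); returns (a, b)."""
    xs = [math.log2(cores) for cores, _ in points]
    ys = [rate for _, rate in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def predict(model, cores):
    """Evaluate the fitted line at a given core count."""
    a, b = model
    return a + b * math.log2(cores)


# Hypothetical (cores, fraction of runs with corrupted output) measurements.
# These values are illustrative only and do NOT come from the study.
observed = [(1, 0.100), (2, 0.121), (4, 0.139), (8, 0.161), (16, 0.180), (4096, 0.338)]

# Train only on the small-scale runs; keep the large-scale run for validation.
small_scale = [p for p in observed if p[0] <= 16]
held_out_cores, held_out_rate = observed[-1]

model = fit_log_linear(small_scale)
predicted = predict(model, held_out_cores)
relative_error = abs(predicted - held_out_rate) / held_out_rate
```

A scale-based hold-out like this measures how well the model extrapolates, which is the property that matters when the target system is not yet available.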