
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

Figure 1: Distribution of temperature (top) and power consumption (bottom) for GPGPUs without single bit errors (left) and with (right).

Achievement

Developed an understanding of general-purpose graphics processing unit (GPGPU) errors in ORNL's Titan supercomputer. Analyzed large amounts of system data to characterize GPGPU temperature, power consumption, workloads, and the distribution of single-bit memory errors. Created machine-learning-based techniques that exploit these insights for error prediction.

Significance and Impact

Understanding GPGPU errors in large-scale heterogeneous supercomputers is essential to provide the highest possible performance and reliability. This work provides new insights for ORNL’s Titan system that can be applied to improve the operation of Titan and other current and future supercomputers through changes in GPGPU usage and design.

Research Details

  • Analyzed 5 months (February 2015 – June 2015) of Titan’s system logs, specifically focusing on GPGPU temperature, power consumption, and single-bit memory errors, which are detected and corrected by error-correcting code (ECC) memory.
  • Characterized the temporal and spatial locality of temperature increases, power consumption increases, and single-bit memory errors, and their correlation.
  • Developed and evaluated a single-bit memory error predictor using machine learning that relies on temporal and spatial locality features that correlate with single-bit memory errors.
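The temporal and spatial locality features described in the bullets above can be sketched as follows. This is a hypothetical illustration, not Titan's actual log schema: the column names (`temp_c`, `sbe_count`), the hourly granularity, and the two-node "neighborhood" are all illustrative assumptions.

```python
# Illustrative sketch: building temporal (lagged) and spatial (neighbor)
# features from per-node GPGPU logs. Column names and layout are assumed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
log = pd.DataFrame({
    "node": np.repeat(["n0", "n1"], 24),          # two hypothetical nodes
    "hour": np.tile(np.arange(24), 2),
    "temp_c": rng.normal(60, 5, 48),              # GPU temperature readings
    "sbe_count": rng.poisson(0.1, 48),            # single-bit errors per hour
})

# Temporal locality: lagged readings from the same node's recent history.
log = log.sort_values(["node", "hour"])
g = log.groupby("node")
log["temp_lag1"] = g["temp_c"].shift(1)
log["sbe_lag1"] = g["sbe_count"].shift(1)

# Spatial locality: the other node's reading in the same hour, as a
# stand-in for cabinet/cage neighbors. With exactly two nodes per hour,
# (hourly mean * 2 - own reading) recovers the neighbor's value.
hour_mean = log.groupby("hour")["temp_c"].transform("mean")
log["neighbor_temp"] = hour_mean * 2 - log["temp_c"]

print(log[["node", "hour", "temp_lag1", "sbe_lag1", "neighbor_temp"]].head())
```

In a real deployment these features would be joined from separate temperature, power, and ECC-error log streams before being fed to a classifier.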

Overview

In this work, we discover that workload characteristics, specific GPGPU cards, temperature, and power consumption can be predictive of, or associated with, GPGPU errors, but that exploiting these relationships for error prediction is non-trivial. Motivated by these observations and challenges, we explore a machine-learning-based error prediction model that captures hidden interactions among system and workload properties.

This work details the challenges, process, and solutions involved in building an effective machine-learning-based error predictor. In particular, we show how to systematically select features from a massive candidate set by categorizing them along spatial and temporal dimensions. We then learn the desired prediction function in a generic yet meaningful way. We also address the imbalanced-dataset challenge and the trade-offs involved in applying various machine learning models, including Logistic Regression (LR), Gradient Boosting Decision Tree (GBDT), Support Vector Machine (SVM), and Neural Network (NN).
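A minimal sketch of the imbalanced-training step, assuming a scikit-learn GBDT: single-bit errors are rare, so the positive class is up-weighted when fitting. The synthetic features and the specific weighting scheme here are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch: training a GBDT error predictor on an imbalanced
# dataset using scikit-learn. Features are synthetic stand-ins for node-level
# signals (e.g., recent temperature, power draw, neighbor error counts).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
# Rare positive class (a few percent): single-bit errors are infrequent.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 2.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Counter the class imbalance by up-weighting the rare positive examples
# in proportion to the negative/positive ratio.
w = np.where(y_tr == 1, (y_tr == 0).sum() / max((y_tr == 1).sum(), 1), 1.0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_tr, y_tr, sample_weight=w)
print("held-out F1:", round(f1_score(y_te, clf.predict(X_te)), 3))
```

Alternatives to sample weighting, such as oversampling the minority class, would fit the same pipeline.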

We evaluated the machine learning models using several metrics and under diverse testing scenarios. Our results indicate that the proposed techniques achieve high prediction quality and are robust under different conditions. In particular, the GBDT-based solution achieves an F1 score of 0.81, significantly outperforming the competing techniques. Our evaluation also uncovers interesting insights from comparisons across different models, training/testing data, and feature combinations. We show that the proposed techniques impose moderate overhead and are practically feasible for GPGPU soft error prediction.
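For reference, the F1 score reported above is the harmonic mean of precision and recall. The confusion-matrix counts below are made up purely to illustrate how an F1 of 0.81 arises; they are not the paper's actual counts.

```python
# F1 score as the harmonic mean of precision and recall.
def f1(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of predicted errors that were real
    recall = tp / (tp + fn)      # fraction of real errors that were predicted
    return 2 * precision * recall / (precision + recall)

# Illustrative (made-up) counts: 81 true positives, 19 false positives,
# 19 false negatives give precision = recall = 0.81.
print(round(f1(81, 19, 19), 2))  # → 0.81
```

F1 is a natural metric here because plain accuracy is misleading on rare-event data: always predicting "no error" scores near-perfect accuracy but zero recall.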

Last Updated: May 28, 2020 - 4:03 pm