Highlight

Quality Assessment of GPU Power Profiling Mechanisms

Achievement

The development of an assessment methodology to rate the quality and performance of a power profiling mechanism, and its application to four different GPU power profiling techniques.

Significance and Impact

Energy is becoming one of the most expensive resources for running modern supercomputers. Each of these supercomputers consists of many components, some power-hungry and some not. This work improves the understanding of component-level power consumption behavior, which is critical for determining the overall power budget and for planning roadmaps to improve energy efficiency, thereby reducing the total cost of ownership.

Research Details

  • We conducted a detailed assessment of four GPU power profiling mechanisms using a custom-designed GPU stress-test benchmark called matrix-CUDA. The four mechanisms are NVML via Allinea MAP, NVML via direct reads, the PowerInsight device, and PowerInsight via Allinea MAP.
  • Our approach is to use matrix-CUDA to generate a square-wave-like power profile that alternates between high and low load, and to assess each profiling mechanism by how well it reproduces that pattern.
  • Our assessment shows that the PowerInsight device-based GPU power profiling mechanism performs best: it reliably reproduced the expected power patterns at higher frequencies of high-low load variation. In addition, the PowerInsight device reports instantaneous power measurements, so the measured profiles showed square-wave patterns with very sharp transitions from low to high load and vice versa.
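
The square-wave approach described above can be illustrated with a minimal Python sketch. It is not the matrix-CUDA benchmark or the scoring used in the study: the function names, the sample counts, and the 50 W / 200 W levels are all hypothetical, chosen only to show how a measured profile might be compared against an expected high-low pattern.

```python
# Illustrative sketch only: names and wattages are hypothetical, not from matrix-CUDA.

def square_wave(n_samples, period, low, high):
    """Ideal profile: high for the first half of each period, low for the second."""
    half = period // 2
    return [high if (i % period) < half else low for i in range(n_samples)]


def mean_abs_error(expected, measured):
    """Average absolute deviation (in watts) between expected and measured samples."""
    if len(expected) != len(measured):
        raise ValueError("profiles must have the same number of samples")
    return sum(abs(e - m) for e, m in zip(expected, measured)) / len(expected)


def swing_ratio(measured, low, high):
    """Measured peak-to-peak swing divided by the expected swing (1.0 = fully reproduced)."""
    return (max(measured) - min(measured)) / (high - low)


expected = square_wave(n_samples=8, period=4, low=50.0, high=200.0)
sharp = list(expected)              # a profiler that tracks the load exactly
sluggish = [125.0] * len(expected)  # a profiler that only ever reports the average

print(mean_abs_error(expected, sharp), swing_ratio(sharp, 50.0, 200.0))
print(mean_abs_error(expected, sluggish), swing_ratio(sluggish, 50.0, 200.0))
```

Under these assumptions, a mechanism that reproduces sharp transitions scores a low error and a swing ratio near 1, while one that reports only slowly varying values loses most of the swing.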

Overview

Accurate component-level power measurements are essential for the design and optimization of high performance computing (HPC) systems and applications. In particular, as more heterogeneous HPC systems are developed, characterizing GPU power profiles has become crucial because GPUs, although they provide exceptional performance, consume substantial amounts of power. Various GPU power profiling mechanisms are currently available; however, there is no standard way to assess the quality of such profiling schemes. To address this issue, in this paper, we develop an assessment methodology to rate the quality and performance of the profiling mechanism itself. Specifically, we present assessments of four GPU power profiling techniques: (i) Nvidia’s NVML via Allinea MAP, (ii) Nvidia’s NVML via direct reads, (iii) Penguin Computing’s PowerInsight (PI) via two vendor-provided drivers, and (iv) PowerInsight via Allinea MAP. In addition, we discuss the effects of moving-average filters to explain the slow variations in some of the measured power profiles. Based on our assessment, the GPU power profiling mechanism using the PI device outperforms the other schemes by reliably measuring the ground-truth power profile generated by a GPU stress-test benchmark.
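
To illustrate why a moving-average filter would produce slowly varying profiles, the following Python sketch applies a trailing moving average to an ideal square wave. The filter form, the window lengths, and the 50 W / 200 W levels are assumptions made for illustration; the actual filtering inside any of the profiling mechanisms is not specified here.

```python
# Illustrative sketch only: the filter form, window sizes, and wattages are assumptions.

def moving_average(samples, window):
    """Trailing moving average; early outputs average over the samples seen so far."""
    out = []
    running = 0.0
    for i, x in enumerate(samples):
        running += x
        if i >= window:
            running -= samples[i - window]  # drop the sample that left the window
        out.append(running / min(i + 1, window))
    return out


# Ideal high-low load: two samples at 200 W, then two at 50 W, repeated.
profile = [200.0, 200.0, 50.0, 50.0, 200.0, 200.0, 50.0, 50.0]

smoothed_short = moving_average(profile, window=2)  # window is half the load period
smoothed_long = moving_average(profile, window=4)   # window equals the load period

print(smoothed_short)
print(smoothed_long)
```

With the shorter window, each sharp edge turns into a ramp through an intermediate value. Once the window spans a full high-low period, the filtered output settles at the mean and the square wave disappears, which is consistent with a filtered mechanism struggling as the load variation frequency increases.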