The design and implementation of a component-level power profiler.
Significance and Impact
This work gives software developers further insight into the power-consumption behavior of their code by providing component-level detail. Energy-efficient code matters because energy efficiency is now a first-class design constraint in high-performance computing.
- This work implements a mechanism for collecting power data for components such as the CPU, GPU, and memory DIMMs from metering devices located outside the compute nodes.
- This work implements a mechanism for displaying the collected power data in conjunction with other performance metrics collected on compute nodes.
- The team developed programs and scripts that extend a commercial performance profiler by incorporating out-of-band, component-level power data. Two bugs in the profiler were identified in the process and subsequently fixed by the vendor. Our initial evaluation indicates that the new profiler provides a reliable, higher sampling rate than Intel’s Running Average Power Limit (RAPL).
Energy efficiency is a major challenge for high-performance computing (HPC), and accurate, fine-grained power monitoring is critical to the success of energy-efficient HPC. Solutions that map monitored power data back to the source code of a run are also important because they help a software developer diagnose inefficient use of allocated resources. In this work we designed and implemented one such solution, a component-level power profiler. We extended an industrial-strength, application-level performance profiler so that it presents component-level power data alongside the performance data it collects itself. In the process, two bugs in the profiler were identified and subsequently fixed by the vendor. The result is a capability for profiling the power of a parallel application at the component level. Our initial evaluation indicates that the new profiler provides a reliable, higher sampling rate than Intel’s Running Average Power Limit (RAPL) technology, a popular mechanism in the HPC community. In addition, the profiler provides a complete picture of the power breakdown of a compute node, which RAPL cannot provide.
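To illustrate what combining out-of-band power data with in-band performance data involves, the following is a minimal sketch in Python. It is not the profiler's actual implementation: the function name `align_power_to_profile`, the data layout, and the sample values are all hypothetical, and the sketch assumes that the external meters and the compute node share a synchronized clock. Each profiler sample is paired with the most recent power reading of every component, and a node total is reported only when all components have a reading.

```python
from bisect import bisect_right


def align_power_to_profile(perf_samples, power_samples):
    """Attach the latest power reading at or before each profiler sample.

    perf_samples:  list of (timestamp_s, metric_value), sorted by time
    power_samples: dict mapping component name -> list of
                   (timestamp_s, watts), sorted by time
    Both inputs are hypothetical layouts chosen for this sketch.
    """
    # Extract the timestamps once per component so they can be searched.
    times = {c: [ts for ts, _ in readings]
             for c, readings in power_samples.items()}

    merged = []
    for t, metric in perf_samples:
        row = {"time": t, "metric": metric}
        for component, readings in power_samples.items():
            # Index of the last reading whose timestamp is <= t.
            i = bisect_right(times[component], t) - 1
            row[component] = readings[i][1] if i >= 0 else None
        known = [row[c] for c in power_samples if row[c] is not None]
        # Report a node total only when every component has a reading.
        complete = len(known) == len(power_samples)
        row["node_total"] = sum(known) if complete else None
        merged.append(row)
    return merged


# Made-up sample data, for illustration only.
perf = [(0.0, 10), (1.0, 20), (2.0, 30)]
power = {
    "cpu": [(0.0, 50.0), (0.5, 60.0), (1.5, 70.0)],
    "gpu": [(0.5, 100.0), (1.8, 120.0)],
}
rows = align_power_to_profile(perf, power)
for row in rows:
    print(row)
```

Pairing each profiler sample with the latest earlier reading is only one possible policy; interpolating between readings or averaging over the sampling interval are alternatives when the meter and the profiler sample at very different rates.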