Highlight

Scaling the Summit: Deploying the World’s Fastest Supercomputer

Summit Node Architecture

Achievement

Summit, the latest flagship supercomputer deployed at the Oak Ridge Leadership Computing Facility (OLCF), became the number one system on the TOP500 list in June 2018 and remains in the top spot in the most recent edition of the list. Summit also took the top spot on the 10th HPCG Performance list and the number three spot on the November 2018 Green500 list. Furthermore, five of six SC18 Gordon Bell award submissions used Summit, including the two winners.

Significance and Impact

An extensive acceptance test plan was developed to evaluate the unique features introduced in the Summit architecture and system software stack. The plan includes tests to ensure that the system is reliable, stable, and performant. This work describes the multi-month process of deploying Summit, which includes acceptance test plan design, test development and validation, and the multi-phase acceptance process. The description covers issues encountered, lessons learned, and best practices uncovered along the way.

Research Details

  • Summit’s acceptance testing used an established methodology that was previously used to successfully deploy OLCF leadership systems, including Jaguar and Titan.
  • The acceptance test plan includes four distinct test elements (hardware, functionality, performance, and stability), each with specific entry and exit criteria that must be met before the next test element can proceed.
  • Acceptance tests are developed to represent the current OLCF application portfolio and simulate production workloads.
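
The gating described in the list above, where each test element must meet its entry and exit criteria before the next one proceeds, can be sketched as follows. This is a minimal illustration only, not the OLCF's actual tooling: the class `AcceptanceElement`, the function `run_plan`, and the callable criteria are all hypothetical names, and the element order is assumed to follow the order in which the elements are listed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AcceptanceElement:
    """One test element with its gating criteria (hypothetical structure)."""
    name: str
    entry_ok: Callable[[], bool]  # True when the entry criteria are met
    exit_ok: Callable[[], bool]   # True when the exit criteria are met


def run_plan(elements: List[AcceptanceElement]) -> List[str]:
    """Run elements in order; stop at the first unmet entry or exit criterion.

    Returns the names of the elements that completed.
    """
    completed = []
    for element in elements:
        if not element.entry_ok():
            break  # the element cannot start
        if not element.exit_ok():
            break  # the element did not finish, so the next one cannot proceed
        completed.append(element.name)
    return completed


# Assumed order, taken from the order in which the elements are listed.
ELEMENT_ORDER = ["hardware", "functionality", "performance", "stability"]
```

In this sketch a failed exit criterion for the performance element would leave only the hardware and functionality elements marked as completed, and the stability element would never start.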

Overview

Due to the scale and complexity of the Summit system, the OLCF developed and executed a thorough acceptance test plan with two major goals. The first goal was to verify performance targets of individual hardware technologies across the entire system. The second goal was to determine the system’s readiness to support the facility’s user programs by conducting tests involving all major components of the system software stack. Each of the acceptance test elements represents a critical capability for a leadership computing system. The hardware test element verifies that all delivered parts of the system meet the manufacturer’s specifications for functionality and performance, and also evaluates common system administration actions related to monitoring, maintenance, and fault recovery. The functionality test element demonstrates that the system hardware and software meet essential requirements for developing and running a wide variety of applications. The performance test element measures the ability of the system hardware and software to meet contractual performance and scalability requirements, and includes measurement of application-specific figures of merit for isolated runs. Finally, the stability test element simulates a mix of code development and production application workloads that fully utilize the system for an extended period of two weeks, with a requirement that no high-severity failures are encountered.
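
The pass condition for the stability test element described above (a two-week run with no high-severity failures) can be expressed as a simple check. This is a sketch under stated assumptions: the `Failure` record, the severity labels, and the function `stability_passed` are hypothetical and do not reflect how the OLCF actually logs or classifies failures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class Failure:
    """A logged failure (hypothetical record format)."""
    when: datetime
    severity: str  # assumed labels: "low", "medium", "high"


def stability_passed(
    start: datetime,
    end: datetime,
    failures: Iterable[Failure],
    required: timedelta = timedelta(weeks=2),
) -> bool:
    """Return True if the run lasted the required period with no
    high-severity failure inside the [start, end] window."""
    if end - start < required:
        return False  # the run was too short to count
    return not any(
        f.severity == "high" and start <= f.when <= end for f in failures
    )
```

Lower-severity failures do not fail the check in this sketch, which mirrors the stated requirement that only high-severity failures are disallowed.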

Publication

V. Melesse Vergara, et al., “Scaling the Summit: Deploying the World’s Fastest Supercomputer”, International Workshop on OpenPOWER for HPC, Frankfurt, Germany, June 2019. (Resolution Publication #125376)