Publication

GPU-centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM

Citation

Sreeram Potluri, Anshuman Goswami, Davide Rossetti, and Chris J. Newburn (NVIDIA Corporation); Manjunath Gorentla Venkata and Neena Imam (Oak Ridge National Laboratory). GPU-centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM. HiPC 2017, Jaipur, India, December 2017.

Abstract

GPU-based extreme-scale systems are popular for scientific and data-intensive computing. As we move toward pre-exascale systems with multiple GPUs per node and tens of thousands of GPUs in a system, efficiently moving data between GPUs within a node and between GPUs across nodes is critical for performance. It is important that the programming model support such capabilities, for both productivity and performance. NVSHMEM is an implementation of OpenSHMEM for NVIDIA GPUs that enables communication from within CUDA kernels over PCIe and NVLink. In this work, we focus on developing support in NVSHMEM for communication over Mellanox InfiniBand, making it suitable for systems such as Summit. We demonstrate the usage and productivity advantages of NVSHMEM with microbenchmarks and a 2D stencil kernel. To evaluate the efficiency of the InfiniBand support, we show that the network can be saturated using only a few of the streaming multiprocessors available on NVIDIA's Pascal P100 GPU, achieving a message throughput of 90 million messages per second on a Mellanox EDR InfiniBand HCA.
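For context, the capability the abstract describes is device-initiated communication: an OpenSHMEM-style one-sided put issued directly from a CUDA kernel, rather than from host code. The sketch below illustrates that idea using the publicly documented NVSHMEM host and device API (nvshmem_init, nvshmem_malloc, nvshmem_int_p, nvshmem_barrier_all); it is a minimal illustration under those assumptions, not the paper's implementation, and the 2017 prototype interface may have differed.

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <nvshmem.h>

    // Each PE writes its own rank into the symmetric buffer on its right
    // neighbor, directly from device code -- no host-side communication call.
    __global__ void ring_put(int *dst, int mype, int npes) {
        int peer = (mype + 1) % npes;
        nvshmem_int_p(dst, mype, peer);  // one-sided put from the CUDA kernel
    }

    int main(void) {
        nvshmem_init();
        int mype = nvshmem_my_pe();
        int npes = nvshmem_n_pes();

        // Select a GPU for this PE (assumes PEs map round-robin onto local GPUs).
        int ndev;
        cudaGetDeviceCount(&ndev);
        cudaSetDevice(mype % ndev);

        // Symmetric allocation: the same address is a valid remote target on every PE.
        int *dst = (int *) nvshmem_malloc(sizeof(int));

        ring_put<<<1, 1>>>(dst, mype, npes);
        cudaDeviceSynchronize();  // wait for the kernel (and its put) to complete
        nvshmem_barrier_all();    // ensure all remote puts are globally visible

        int received = -1;
        cudaMemcpy(&received, dst, sizeof(int), cudaMemcpyDeviceToHost);
        printf("PE %d received %d from its left neighbor\n", mype, received);

        nvshmem_free(dst);
        nvshmem_finalize();
        return 0;
    }

The key point is that the put targets a symmetric address on a remote PE from inside device code; extending exactly this capability from intra-node PCIe/NVLink transports to inter-node InfiniBand is the subject of the paper.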
