Experiences on Using Different OpenMP 4.5 Programming Styles to Bring DMRG++ to Exascale

Event: The International Workshop on OpenMP (IWOMP)


Presenter: Arghya Chatterjee


With rapidly changing microprocessor designs and growing architectural diversity (multicores, manycores, accelerators) in next-generation pre-Exascale systems, applications at the Oak Ridge Leadership Computing Facility must adapt to the hardware to exploit the different types of parallelism it offers. To benefit from all the hardware threads within a node, it is best to use a hybrid programming approach: OpenMP within each node and MPI across nodes for scaling.

On these heterogeneous systems, which provide different levels of parallelism, data locality, and memory hierarchies, achieving performance portability can be challenging. As we move toward them, we need to experiment with different “programming styles” for in-node parallelization using OpenMP to manage their complexity.

New features in the OpenMP 4.0/4.5 specifications address some of these challenges. They include affinity, target support, user-defined reductions, taskgroups, and taskloops, and they interoperate with OpenMP 2.5 features such as worksharing loops and nested parallelism. We need to use these constructs together to exploit all the parallelism available within a node.
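As a rough illustration of two of the constructs named above, the sketch below uses a user-defined reduction inside a worksharing loop and a taskloop over independent blocks. It is not taken from DMRG++ or the mini-app: the `Stats` struct, the `merge` helper, the `stats_merge` reduction identifier, and the functions `block_stats` and `scale_blocks` are hypothetical names chosen for this example. If the code is compiled without OpenMP support, the pragmas are ignored and the functions run serially.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical accumulator (not from DMRG++): running sum and largest magnitude.
struct Stats {
    double sum = 0.0;
    double max_abs = 0.0;
};

// Combine two partial results; used as the combiner of the reduction below.
inline void merge(Stats& out, const Stats& in) {
    out.sum += in.sum;
    out.max_abs = std::max(out.max_abs, in.max_abs);
}

// User-defined reduction (OpenMP 4.0+): tells the runtime how to merge
// per-thread Stats copies and how to initialise each private copy.
#pragma omp declare reduction(stats_merge : Stats : merge(omp_out, omp_in)) \
    initializer(omp_priv = Stats{})

// Worksharing loop (an OpenMP 2.5-era construct) combined with the
// user-defined reduction declared above.
Stats block_stats(const std::vector<double>& v) {
    Stats s;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(stats_merge : s)
    for (long i = 0; i < n; ++i) {
        s.sum += v[i];
        s.max_abs = std::max(s.max_abs, std::fabs(v[i]));
    }
    return s;
}

// taskloop (OpenMP 4.5): one thread generates the tasks, and the team
// executes them. grainsize(1) requests roughly one block per task.
void scale_blocks(std::vector<std::vector<double>>& blocks, double alpha) {
    const long nb = static_cast<long>(blocks.size());
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(1)
    for (long b = 0; b < nb; ++b) {
        for (double& x : blocks[b]) {
            x *= alpha;
        }
    }
}
```

Calling `block_stats` on a vector returns its sum and largest magnitude, and `scale_blocks(blocks, 2.0)` doubles every entry of every block in place. Whether a worksharing loop or a taskloop is the better fit for a given kernel depends on load balance and block sizes, which is the kind of question the talk explores.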

Finding the best programming style (e.g., SPMD-style, multi-level tasks, accelerator programming, nested parallelism, or a combination of these) to maximize performance and achieve performance portability across multiple systems with different architectures is still an open research problem. As part of this exploration, we have developed a mini-application from the original DMRG++ application (a sparse matrix algebra computational motif, developed at ORNL). The mini-app exposes different types of parallelism (tasking, data, SIMD-level parallelism, etc.) that we can exploit collectively with OpenMP 4.5 constructs to obtain performance on future pre-Exascale systems.
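To make the idea of combining levels of parallelism concrete, the following sketch layers task parallelism over SIMD parallelism for a block-diagonal matrix-vector product. This is an assumed stand-in for a sparse block kernel, not the actual DMRG++ or mini-app code. The `Block` struct and the `block_diag_matvec` function are hypothetical, and the caller is assumed to pass an `x` whose length equals the sum of the block sizes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical dense diagonal block (row-major, n x n); not the DMRG++ layout.
struct Block {
    std::size_t n;          // block dimension
    std::vector<double> a;  // n * n entries
};

// y = A * x for a block-diagonal A. Assumes x.size() equals the sum of block sizes.
std::vector<double> block_diag_matvec(const std::vector<Block>& blocks,
                                      const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);

    // Starting row/column of each block inside x and y.
    std::vector<std::size_t> offset(blocks.size() + 1, 0);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        offset[b + 1] = offset[b] + blocks[b].n;
    }

    #pragma omp parallel
    #pragma omp single
    {
        // Outer level: one task per block (tasking parallelism).
        for (std::size_t b = 0; b < blocks.size(); ++b) {
            #pragma omp task default(shared) firstprivate(b)
            {
                const Block& blk = blocks[b];
                const double* xb = x.data() + offset[b];
                double* yb = y.data() + offset[b];
                for (std::size_t i = 0; i < blk.n; ++i) {
                    const double* row = blk.a.data() + i * blk.n;
                    double acc = 0.0;
                    // Inner level: vectorised dot product (SIMD parallelism).
                    #pragma omp simd reduction(+ : acc)
                    for (std::size_t j = 0; j < blk.n; ++j) {
                        acc += row[j] * xb[j];
                    }
                    yb[i] = acc;
                }
            }
        }
    }  // implicit barrier: all tasks have finished before y is returned
    return y;
}
```

Each task writes a disjoint slice of `y`, so no synchronization is needed beyond the implicit barrier at the end of the parallel region. With blocks of very different sizes, one task per block gives the runtime some freedom to balance the load, while the `simd` loop targets the vector units within each core.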

In this talk, we will briefly discuss the application and the different types of parallelism it contains, followed by our experiments with various programming styles (using the latest OpenMP 4.5 constructs) and their performance implications. We will conclude with our experience of what worked and what did not, and what is currently missing in OpenMP that might improve its usability.