### Achievement

We designed a novel distributed-memory algorithm to improve the strong scalability of the solution of a sparse triangular system. This operation appears in the solve phase of direct methods for solving general sparse linear systems, Ax = b. Our 3D sparse triangular solver employs several techniques, including a 3D MPI process grid, elimination tree parallelism, and data replication, all of which reduce the per-process communication when combined. We present analytical models to understand the communication cost of our algorithm and show that our 3D sparse triangular solver can reduce the per-process communication volume asymptotically by a factor of O(n^{1/4}) and O(n^{1/6}) for problems arising from the finite element discretizations of 2D “planar” and 3D “non-planar” PDEs, respectively. We implement our algorithm for use in SuperLU_DIST3D, using a hybrid MPI+OpenMP programming model.

### Significance and Impact

Solving a large and sparse system of linear equation is ubiquitous in all domains of scientific computing. The sparse triangular solver is an important sparse linear algebraic kernel used by many iterative and direct algorithms for solving the system of linear equations. Yet distributed memory sparse triangular solver is not very scalable. Our 3D triangular solve algorithm, when run on 12k cores of Cray XC30, outperforms the current state-of-the-art 2D algorithm by 7.2x for planar and 2.7x for the non-planar sparse matrices, respectively. This will enable faster completion of many scientific simulations that will lead to accelerated discovery.

### Research Details

In our previous work, we developed a communication-avoiding algorithm for sparse LU factorization called SuperLU_Dist3D. The idea underlying this SuperLU_Dist3D method is to organize the MPI processes logically into a three-dimensional grid, rather than a traditional 2D one, and then exploit the structure of the elimination tree—an abstraction that captures the data dependencies in sparse LU factorization—to replicate data judiciously. This combination of techniques provably reduce communication asymptotically in the problem size in common cases. In this work, we leverage the 3D sparse LU data structure of SuperLU_Dist3D to develop a communication avoiding SpTrs, which yields asymptotic reductions in the latency and communication-volume costs of a conventional SpTrs. Briefly, our new 3D SpTrs works as follows. Consider the 3D process grid as a collection of 2D MPI process grids. The prior technique of SuperLU_Dist3D mapped independent subtrees of the elimination tree to each 2D process grid and replicated the common ancestors. Our 3D triangular solver exploits this same 3D organization. It first solves independent subtrees on different 2D process grids, and then performs a reduction before solving the subproblem in the common ancestor tree on a single 2D grid.

### Publication

Sao, Piyush, Ramakrishnan Kannan, Xiaoye Sherry Li, and Richard Vuduc. "A communication-avoiding 3D sparse triangular solver." In Proceedings of the ACM International Conference on Supercomputing, pp. 127-137. ACM, 2019 https://dl.acm.org/citation.cfm?id=3330357

### Overview

We present a novel distributed memory algorithm to improve the strong scalability of the solution of a sparse triangular system. This operation appears in the solve phase of direct methods for solving general sparse linear systems, Ax = b. Our 3D sparse triangular solver employs several techniques, including a 3D MPI process grid, elimination tree parallelism, and data replication, all of which reduce the per-process communication when combined. We present analytical models to understand the communication cost of our algorithm and show that our 3D sparse triangular solver can reduce the per-process communication volume asymptotically by a factor of O(n1/4) and O(n1/6) for problems arising from the finite element discretizations of 2D "planar" and 3D "non-planar" PDEs, respectively. We implement our algorithm for use in SuperLU_DIST3D, using a hybrid MPI+OpenMP programming model. Our 3D triangular solve algorithm, when run on 12k cores of Cray XC30, outperforms the current state-of-the-art 2D algorithm by 7.2x for planar and 2.7x for the non-planar sparse matrices, respectively.