### Achievement

The CP decomposition is a low-rank approximation of a multi-dimensional array, or tensor, which generalizes matrix approximations like the truncated singular value decomposition. It approximates the input tensor by a sum of rank-one tensors, which are outer products of vectors. CP is often used for finding hidden patterns, or latent factors, within tensor data, particularly when the goal is to interpret the factors, and it is popular within the signal processing, machine learning, and scientific computing communities. A joint collaboration between ORNL and Wake Forest University resulted in a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensor data that can enforce nonnegativity of the computed low-rank factors. This paper was presented in the presitigious 26^{th} IEEE Conference on High Performance Computing, Data, and Analytics at Bangalore, India. The principal task is to parallelize the Matricized-Tensor Times Khatri-Rao Product (MTTKRP) bottleneck subcomputation. The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmarked our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software.

### Significance and Impact

For the first time in the literature, we presented the first distributed-memory parallel implementation of tensor factorization based on NNCP algorithms for arbitrary-dimension dense tensors. The algorithm involved an optimized use of dimension trees for dense tensors, avoiding recomputation across multiple MTTKRPs. This communication optimal algorithm with a carefully chosen processor grid. We demonstrated a performance improvement of up to 2.2x over the existing state-of-the-art parallel software on 3D tensors and efficient parallel scaling of up to 1771x on 4096 cores. This enables the scientists to process multi-dimensional data captured from the scientific instruments to scale on entire titan.