### Achievement

The CP decomposition is a low-rank approximation of a multi-dimensional array, or tensor, which generalizes matrix approximations like the truncated singular value decomposition. It approximates the input tensor by a sum of rank-one tensors, which are outer products of vectors. CP is often used for finding hidden patterns, or latent factors, within tensor data, particularly when the goal is to interpret the factors, and it is popular within the signal processing, machine learning, and scientific computing communities. A joint collaboration between ORNL and Wakeforest University resulted in a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensor data that can enforce nonnegativity of the computed low-rank factors. This paper was presented in the presitigious 26th IEEE Conference on High Performance Computing, Data, and Analytics at Bangalore, India. The principal task is to parallelize the Matricized-Tensor Times Khatri-Rao Product (MTTKRP) bottleneck subcomputation. The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmarked our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software.

### Significance and Impact

For the first time in the literature, we presented the first distributed-memory parallel implementation of NNCP algorithms for arbitrary-dimension dense tensors. The algorithm involved an optimized use of dimension trees for dense tensors, avoiding recomputation across multiple MTTKRPs. This communication optimal algorithm with a carefully chosen processor grid. We demonstrated a performance improvement of up to 2.2x over the existing state-of-the-art parallel software on 3D tensors and efficient parallel scaling of up to 1771x on 4096 cores. This enables the scientists to process multi-dimensional data captured from the scientific instruments to scale on entire titan.

### Overview

The last few decades have witnessed a significant increase in the sophistication of scientific instruments. This sophistication has led to an exponential increase in scientific data, so much so that today’s scientific datasets are comprised of multidimensional matrices known as tensors. Understanding such complex datasets brings unprecedented insight but is computationally expensive. A joint collaboration between Oak Ridge National Laboratory and Wake Forest University has resulted in a distributed-memory parallel algorithm that reuses computation via dimension trees, allowing researchers to analyze vast scientific datasets, and achieve results, much faster than current practices. The algorithm also reduces communications via the tuning of a multidimensional processor grid.

In fact, the team demonstrated a performance improvement of up to 2.2 times that of existing state-of-the-art parallel software and efficient scaling of 1,771 times across 4,096 cores. Such efficiency allowed researchers to process multi-dimensional data, in this case materials and MRI data, from scientific instruments in half the time across all of ORNL’s Titan supercomputer, which ranks among the top ten most powerful supercomputers in the world. The team’s results were presented at the prestigious 26th IEEE Conference on High Performance Computing, Data, and Analytics in Bangalore, India.