Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in many
graph, scientific computing, and machine learning algorithms. In this paper,
we consider SpGEMM computations performed on hundreds of thousands of
processors that generate trillions of nonzeros in the output matrix.
Distributed SpGEMM at
this extreme scale faces two key challenges: (1) high communication cost and
(2) inadequate memory to generate the output. We address these challenges with
an integrated communication-avoiding and memory-constrained SpGEMM algorithm
that scales to 262,144 cores (more than one million hardware threads) and can
multiply sparse matrices of any size, provided that the inputs and a fraction
of the output fit in the aggregate memory. As we scale from 16,384 to 262,144
cores on a Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster
when multiplying large-scale protein-similarity matrices.
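
For intuition, the sketch below illustrates the memory-constrained idea in
its simplest form: compute C = A*B one column batch of B at a time, so that
only a fraction of the output is ever resident. This is a single-node toy in
Python with SciPy under illustrative naming (batched_spgemm, num_batches, and
consume are not from the paper); the paper's distributed,
communication-avoiding algorithm is far more involved.

    import scipy.sparse as sp

    def batched_spgemm(A, B, num_batches, consume):
        # Multiply sparse A (m x k) by sparse B (k x n) one column batch at
        # a time. `consume` is a caller-supplied callback (e.g., write the
        # batch to disk), so the full output C never resides in memory.
        n = B.shape[1]
        step = (n + num_batches - 1) // num_batches  # ceil(n / num_batches)
        for start in range(0, n, step):
            stop = min(start + step, n)
            C_batch = A @ B[:, start:stop]   # SpGEMM on a column slice of B
            consume(start, stop, C_batch)    # hand off, then drop the batch

    # Toy usage: multiply two random sparse matrices in 8 passes, each pass
    # materializing roughly 1/8 of the output.
    A = sp.random(10000, 10000, density=1e-3, format="csr")
    B = sp.random(10000, 10000, density=1e-3, format="csc")
    batched_spgemm(A, B, 8, lambda lo, hi, C: print(lo, hi, C.nnz))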