Graph Neural Networks (GNNs) are powerful and flexible neural networks that
use the naturally sparse connectivity information of the data. GNNs represent
this connectivity as sparse matrices, which have lower arithmetic intensity and
thus higher communication costs compared to dense matrices, making GNNs harder
to scale to high concurrencies than convolutional or fully-connected neural
networks.
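
To make the sparsity argument concrete, the following sketch (illustrative only, not code from the paper) shows a single GCN-style layer in which propagation reduces to a sparse-dense matrix product A H W; the node count, feature sizes, and edge list are arbitrary assumptions.

```python
import torch

n, f_in, f_out = 5, 8, 4                      # nodes, input features, output features

# Sparse adjacency matrix in COO format (edges chosen arbitrarily for illustration).
edge_index = torch.tensor([[0, 1, 2, 3, 4, 1],
                           [1, 0, 3, 2, 1, 4]])
values = torch.ones(edge_index.shape[1])
A = torch.sparse_coo_tensor(edge_index, values, (n, n))

H = torch.randn(n, f_in)                      # dense node-feature matrix
W = torch.randn(f_in, f_out)                  # dense layer weights

# The propagation step is a sparse-dense matmul (SpMM), which has lower
# arithmetic intensity than a dense-dense matmul of the same size.
H_next = torch.relu(torch.sparse.mm(A, H @ W))
print(H_next.shape)                           # (n, f_out)
```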
We introduce a family of parallel algorithms for training GNNs and show that
they can asymptotically reduce communication compared to previous parallel GNN
training methods. We implement these algorithms, which are based on 1D, 1.5D,
2D, and 3D sparse-dense matrix multiplication, using torch.distributed on
GPU-equipped clusters. Our algorithms optimize communication across the full
GNN training pipeline. We train GNNs on over a hundred GPUs, on datasets that
include a protein network with over a billion edges.
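
As a rough illustration of the simplest (1D) variant, the sketch below shows a block-row-partitioned sparse-dense matmul with torch.distributed: each rank owns a block of rows of the sparse matrix and the matching rows of the dense matrix, all-gathers the dense blocks, and multiplies locally. This is a minimal sketch under assumed shapes and backend, not the paper's implementation; the function name `block_row_spmm` and all sizes are hypothetical.

```python
import torch
import torch.distributed as dist

def block_row_spmm(A_local, H_local):
    """Compute this rank's block of A @ H, where A_local holds the rank's rows
    of the sparse matrix and H_local the rank's rows of the dense matrix."""
    world_size = dist.get_world_size()
    # Gather every rank's block of the dense matrix: this is the communication
    # step whose volume the 1.5D, 2D, and 3D variants aim to reduce.
    gathered = [torch.empty_like(H_local) for _ in range(world_size)]
    dist.all_gather(gathered, H_local)
    H_full = torch.cat(gathered, dim=0)          # full dense matrix on each rank
    return torch.sparse.mm(A_local, H_full)      # local sparse-dense product

if __name__ == "__main__":
    # Assumes launch via `torchrun --nproc_per_node=<p> this_script.py`.
    dist.init_process_group(backend="gloo")
    rank, p = dist.get_rank(), dist.get_world_size()
    rows, f = 8, 4
    n = rows * p
    # Each rank builds its block of rows of a (here trivial, identity) sparse matrix.
    A_local = torch.eye(n)[rank * rows:(rank + 1) * rows].to_sparse()
    H_local = torch.randn(rows, f)
    out = block_row_spmm(A_local, H_local)
    print(rank, out.shape)                       # (rows, f) on every rank
    dist.destroy_process_group()
```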