Technology trends suggest that future machines will rely
on parallelism to meet increasing performance requirements. To aid in programmer productivity and application performance, many
parallel programming models provide communication building
blocks called collective communication. These
operations, such as Broadcast, Scatter, Gather, and Reduce, abstract
common global data movement patterns behind a simple library
interface allowing the hardware and runtime system to optimize them for performance and scalability.
We consider the problem of optimizing collective communication in Partitioned Global Address Space (PGAS) languages. Rooted in traditional
shared memory programming models, they deliver the benefits of
sophisticated distributed data structures using language extensions
and one-sided communication.
One-sided communication allows one processor to directly read and
write memory associated with another.
Many popular PGAS language implementations share a common runtime
system called GASNet for implementing such communication. To provide a highly scalable
platform for our work, we present a new implementation of
GASNet for the IBM BlueGene/P, allowing GASNet to scale to tens of thousands of processors.
We demonstrate that PGAS languages are highly scalable and that
the one-sided communication within them is an efficient and
convenient platform for collective communication. We show how to use one-sided communication to achieve 3x improvements in the latency and
throughput of the collectives over standard message passing implementations.
Using a 3D FFT as a representative communication
bound benchmark, for example, we see a 17% increase in performance on 32,768
cores of the BlueGene/P and a 1.5x improvement on 1024 cores of the CrayXT4. We also show how the
automatically tuned collectives can deliver more than an order of
magnitude in performance over existing implementations on shared
memory platforms.
There is no obvious
best algorithm that serves all machines and usage patterns
demonstrating the need for tuning and we thus build
an automatic tuning system in GASNet
that optimizes the collectives for a variety of large scale
supercomputers and novel multicore architectures. To understand the large search space, we
construct analytic performance models use them to minimize the
overhead of autotuning. We demonstrate that autotuning is
an effective approach to addressing performance optimizations on complex parallel systems.