PGAS-based Fine-Grained Asynchronous Execution on GPUs
- Chen, Yuxin
- Advisor(s): Owens, John
Abstract
Inspired by the Partitioned Global Address Space (PGAS) model, this dissertation explores design philosophies for adapting PGAS-based fine-grained asynchronous execution models to GPUs. This approach outperforms the traditional Bulk-Synchronous Parallel (BSP) model in both computation and communication efficiency. We highlight several key design principles. On the computation side: (1) relax global barriers and handle dependencies at a finer granularity; (2) replace the BSP model with a lightweight asynchronous task scheduler that maintains data dependencies; and (3) eliminate heavy synchronization to unlock greater parallelism. On the communication side: (1) leverage PGAS-style lightweight one-sided communication for extensive computation-communication overlap; (2) replace synchronization with lighter mechanisms for data consistency; and (3) design a communication aggregator when network bandwidth is the performance bottleneck.
We applied these principles to graph algorithms, developing Atos, an integrated fine-grained asynchronous execution framework for graph processing on GPUs that follows these PGAS principles in both single- and multi-GPU contexts.
Graph algorithms inherently exhibit irregular and unpredictable workload patterns and often follow a producer-consumer workflow. On a single GPU, these algorithms are traditionally mapped onto the bulk-synchronous execution model, where each bulk step processes a vertex frontier while generating a new one for the next bulk step; the producer and consumer are serialized by the kernel boundary. In contrast, Atos implements a fine-grained asynchronous execution scheme on a single GPU, concurrently running multiple consumers and producers and thereby streamlining the delivery of new tasks to consumers. This approach entails a substantially higher number of handshakes between consumers and producers, making handshake overhead pivotal to its effectiveness. Instead of relying on a heavyweight serialized handshake (single-GPU) or two-sided handshakes (multi-GPU), we developed an asynchronous (distributed) queue. This design offers a lightweight and efficient handshake mechanism that minimizes overhead, preserves dependency integrity, and swiftly exposes newly generated parallelism that would otherwise remain concealed in a bulk-synchronous model.
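As a concrete, hedged illustration of such a handshake, the following toy CUDA sketch runs producers and consumers concurrently inside one persistent kernel, coordinating through a device-side queue. The queue layout, the names (push, worker, pending), and the synthetic task tree are our own illustrative assumptions, not Atos's actual API; the sketch assumes compute capability 7.0+ (independent thread scheduling) and a queue large enough to avoid wraparound.

// Toy persistent-kernel task queue: producers and consumers run concurrently
// inside a single kernel launch, with no kernel-boundary barrier between
// "iterations". Illustrative only; not Atos's implementation.
#include <cstdio>
#include <cuda_runtime.h>

#define CAP (1 << 20)             // queue capacity; the toy assumes no wraparound

__device__ int q[CAP];            // task payloads
__device__ int ready[CAP];        // per-slot publish flag (0 = empty)
__device__ int q_head;            // next slot to consume (zero-initialized)
__device__ int q_tail;            // next slot to reserve
__device__ int pending;           // tasks enqueued but not yet completed
__device__ int processed;         // statistics only

__device__ void push(int v) {
    atomicAdd(&pending, 1);               // account for the new task first
    int slot = atomicAdd(&q_tail, 1);     // reserve a slot
    q[slot] = v;                          // fill it
    __threadfence();                      // order the payload before the flag
    atomicExch(&ready[slot], 1);          // publish to consumers
}

__global__ void worker(int limit) {
    while (true) {
        int slot = atomicAdd(&q_head, 1);            // claim the next task
        while (atomicAdd(&ready[slot], 0) == 0)      // wait for a producer
            if (*(volatile int *)&pending == 0)      // quiescent: nothing
                return;                              //   in flight anywhere
        int v = q[slot];
        if (v < limit) { push(2 * v + 1); push(2 * v + 2); }  // spawn children
        atomicAdd(&processed, 1);
        atomicSub(&pending, 1);                      // task complete
    }
}

int main() {
    int one = 1;                                     // seed one root task:
    cudaMemcpyToSymbol(ready, &one, sizeof(int));    //   q[0] = 0, published
    cudaMemcpyToSymbol(q_tail, &one, sizeof(int));
    cudaMemcpyToSymbol(pending, &one, sizeof(int));
    worker<<<4, 128>>>(1 << 16);                     // one launch, no BSP steps
    cudaDeviceSynchronize();
    int n; cudaMemcpyFromSymbol(&n, processed, sizeof(int));
    printf("processed %d tasks\n", n);               // complete binary task tree: 131073
    return 0;
}

The key point is that a newly pushed task becomes visible to any idle consumer immediately, rather than only after a global barrier at the end of a bulk step.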
Moving to multi-GPU scenarios, we expand upon the single-GPU design by incorporating cross-GPU communication. The dominant hybrid MPI+X model often results in over-serialized per-node code, marked by heavyweight two-sided CPU handshakes after computationally intensive loop nests on GPUs. In contrast, Atos operates entirely on GPUs, using PGAS-style lightweight one-sided memory operations for efficient intra- and inter-node communication from within GPU kernels. This cross-GPU communication features lower latency and lower overhead and does not require synchronizing with CPUs. These properties make it profitable to run high-utilization GPU kernels with more frequent communication, yielding better latency hiding and smoother interconnect usage.
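To give a flavor of kernel-initiated one-sided communication, here is a hedged sketch using NVSHMEM, a PGAS communication library for NVIDIA GPUs. The remote-queue layout and all names (rq, rq_tail, push_remote) are illustrative assumptions rather than Atos's actual interface; each process (PE) drives one GPU and would be started with a launcher such as nvshmrun or mpirun.

// Each PE deposits items directly into a neighbor PE's queue from inside a
// GPU kernel: no matching receive, no CPU handshake, no kernel boundary.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>

__global__ void push_remote(int *rq, int *rq_tail, int n, int mype, int npes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int dest = (mype + 1) % npes;                    // right-hand neighbor PE
    // Reserve a slot in the *remote* queue with a one-sided atomic, then
    // deposit the payload with a one-sided put.
    int slot = nvshmem_int_atomic_fetch_add(rq_tail, 1, dest);
    nvshmem_int_p(rq + slot, mype * 1000 + i, dest); // one-sided put
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe(), npes = nvshmem_n_pes();
    // Symmetric-heap allocations: the same pointer is valid on every PE.
    int *rq = (int *)nvshmem_malloc(1024 * sizeof(int));
    int *rq_tail = (int *)nvshmem_malloc(sizeof(int));
    cudaMemset(rq_tail, 0, sizeof(int));
    nvshmem_barrier_all();                           // everyone is initialized
    push_remote<<<1, 256>>>(rq, rq_tail, 256, mype, npes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();                           // quiesce all pending puts
    int got; cudaMemcpy(&got, rq_tail, sizeof(got), cudaMemcpyDeviceToHost);
    printf("PE %d received %d items\n", mype, got);
    nvshmem_free(rq); nvshmem_free(rq_tail);
    nvshmem_finalize();
    return 0;
}

In an Atos-style framework, a receiving kernel would concurrently drain such a queue, so communication initiated by one GPU overlaps with computation already running on another.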
These strategies align with PGAS principles: one-sided, non-blocking communication; lightweight data-consistency mechanisms; fine-grained dependency management; and fine-grained task scheduling. Adhering to this philosophy, our design effectively boosts hardware utilization and overall throughput, which is particularly impactful for problems constrained by limited parallelism. In our two case studies, Breadth-First Search and PageRank, Atos outperforms leading graph libraries (Gunrock, Groute, and Galois) across single-GPU, single-node multi-GPU, and multi-node multi-GPU setups, demonstrating the effectiveness of its PGAS-based asynchronous execution model.
Finally, we extend this PGAS-style lightweight one-sided communication scheme to the forward pass of Deep Learning Recommendation Models (DLRM). The insights gained from Atos about lightweight one-sided communication carry over to DLRM: by leveraging the efficiency of PGAS-style communication, we reduce the overhead of traditional collective communication in DLRM and dramatically increase communication-computation overlap, thereby improving DLRM's scalability and performance. Our performance evaluation supports these expectations, demonstrating significant improvements in both performance and scalability.
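As a hedged sketch of how this could look in a DLRM forward pass, the kernel below pulls remote embedding rows with one-sided NVSHMEM gets instead of a host-coordinated all-to-all. The row-sharding scheme and all names (emb_shard, DIM, gather_embeddings) are our assumptions for illustration, not a specific DLRM implementation.

// Embedding tables are row-sharded across PEs on the symmetric heap, so any
// GPU can read any row of any shard directly from inside the forward kernel.
#include <nvshmem.h>

#define DIM 64                          // embedding dimension (assumed)

__global__ void gather_embeddings(const float *emb_shard, int rows_per_pe,
                                  const long *indices, int n, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    long idx = indices[i];
    int owner = (int)(idx / rows_per_pe);            // PE holding this row
    long row = idx % rows_per_pe;
    // One-sided get: pulls DIM floats straight from the owner's shard.
    // No matching send and no CPU involvement, so the transfer can overlap
    // with other warps' computation in the same kernel.
    nvshmem_float_get(out + (long)i * DIM, emb_shard + row * DIM, DIM, owner);
}

Here emb_shard would come from nvshmem_malloc, so the same pointer names each PE's slice of the table; interleaving such gets with the rest of the forward computation is what buys the communication-computation overlap.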