Scholarly Works (100 results)

Thesis
Peer Reviewed

Automatically Tuning Collective Communication for One-Sided Programming Models

Nishtala, Rajesh
Advisor(s): Yelick, Katherine A

UC Berkeley Electronic Theses and Dissertations (2009)

Technology trends suggest that future machines will rely on parallelism to meet increasing performance requirements. To aid programmer productivity and application performance, many parallel programming models provide communication building blocks called collective communication. These operations, such as Broadcast, Scatter, Gather, and Reduce, abstract common global data movement patterns behind a simple library interface, allowing the hardware and runtime system to optimize them for performance and scalability.

We consider the problem of optimizing collective communication in Partitioned Global Address Space (PGAS) languages. Rooted in traditional shared memory programming models, these languages deliver the benefits of sophisticated distributed data structures using language extensions and one-sided communication. One-sided communication allows one processor to directly read and write memory associated with another. Many popular PGAS language implementations share a common runtime system called GASNet for implementing such communication. To provide a highly scalable platform for our work, we present a new implementation of GASNet for the IBM BlueGene/P, allowing GASNet to scale to tens of thousands of processors.

We demonstrate that PGAS languages are highly scalable and that the one-sided communication within them is an efficient and convenient platform for collective communication. We show how to use one-sided communication to achieve 3x improvements in the latency and throughput of the collectives over standard message passing implementations. Using a 3D FFT as a representative communication-bound benchmark, for example, we see a 17% increase in performance on 32,768 cores of the BlueGene/P and a 1.5x improvement on 1024 cores of the Cray XT4. We also show how the automatically tuned collectives can deliver more than an order of magnitude better performance than existing implementations on shared memory platforms.

No single algorithm is obviously best for all machines and usage patterns, which demonstrates the need for tuning. We therefore build an automatic tuning system in GASNet that optimizes the collectives for a variety of large scale supercomputers and novel multicore architectures. To understand the large search space, we construct analytic performance models and use them to minimize the overhead of autotuning. We demonstrate that autotuning is an effective approach to addressing performance optimizations on complex parallel systems.
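To make the idea of a collective built from one-sided writes concrete, here is a minimal sketch of a binomial-tree Broadcast, simulated in plain Python. It is an illustration only: the function name `tree_broadcast` and the `memory` list standing in for remotely writable buffers are assumptions, it does not use the GASNet API, and it is not one of the tuned algorithms from the thesis.

```python
import math


def tree_broadcast(num_ranks, root_value):
    """Simulate a binomial-tree broadcast from rank 0 using one-sided puts.

    memory[r] is a hypothetical stand-in for a buffer owned by rank r that
    other ranks may write into directly.
    """
    memory = [None] * num_ranks
    memory[0] = root_value
    rounds = 0
    stride = 1
    while stride < num_ranks:
        # Every rank that already holds the value writes it straight into
        # the buffer of the rank `stride` positions away (a one-sided put).
        for src in range(stride):
            dst = src + stride
            if dst < num_ranks:
                memory[dst] = memory[src]
        stride *= 2
        rounds += 1
    return memory, rounds


memory, rounds = tree_broadcast(10, "payload")
print(memory.count("payload"), rounds)  # 10 4
```

The number of holders doubles each round, so the sketch finishes in ceil(log2(P)) rounds for P ranks. Whether a tree like this beats a flat or pipelined scheme on a given machine is the kind of question the tuning system described above is meant to answer.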


Thesis
Peer Reviewed

Scalable Parallel Algorithms for Genome Analysis

Georganas, Evangelos
Advisor(s): Yelick, Katherine A

UC Berkeley Electronic Theses and Dissertations (2016)

A critical problem for computational genomics is the problem of de novo genome assembly: the development of robust scalable methods for transforming short randomly sampled “shotgun” sequences, namely reads, into the contiguous and accurate reconstruction of complex genomes. These reads are significantly shorter (e.g. hundreds of bases long) than the size of chromosomes and also include errors. While advanced methods exist for assembling the small and haploid genomes of prokaryotes, the genomes of eukaryotes are more complex. Moreover, de novo assembly has been unable to keep pace with the flood of data, due to the dramatic increases in genome sequencer capabilities, combined with the computational requirements and the algorithmic complexity of assembling large scale genomes and metagenomes.

In this dissertation, we address this challenge head-on by developing parallel algorithms for de novo genome assembly with the ambition to scale to massive concurrencies. Our work is based on the Meraculous assembler, a state-of-the-art de novo assembler for short reads developed at JGI. Meraculous identifies non-erroneous overlapping substrings of length k (k-mers) with high-quality extensions and uniquely assembles genome regions into uncontested sequences called contigs by constructing and traversing a de Bruijn graph of k-mers, a special graph that is used to represent overlaps among k-mers. The original reads are subsequently aligned onto the contigs to obtain information regarding the relative orientation of the contigs. Contigs are then linked together to create scaffolds, sequences of contigs that may contain gaps among them. Finally, gaps are filled using localized assemblies based on the original reads.
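The k-mer and contig steps described above can be sketched serially in a few lines of Python. This is a simplified illustration, not the Meraculous or HipMer implementation: it treats a k-mer as non-erroneous when it occurs at least `min_count` times (an assumption standing in for the quality-based extension checks), it ignores reverse complements and cycles, and the names `count_kmers` and `build_contigs` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_contigs(reads, k, min_count=2):
    """Walk unambiguous paths of an implicit de Bruijn graph of k-mers."""
    counts = count_kmers(reads, k)
    # Assumption: k-mers seen fewer than min_count times are sequencing errors.
    solid = {kmer for kmer, c in counts.items() if c >= min_count}

    def successors(kmer):
        return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in solid]

    def predecessors(kmer):
        return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in solid]

    contigs, visited = [], set()
    for start in sorted(solid):
        if start in visited:
            continue
        preds = predecessors(start)
        if len(preds) == 1 and len(successors(preds[0])) == 1:
            continue  # interior of a path; it is reached from the path's start
        contig, current = start, start
        visited.add(start)
        while True:
            nxt = successors(current)
            if len(nxt) != 1:
                break  # dead end or fork: the contig is no longer uncontested
            if len(predecessors(nxt[0])) != 1 or nxt[0] in visited:
                break
            current = nxt[0]
            visited.add(current)
            contig += current[-1]
        contigs.append(contig)
    return contigs


reads = ["ATGGCGTG", "GGCGTGCA", "ATGGCGTGCA", "ATGGCTTG"]  # last read has an error
print(build_contigs(reads, k=4))  # ['ATGGCGTGCA']
```

With `min_count=1` the erroneous read introduces a fork after `TGGC` and the walk breaks into three shorter contigs, which is why filtering k-mers before traversal matters. In a distributed setting the `solid` set would be the distributed hash table that the abstract refers to.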

First, we design efficient scalable algorithms for k-mer analysis and contig generation. K-mer analysis is characterized by intensive communication and I/O requirements, and our parallel algorithms successfully reduce the memory requirements by 7×. Then, contig generation relies on efficient parallelization of the de Bruijn graph construction and traversal, which necessitates a distributed hash table and is a key component of most de novo assemblers. We present a novel algorithm that leverages one-sided communication capabilities of Unified Parallel C (UPC) to facilitate the requisite fine-grained, irregular parallelism and the avoidance of data hazards. The sequence alignment is characterized by intensive I/O and large computation requirements. We introduce mer-Aligner, a highly parallel sequence aligner that employs parallelism in all of its components. Finally, this thesis details the parallelization of the scaffolding modules, enabling the first massively scalable, high-quality, complete end-to-end de novo assembly pipeline. Experimental large-scale results using human and wheat genomes demonstrate efficient performance and scalability on thousands of cores. Compared to the original Meraculous code, which requires approximately 48 hours to assemble the human genome, our pipeline called HipMer computes the assembly in only 4 minutes using 23,040 cores of Edison, an overall speedup of approximately 720×.

In the last part of the dissertation we tackle the problem of metagenome assembly. Metagenomics is currently the leading technology for studying uncultured microbial diversity. While accessing an unprecedented number of environmental samples that consist of thousands of individual microbial genomes is now possible, the bottleneck is becoming computational, since sequencing cost improvements outpace Moore’s Law. Metagenome assembly is further complicated by repeated sequences across genomes, polymorphisms within a species and variable frequency of the genomes within the sample. In our work we repurpose HipMer components for the problem of metagenome assembly and we design a versatile, high-performance metagenome assembly pipeline that outperforms state-of-the-art tools in both quality and performance.


Thesis
Peer Reviewed

Communication Avoidance for Algorithms with Sparse All-to-all Interactions

Koanantakool, Penporn
Advisor(s): Yelick, Katherine A

UC Berkeley Electronic Theses and Dissertations (2017)

In parallel computing environments from multicore systems to cloud computers and supercomputers, data movement is the dominant cost in both running time and energy usage. Even worse, hardware trends suggest that the gap between computing and data movement, both in memory systems and interconnect networks, will continue to grow. Minimizing communication is therefore necessary in devising scalable parallel algorithms. This work discusses parallelizing kernels in applications ranging from chemistry and cosmology to machine learning.

We have developed new communication-avoiding algorithms for problems with all-to-all interactions such as many-body and matrix computations, taking into account their sparsity patterns, either from cutoff distance, symmetry, or data sparsity. Our algorithms are communication-efficient (some are provably optimal) and scalable to tens of thousands of processors, exhibiting orders of magnitude speedup over more commonly used algorithms.

These all-to-all computational patterns arise in scientific simulations and machine learning. The last part of the thesis will present a case study of communication-avoiding sparse-dense matrix multiplication as used in graphical model structure learning. The resulting high-performance sparse inverse covariance matrix estimation algorithm enables processing high-dimensional data with arbitrary underlying structures at a scale that was previously intractable, e.g., 1.28 million dimensions (over 800 billion parameters) in under 21 minutes on 24,576 cores of a Cray XC30. Our method is used to automatically estimate the underlying functional connectivity of the human brain from resting-state fMRI data. The results show good agreement with a state-of-the-art clustering, which used manual intervention, from the neuroscience literature.
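As a quick sanity check on the scale quoted above, the sketch below counts the free entries of a symmetric matrix with 1.28 million rows and columns. It assumes that "parameters" refers to the diagonal plus upper triangle of the estimated inverse covariance matrix; the abstract does not state how the figure was computed, and the function name is hypothetical.

```python
def symmetric_parameter_count(p):
    """Number of distinct entries in a symmetric p-by-p matrix
    (diagonal plus upper triangle)."""
    return p * (p + 1) // 2


p = 1_280_000
n_params = symmetric_parameter_count(p)
print(f"{n_params:,}")  # 819,200,640,000
```

Under that assumption the count is about 819 billion, which is consistent with "over 800 billion parameters".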


Thesis
Peer Reviewed

Single Program, Multiple Data Programming for Hierarchical Computations

Kamil, Amir Ashraf
Advisor(s): Yelick, Katherine

UC Berkeley Electronic Theses and Dissertations (2012)

As performance gains in sequential programming have stagnated due to power constraints, parallel computing has become the primary tool for increasing performance. Parallel computing has long been used in scientific computing, and programmers of the future will likely face many of the same challenges that occur in programming large-scale machines. One such challenge is that of hierarchy: machines are built in a hierarchical fashion, with a wide range of communication costs between different parts of a machine, and applications such as divide-and-conquer algorithms often have hierarchical structure.

Large-scale parallel machines are programmed primarily with the single program, multiple data (SPMD) model of parallelism. This model combines independent threads of execution with global collective communication and synchronization operations. Previous work has demonstrated the advantages of SPMD over other models: its simplicity enables productive programming and avoids many classes of parallel errors, and at the same time it is easy to implement and amenable to compiler analysis and optimization. Its local-view execution model allows programmers to take advantage of data locality, resulting in good performance and scalability on large-scale machines. However, it is a flat model that does not fit well with hierarchical machines or algorithms.

In this dissertation, we introduce the recursive single program, multiple data (RSPMD) execution model. This model extends SPMD with hierarchical, structured teams, or groupings of threads. We design RSPMD extensions for the Titanium language, including a hierarchical team data structure and lexically-scoped constructs for operating over teams. We demonstrate that these extensions prevent erroneous use of teams that would result in deadlock. In addition, we present a runtime mechanism for ensuring proper use of both global collective operations and collectives over teams, eliminating more potential sources of deadlock.

As analyzable as SPMD is, we demonstrate that RSPMD can also be analyzed precisely and efficiently. We define a hierarchical pointer analysis for determining which data a pointer can reference, as well as on which threads the referenced data may reside. We then present a series of analyses for computing the set of concurrent statements in both SPMD and RSPMD programs. We show that these analyses improve the results of multiple client analyses, including data-locality and sharing inference, race detection, and memory-model enforcement.

Finally, we present application case studies demonstrating the expressiveness and performance of the RSPMD model. We show that the model enables divide-and-conquer algorithms such as sorting to be elegantly expressed, and that team collective operations increase performance of a conjugate gradient benchmark by up to a factor of two. The model also facilitates optimizations for hierarchical machines, improving scalability of a particle-in-cell application by 8x, performance of sorting by up to 40%, and execution time of a stencil code by as much as 14%.


Thesis
Peer Reviewed

Parallelizing Irregular Applications for Distributed Memory Scalability: Case Studies from Genomics

Ellis, Marquita May
Advisor(s): Yelick, Katherine

UC Berkeley Electronic Theses and Dissertations (2020)

The search for generalizable approaches, models, and frameworks for irregular application scalability is an old yet open area in parallel and distributed computing research. Irregular applications are particularly hard to parallelize and distribute because, by definition, the pattern of computation is dependent upon the input data. With the proliferation of data-driven and data-intensive applications from the realm of Big Data, and the increasing demand for and availability of large-scale computing resources through HPC-Cloud convergence, the importance of generalized approaches to achieving irregular application scalability is only growing.

Rather than offering another software language or framework, this dissertation argues we first need to understand application scalability, especially irregular application scalability, and more closely examine patterns of computation, data sharing, and dependencies. As it stands, predominant performance models and tools from parallel and distributed computing focus on applications that are divided into distinct communication and computation phases, and ignore issues related to memory utilization. While time-tested and valuable, these models are not always sufficient for understanding full application scalability, particularly, the scalability of data-intensive irregular applications. We present application case studies from genomics, highlighting the interdependencies of communication, computation, and memory capacities and performance.

The genomics applications we will examine offer a particularly useful and practical vantage point for this analysis, as they are data-intensive irregular application targets for both HPC and cloud computing. Further, they present an extreme for both domains.

For HPC, they are less akin to traditional, well-studied and well-supported scientific simulations and more akin to text and document analysis applications. For cloud computing, they are an extreme in that they require frequent random global access to memory and data, stressing interconnection network latency and bandwidth and co-scheduled processors for tightly orchestrated computation.

We show how common patterns of irregular all-to-all computation can be managed efficiently, comparing bulk-synchronous approaches built on collective communication and asynchronous approaches based on one-sided communication. For the former, our work is based on the popular Message Passing Interface (MPI) and makes heavy use of globally collective communication operations that exchange data across processors in a single step or, to save memory use, in a set of irregular steps. For the latter, we build on the UPC++ programming framework, which provides lightweight RPC mechanisms, to transfer both data and computational work between processors. We present performance results across multiple platforms including several modern HPC systems and, at least in one case, a cloud computing platform.

With these application case studies, we seek not only to contribute to discussions around parallel algorithm and data structure design, programming systems, and performance modeling within the parallel computing community, but also to contribute to broader work in genomics through software development and analysis. Thus, we develop and present the first distributed memory scalable software for analyzing data sets from the latest generation of sequencing technologies, known as long read data sets. Specifically, we present scalable solutions to the problem of many-to-many long read overlap and alignment, the computational bottleneck to long read assembly, error correction, and direct analysis. Through cross-architectural empirical analysis, we identify the key components to efficient scalability, and highlight the priorities for any future optimization with analytical models.


Thesis
Peer Reviewed

Optimizing Irregular Data Accesses for Cluster and Multicore Architectures

Su, Jimmy Zhigang
Advisor(s): Yelick, Katherine A

UC Berkeley Electronic Theses and Dissertations (2010)

Applications with irregular accesses to shared state are one of the most challenging computational patterns in parallel computing. Accesses can involve either read or write operations, with writes having the additional complexity of requiring some form of synchronization. Irregular accesses perform poorly in local cache-based memory systems and across networks in global distributed memory settings, because they have poor spatial and temporal locality. Irregular accesses arise in transaction processing, in various system level programs, in computing histograms, performing sparse matrix operations, updating meshes in particle-mesh methods, and building adaptive unstructured meshes. Writing codes with asynchronous parallel updates on clusters and multicore processors presents different sets of challenges. On clusters, the goal is to minimize the number of messages and the volume of messages between nodes, while on multicore machines, the goal is to minimize off-chip accesses, since there is a significant performance difference between on-chip and off-chip memory access.

In this dissertation, we explore various analyses, optimizations, and tools for shared accesses on both multicore and distributed memory cluster architectures. On cluster architectures, we consider both irregular reads and writes, demonstrate how Partitioned Global Address Space languages support programming irregular problems, and develop optimizations to minimize communication traffic, both in volume and number of distinct events. On multicore processors, we consider the lower level code generation and tuning problem, independent of any particular source language. We explore performance tradeoffs between various shared update implementations, such as locking, replication of state to avoid collisions, and hybrid versions.

We develop an adaptive implementation that adjusts the shared update strategy based on densities, yielding significant speedups. In addition, we develop a performance debugging tool to find scalability problems in large scientific applications early in the development cycle. Throughout the thesis we perform experiments demonstrating the value of our optimizations and tools in both architectural settings, using a set of benchmarks and applications that include histogram making, sparse matrix computations, and two scientific simulations involving particle-mesh methods. Our results show substantial speedups of up to 4.8X for multicore platforms and 120X for clusters. The result is a comprehensive set of techniques for improving the performance of irregular applications using advanced languages, compilers, analyses, optimizations and tools.


Thesis
Peer Reviewed

Programming Abstractions and Synthesis-Aided Compilation for Emerging Computing Platforms

UC Berkeley Electronic Theses and Dissertations (2018)

Today's cutting-edge applications, ranging from wearable devices and embedded medical sensors to high-performance data centers, put new demands on computer architectures. Those demands include more computation capability, a tight power budget, low latency, high throughput, and many more. To meet these requirements, specialized architectures with low energy consumption are becoming more prevalent. Many of these architectures trade off programmability features for gains in energy efficiency and performance. Hence, programmability challenges are inevitable as applications continue to evolve and make new demands on computing architectures.

I propose key principles for improving programmability intended for application writers as well as compiler developers and language designers. First, I address programmability issues by providing a programming model that hides low-level details but sufficiently exposes essential details for application writers to control. Second, to compile and optimize programs, I apply a new compilation methodology based on synthesis. Unlike a classical compiler's transformation, synthesis obtains a correct and optimal solution by searching for an optimal candidate that is semantically equivalent to a specification program. This search helps compilers generate efficient code without deriving a program via a sequence of transformations, which are challenging for compiler developers to design for new unconventional architectures.

In this thesis, I demonstrate the key principles in three projects: Chlorophyll, a language and compiler for low-power spatial architectures; Floem, a programming system for NIC-accelerated data center applications; and GreenThumb, a framework for building a superoptimizer (an assembly program optimizer based on synthesis).


Article
Peer Reviewed

Making Sequential Consistency Practical in Titanium

UC Berkeley Previously Published Works (2005)

The memory consistency model in shared memory parallel programming controls the order in which memory operations performed by one thread may be observed by another. The most natural model for programmers is to have memory accesses appear to take effect in the order specified in the original program. Language designers have been reluctant to use this strong semantics, called sequential consistency, due to concerns over the performance of memory fence instructions and related mechanisms that guarantee order. In this paper, we provide evidence for the practicality of sequential consistency by showing that advanced compiler analysis techniques are sufficient to eliminate the need for most memory fences and enable high-level optimizations. Our analyses eliminated over 97% of the memory fences that were needed by a naïve implementation, accounting for 87 to 100% of the dynamically encountered fences in all but one benchmark. The impact of the memory model and analysis on runtime performance depends on the quality of the optimizations: more aggressive optimizations are likely to be invalidated by strong memory consistency semantics. We consider two specific optimizations (pipelining of bulk memory copies, and communication aggregation and scheduling for irregular accesses) and show that our most aggressive analysis is able to obtain the same performance as the relaxed model when applied to two linear algebra kernels. While additional work on parallel optimizations and analyses is needed, we believe these results provide important evidence on the viability of using a simple memory consistency model without sacrificing performance.
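What sequential consistency rules out can be seen with a classic two-thread litmus test, sketched here in Python rather than Titanium. The sketch enumerates every interleaving that preserves each thread's program order and records the values the two loads observe. The operation encoding and the names `interleavings` and `run` are assumptions made for this illustration, not anything from the paper.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, target, source in schedule:
        if op == "store":
            mem[target] = source        # store a constant to a shared variable
        else:
            regs[target] = mem[source]  # load a shared variable into a register
    return regs["r1"], regs["r2"]


thread1 = [("store", "x", 1), ("load", "r1", "y")]
thread2 = [("store", "y", 1), ("load", "r2", "x")]
outcomes = {run(s) for s in interleavings(thread1, thread2)}
print(sorted(outcomes))  # [(0, 1), (1, 0), (1, 1)]
```

The outcome `(0, 0)` never appears, because it would require each load to run before the other thread's store while also following its own thread's store. A relaxed model that lets a load complete ahead of an earlier store can produce it, which is why a straightforward sequentially consistent implementation places a fence between the store and the load in each thread.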


Article
Peer Reviewed

Reducing Communication in Graph Neural Network Training

LBL Publications (2020)

Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. We introduce a family of parallel algorithms for training GNNs and show that they can asymptotically reduce communication compared to previous parallel GNN training methods. We implement these algorithms, which are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, using torch.distributed on GPU-equipped clusters. Our algorithms optimize communication across the full GNN training pipeline. We train GNNs on over a hundred GPUs on multiple datasets, including a protein network with over a billion edges.
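For orientation, the simplest of the partitionings named above, a 1D block-row split, can be simulated in plain Python. This sketch is an assumption-laden illustration: it does not use torch.distributed, the function name `spmm_1d` and the dictionary-per-row sparse format are hypothetical, and it models the all-gather of the dense matrix only by counting the words each simulated process would receive.

```python
def spmm_1d(a_rows, h, num_procs):
    """Multiply sparse A (a list of {column: value} dicts, one per row) by
    dense H, with rows of A and H split into contiguous blocks per process."""
    n = len(a_rows)
    width = len(h[0])
    block = (n + num_procs - 1) // num_procs
    owned = [range(p * block, min((p + 1) * block, n)) for p in range(num_procs)]
    # In a 1D layout each process needs all of H, so it receives every row
    # it does not own (modelled here as a count, not as real messages).
    words_received = [(n - len(rows)) * width for rows in owned]
    result = [None] * n
    for p in range(num_procs):
        for i in owned[p]:  # each process computes only its own output rows
            out = [0.0] * width
            for j, a_ij in a_rows[i].items():
                for c, h_jc in enumerate(h[j]):
                    out[c] += a_ij * h_jc
            result[i] = out
    return result, words_received


a_rows = [{0: 1, 1: 2}, {1: 3}, {2: 4, 3: 5}, {0: 6}]
h = [[1, 0], [0, 1], [1, 1], [2, 3]]
result, words_received = spmm_1d(a_rows, h, num_procs=2)
print(result)
print(words_received)  # [4, 4]
```

In this model each process receives nearly all of H no matter how many processes are used, so per-process communication stays roughly flat as the machine grows. That gives one intuition for why partitionings beyond 1D are of interest.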


Article
Peer Reviewed

BCL

UC Berkeley Previously Published Works (2019)

One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.