Article

Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures

Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications for scientific algorithm development.
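
The idea of an auto-tuner that "searches over optimizations and their parameters to minimize runtime" can be illustrated with a minimal sketch. It assumes an exhaustive search over a small parameter grid and keeps the fastest configuration; the function name `autotune`, its arguments, and the injectable `timer` are hypothetical and are not taken from the paper, whose actual search strategy is not described in this abstract.

```python
import itertools
import time


def autotune(kernel, param_space, trials=3, timer=time.perf_counter):
    """Time kernel(**params) for every parameter combination.

    Hypothetical helper: returns (best_params, best_time), where best_time
    is the minimum elapsed time observed over `trials` runs.
    """
    names = sorted(param_space)
    best_params, best_time = None, float("inf")
    for values in itertools.product(*(param_space[n] for n in names)):
        params = dict(zip(names, values))
        elapsed = float("inf")
        for _ in range(trials):
            start = timer()
            kernel(**params)
            # Keep the fastest run to reduce timing noise.
            elapsed = min(elapsed, timer() - start)
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    return best_params, best_time
```

A caller would pass a kernel that accepts the tuning parameters as keyword arguments, for example `autotune(run_stencil, {"bx": [8, 16, 32], "by": [4, 8]})`, where `run_stencil` is a placeholder for the code being tuned. Exhaustive search is only practical for small grids; larger spaces would need pruning or heuristics.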

Article

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

Datta, Kaushik

Lawrence Berkeley National Laboratory (2009)

Article
Peer Reviewed

Titanium Language Reference Manual (Version 2.20)

UC Berkeley Previously Published Works (2006)

The Titanium language is a Java dialect for high-performance parallel scientific computing. Titanium’s differences from Java include multi-dimensional arrays, an explicitly parallel SPMD model of computation with a global address space, a form of value class, and zone-based memory management. This reference manual describes the differences between Titanium and Java.

Article
Peer Reviewed

Implicit and explicit optimizations for stencil computations

UC Berkeley Previously Published Works (2006)

Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache-oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache-oblivious approach and that the explicitly managed memory on Cell is more efficient: relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster. Copyright 2006 ACM.
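
To make the notion of a blocked stencil concrete, here is a minimal sketch of a 5-point Jacobi sweep with simple spatial blocking. Note that this is a simpler technique than the one the abstract describes: it tiles a single sweep, whereas the paper's optimizations target reuse across multiple sweeps. The function names and the block sizes `bi` and `bj` are illustrative assumptions; in practice they would be chosen to fit the target cache.

```python
def jacobi_sweep(src, dst):
    """One unblocked 5-point Jacobi sweep over the interior of a 2D grid."""
    n, m = len(src), len(src[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                + src[i][j - 1] + src[i][j + 1])


def jacobi_sweep_blocked(src, dst, bi=32, bj=32):
    """Same sweep, visiting the interior in bi-by-bj tiles.

    bi and bj are assumed tuning parameters, not values from the paper.
    """
    n, m = len(src), len(src[0])
    for ii in range(1, n - 1, bi):          # tile origin, rows
        for jj in range(1, m - 1, bj):      # tile origin, columns
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, m - 1)):
                    dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                        + src[i][j - 1] + src[i][j + 1])
```

Both functions compute identical results; only the traversal order differs, which is what lets a tile's working set stay resident in cache on real hardware. Pure Python will not show the speedup itself; the sketch only illustrates the loop structure.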

Article
Peer Reviewed

Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures.

UC Berkeley Previously Published Works (2008)

Article
Peer Reviewed

Titanium Language Reference Manual (Version 2.19)

UC Berkeley Previously Published Works (2005)

The Titanium language is a Java dialect for high-performance parallel scientific computing. Titanium’s differences from Java include multi-dimensional arrays, an explicitly parallel SPMD model of computation with a global address space, a form of value class, and zone-based memory management. This reference manual describes the differences between Titanium and Java.

Article
Peer Reviewed

Productivity and performance using partitioned global address space languages

UC Berkeley Previously Published Works (2007)

Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC), is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java™ designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages of these languages. Copyright 2007 ACM.
