Future supercomputers will rely on massive on-chip parallelism, which requires dramatic changes to node architecture. Nodes will become more heterogeneous and hierarchical, with software-managed on-chip memory becoming more prevalent. To meet performance expectations, application software will undergo extensive redesign. In response, support from programming models is crucial to help scientists adopt new technologies without requiring significant programming effort. In this dissertation, we address the programming issues of a massively parallel single-chip processor with software-managed memory. We propose the Mint programming model and domain-specific compiler as a means of simplifying application development. Mint abstracts away the programmer's view of the hardware by providing a high-level interface to low-level architecture-specific optimizations. The Mint model requires only modest recoding of the application and is based on a small number of compiler directives, which are sufficient to take advantage of massive parallelism.

We have implemented the Mint model on a concrete instance of a massively parallel single-chip processor: the Nvidia GPU (Graphics Processing Unit). The Mint source-to-source translator accepts C source with Mint annotations and generates CUDA C. The translator includes a domain-specific optimizer targeting stencil methods, which arise in image processing applications and in a wide range of partial differential equation solvers. The Mint optimizer performs data locality optimizations and uses on-chip memory to reduce memory accesses, which is particularly beneficial for stencil methods.

We have demonstrated the effectiveness of Mint on a set of widely used stencil kernels and on three real-world applications: an earthquake-induced seismic wave propagation code, an interest point detection algorithm for volume datasets, and a model for signal propagation in cardiac tissue. In cases where hand-coded implementations are available, we have verified that Mint delivers competitive performance, realizing around 80% of the performance of the hand-optimized CUDA implementations of the kernels and applications on the Tesla C1060 and C2050 GPUs. By facilitating high-level management of on-chip parallelism and the memory hierarchy, Mint enables computational scientists to shorten their software development time. Furthermore, by performing domain-specific optimizations, Mint delivers high performance for stencil methods.
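To illustrate the programming interface, the sketch below shows how a 7-point 3D stencil sweep might be annotated with Mint-style directives. It is a minimal, illustrative example: the directive spellings (parallel, copy, for with nest/tile clauses) follow the general form of the Mint model, but the particular clause values, array sizes, and function name used here are assumptions chosen for exposition rather than a definitive code listing.

/* Illustrative Mint-annotated C: one Jacobi sweep of a 7-point 3D stencil.
   A standard C compiler ignores the unrecognized pragmas and runs this
   serially; the Mint translator would use them to generate CUDA C. */

#define N 128   /* interior grid points per dimension (illustrative size) */

void stencil_sweep(double u[N+2][N+2][N+2], double unew[N+2][N+2][N+2])
{
    /* Transfer the arrays to device memory before the accelerated region. */
    #pragma mint copy(u, toDevice, N+2, N+2, N+2)
    #pragma mint copy(unew, toDevice, N+2, N+2, N+2)

    /* Everything inside this region is offloaded; the for directive asks
       the translator to parallelize all three loop nests and tile the
       iterations into thread blocks (tile sizes here are illustrative). */
    #pragma mint parallel
    {
        #pragma mint for nest(all) tile(16,16,1)
        for (int k = 1; k <= N; k++)
            for (int j = 1; j <= N; j++)
                for (int i = 1; i <= N; i++)
                    unew[k][j][i] = (u[k][j][i-1] + u[k][j][i+1] +
                                     u[k][j-1][i] + u[k][j+1][i] +
                                     u[k-1][j][i] + u[k+1][j][i]) / 6.0;
    }

    /* Copy the result back to the host after the parallel region. */
    #pragma mint copy(unew, fromDevice, N+2, N+2, N+2)
}

Under this style of annotation, the loop nest itself is unchanged from the serial code; the directives carry the information the translator needs to generate CUDA kernels, manage host-device data transfers, and apply the stencil-specific on-chip memory optimizations described above.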