Davidson, Andrew; Owens, John D.

doi:10.1145/1964179.1964185

Register Packing for Cyclic Reduction: A Case Study

2011

Published Web Location

https://doi.org/10.1145/1964179.1964185

Abstract

We generalize a method for avoiding GPU shared-memory communication in algorithms with a downsweep pattern. We apply this generalization to Cyclic Reduction (CR), a tridiagonal solver that exhibits this pattern. Previously, Cyclic Reduction performed poorly compared to other tridiagonal solvers on the GPU due to shared-memory bandwidth bottlenecks and step-efficiency issues. We address this problem by applying our methodology for reducing shared-memory communication in the downsweep. Our re-mapping also allows Cyclic Reduction to solve larger systems directly in a virtual block. By using our generalized mapping, we improve Cyclic Reduction's performance on a GPU by a factor of 3x to 4.5x over the original CR implementation, making it 1.5x to 3x faster than other GPU tridiagonal solvers.

Institute for Data Analysis and Visualization
