In this paper, we investigate the suitability of the GPU for a parallel implementation of pinwheel error diffusion. We demonstrate a high-performance GPU implementation obtained by efficiently parallelizing and unrolling the image processing algorithm. Our GPU implementation achieves a 10-30x speedup over a two-threaded CPU error diffusion implementation while producing comparable image quality. We conduct experiments to study the trade-offs between performance and quality across different image block sizes. We also present a performance analysis at the assembly level to identify the performance bottlenecks.