In the medical imaging field, we need fast deformable registration methods especially in intra-operative settings characterized by their time-critical applications. Image registration studies which are based on Graphics Processing Units (GPUs) provide fast implementations. However, only a small number of these GPU-based studies concentrate on deformable registration. We implemented Demons, a widely used deformable image registration algorithm, on NVIDIA's Quadro FX 5600 GPU with the Compute Unified Device Architecture (CUDA) programming environment. Using our code, we registered 3D CT lung images of patients. Our results show that we achieved the fastest runtime among the available GPU-based Demons implementations. Additionally, regardless of the given dataset size, we provided a factor of 55 speedup over an optimized CPU-based implementation. Hence, this study addresses the need for on-line deformable registration methods in intra-operative settings by providing the fastest and most scalable Demons implementation available to date. In addition, it provides an implementation of a deformable registration algorithm on a GPU, an understudied type of registration in the general-purpose computation on graphics processors (GPGPU) community.