As CMOS technology scaling slows, it can no longer keep pace with the ever-increasing demand for computational power. This mismatch has driven the adoption of heterogeneous systems, which pair general-purpose processors such as CPUs with specialized accelerators such as FPGAs and GPUs. These systems continue to scale performance and efficiency for specialized tasks, but they retain only limited generality. Programming them therefore requires specialized knowledge of hardware architectures and low-level programming models, which poses a substantial barrier to software developers.
This dissertation addresses the challenges software developers face in leveraging heterogeneous computing resources, particularly FPGA acceleration. We identify three major limitations: limited programmability support for domain-specific resources, difficulty in achieving high performance and efficiency, and time-consuming porting across diverse computational architectures. We present novel approaches and tools that bridge the gap between high-level software development and efficient hardware implementation, making heterogeneous computing accessible to a broader range of developers.
We introduce Heterosys, an end-to-end optimization framework that simplifies heterogeneous hardware development. Heterosys decouples algorithmic descriptions from the underlying hardware fabric and offers layout-driven and architecture-driven design generation, connecting high-level designs to low-level hardware details.
The frontend of Heterosys is HeteroRefactor, which combines dynamic invariant analysis, automated refactoring, and selective offloading. HeteroRefactor automatically refactors software code to make it FPGA-compatible and hardware-friendly, reducing chip resource usage through bitwidth optimization and floating-point precision tuning. It specializes the resulting kernels for common-case inputs and offloads them to accelerators, while a CPU fallback mechanism preserves correctness for inputs outside the common case.
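To make the idea of invariant-guided selective offloading concrete, the following Python sketch applies it to a toy kernel that sums a list. It is an illustration under stated assumptions, not HeteroRefactor's implementation: the "accelerator" is emulated in software, and the names `Invariants`, `infer_invariants`, `required_bitwidth`, `wrap_signed`, `accelerated_sum`, and `dispatch` are hypothetical. Invariants observed on profiling inputs size a fixed buffer and narrow the integer bitwidths; at run time, inputs that satisfy the invariants take the accelerated path, and all others fall back to the CPU.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Invariants:
    max_len: int   # longest input observed while profiling
    max_abs: int   # largest absolute element value observed
    bitwidth: int  # signed element width sufficient for every observed value


def required_bitwidth(max_abs: int) -> int:
    """Smallest two's-complement width that holds every value in [-max_abs, max_abs]."""
    return max_abs.bit_length() + 1  # +1 for the sign bit


def infer_invariants(profile: Sequence[Sequence[int]]) -> Invariants:
    """Derive likely invariants from common-case (profiling) inputs."""
    max_len = max(len(xs) for xs in profile)
    max_abs = max((abs(x) for xs in profile for x in xs), default=0)
    return Invariants(max_len, max_abs, required_bitwidth(max_abs))


def wrap_signed(value: int, width: int) -> int:
    """Emulate storing `value` in a width-bit two's-complement register."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def accelerated_sum(xs: Sequence[int], inv: Invariants) -> int:
    """Emulated accelerator: statically sized buffer and narrowed integer widths."""
    acc_width = required_bitwidth(inv.max_len * inv.max_abs)  # worst-case |sum|
    buffer = [0] * inv.max_len  # sized from the length invariant
    buffer[: len(xs)] = xs
    acc = 0
    for x in buffer:
        acc = wrap_signed(acc + wrap_signed(x, inv.bitwidth), acc_width)
    return acc


def dispatch(xs: Sequence[int], inv: Invariants) -> Tuple[int, bool]:
    """Offload when the invariants hold; otherwise fall back to the CPU."""
    if len(xs) <= inv.max_len and all(abs(x) <= inv.max_abs for x in xs):
        return accelerated_sum(xs, inv), True
    return sum(xs), False  # CPU fallback preserves correctness


profile = [[3, -5, 7], [1, 2], [-6, 4, 0, 2]]
inv = infer_invariants(profile)
print(inv)                           # Invariants(max_len=4, max_abs=7, bitwidth=4)
print(dispatch([7, -7, 7, 7], inv))  # (14, True): offloaded
print(dispatch([100], inv))          # (100, False): CPU fallback
```

Because the emulated registers are sized from the invariants, the narrowed arithmetic is exact whenever the run-time check passes; the check is what lets the hardware path stay small without sacrificing correctness.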
Adroit then optimizes the individual synthesizable hardware kernels, using static analysis to identify data and control broadcasts. It analyzes data and control dependencies in both the source code and the high-level synthesis (HLS) reports, trading additional clock-cycle latency for a higher operating frequency. By optimizing the FPGA architecture that HLS tools generate, Adroit relieves software developers of the need to understand the underlying fabric.
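The latency-for-frequency trade-off can be illustrated with a small Python sketch, which is not Adroit's algorithm. It flags high-fanout signals in a hypothetical fanout map as broadcasts and estimates how many register stages a tree with a bounded branching factor would add. The fanout threshold, the branching factor, the function names, and especially the linear delay model in `toy_fmax_mhz` are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def find_broadcasts(fanout: Dict[str, List[str]], threshold: int) -> List[str]:
    """Signals driving more than `threshold` loads (e.g. a value shared by
    unrolled processing elements, or a shared pipeline stall signal)."""
    return sorted(src for src, sinks in fanout.items() if len(sinks) > threshold)


def register_tree_depth(n_sinks: int, branch: int) -> int:
    """Extra pipeline stages (cycles of latency) needed so that no register
    drives more than `branch` loads."""
    depth = 0
    while n_sinks > branch:
        n_sinks = math.ceil(n_sinks / branch)
        depth += 1
    return depth


def plan_broadcast_pipelining(fanout: Dict[str, List[str]],
                              threshold: int = 16, branch: int = 8) -> Dict[str, int]:
    """Map each detected broadcast to the latency it costs to pipeline it."""
    return {src: register_tree_depth(len(fanout[src]), branch)
            for src in find_broadcasts(fanout, threshold)}


def toy_fmax_mhz(max_fanout: int, base_ns: float = 1.0, per_load_ns: float = 0.01) -> float:
    """ASSUMED toy delay model: net delay grows linearly with fanout."""
    return 1000.0 / (base_ns + per_load_ns * max_fanout)


fanout = {
    "a_reg": [f"pe{i}" for i in range(256)],    # data broadcast to 256 unrolled PEs
    "stall": [f"stage{i}" for i in range(64)],  # control broadcast to 64 stages
    "b": ["x", "y", "z"],                       # ordinary low-fanout net
}
print(plan_broadcast_pipelining(fanout))          # {'a_reg': 2, 'stall': 1}
print(round(toy_fmax_mhz(256)), round(toy_fmax_mhz(8)))  # 281 926 (toy numbers)
```

The sketch treats data and control broadcasts alike for brevity. A real tool cannot simply delay a control broadcast such as a stall signal, because doing so changes the pipeline's flow-control behavior.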
At the backend, Heterosys composes multiple kernels into an optimized FPGA system using RapidIR, a comprehensive infrastructure for high-level physical synthesis optimizations. RapidIR integrates coarse-grained floorplanning with high-level pipelining and supports hierarchical composition of heterogeneous designs from diverse sources. It automates the exploration of physical optimization strategies, freeing programmers from hand-crafting a hardware layout for each target device.
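The coupling between coarse-grained floorplanning and pipelining can be sketched in Python as follows. This is a toy model, not RapidIR's formulation. It assumes the device is divided into a grid of slots with a uniform resource capacity, that kernels communicate over latency-insensitive channels (so inserting registers preserves functionality), and that each slot boundary a channel crosses receives one register stage. `floorplan`, `pipeline_stages`, and the resource numbers are hypothetical, and the exhaustive search only works for tiny examples; a practical tool would need a scalable formulation.

```python
import itertools
from typing import Dict, List, Tuple

Slot = Tuple[int, int]  # (row, col) of a coarse-grained device region
Edge = Tuple[str, str]  # latency-insensitive channel between two kernels


def pipeline_stages(a: Slot, b: Slot) -> int:
    """One register stage per slot boundary crossed (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def floorplan(kernels: Dict[str, int], edges: List[Edge],
              slots: List[Slot], capacity: int):
    """Exhaustively assign kernels to slots, respecting per-slot capacity and
    minimizing total pipeline stages; return placement and stages per edge."""
    names = sorted(kernels)
    best = None
    for choice in itertools.product(slots, repeat=len(names)):
        placement = dict(zip(names, choice))
        usage: Dict[Slot, int] = {}
        for name, slot in placement.items():
            usage[slot] = usage.get(slot, 0) + kernels[name]
        if any(u > capacity for u in usage.values()):
            continue  # infeasible: a slot is over capacity
        cost = sum(pipeline_stages(placement[u], placement[v]) for u, v in edges)
        if best is None or cost < best[0]:
            best = (cost, placement)
    if best is None:
        raise ValueError("no feasible floorplan")
    placement = best[1]
    stages = {(u, v): pipeline_stages(placement[u], placement[v]) for u, v in edges}
    return placement, stages


kernels = {"load": 60, "compute": 70, "store": 60, "ctrl": 20}  # resource units
edges = [("load", "compute"), ("compute", "store"), ("ctrl", "compute")]
slots = [(r, c) for r in range(2) for c in range(2)]  # 2x2 slot grid
placement, stages = floorplan(kernels, edges, slots, capacity=100)
print(stages)  # {('load', 'compute'): 1, ('compute', 'store'): 1, ('ctrl', 'compute'): 0}
```

Here the capacity constraint forces `load`, `compute`, and `store` into separate slots, so the floorplan directly determines how many pipeline registers each cross-slot channel needs.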
Our research demonstrates substantial performance improvements across diverse applications and benchmarks, including genomic sequencing and large language model acceleration. Our FPGA optimization techniques improve operating frequency by 30% to over 100% compared with state-of-the-art EDA tools, reduce resource requirements by 21% to over 90%, and reduce the code required for porting between platforms by 51%.
This dissertation contributes a comprehensive set of methodologies and tools that significantly lower the barriers to entry for heterogeneous computing, particularly FPGA acceleration. By abstracting away much of the hardware complexity, our work paves the way for broader adoption of heterogeneous acceleration in software development practices, potentially driving research innovation and performance improvements across a wide range of applications and industries.