Accelerating Transformers with Systolic Array and Computing-in-memory
eScholarship
Open Access Publications from the University of California

UCLA Electronic Theses and Dissertations


No data is associated with this publication.
Abstract

Transformers have shown sweeping success in the natural language processing (NLP) area. As model sizes grow, Transformers suffer from massive data movement between memory and computing cores and become memory-bound. Computing-in-memory (CIM) processors have emerged to tackle this problem through in-situ computing. Moreover, CIM provides huge computing parallelism, making it promising for Transformer acceleration. However, accelerating Transformers with CIM is challenging because of mismatched computing patterns and low computing precision. This thesis proposes a hybrid dual-core processor containing a CIM core and a systolic array (SA) to accelerate Transformers. The SA serves as a general-purpose parallel computing unit that handles various patterns of matrix multiplication, while the CIM core works as a dedicated accelerator for weight-stationary vector-matrix multiplications. Furthermore, we propose an accuracy-bound workload allocation strategy based on layer-wise accuracy sensitivity analysis, accounting for the impact of nonideal characteristics in analog computing. We also explore in depth the influence of the interconnection and of the computing-power ratio between CIM and SA. Finally, we perform compiler and hardware co-optimization to determine the optimal system configuration. Experimental results show that our work achieves 290.21×, 9.47×, 4.46× and 3.48× speedup compared to CPU, GPU, THU23 and IBM23, respectively.
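The accuracy-bound workload allocation described above can be illustrated with a minimal sketch. This is a hypothetical greedy formulation, not the thesis's actual algorithm: each layer has an estimated accuracy drop if mapped to the noisy analog CIM core, and layers are mapped to CIM in order of increasing sensitivity until a total accuracy-loss budget is exhausted; the remaining layers fall back to the exact SA. The layer names and sensitivity values are invented for illustration.

```python
def allocate_layers(sensitivities, accuracy_bound):
    """Greedy accuracy-bound allocation (illustrative sketch only).

    sensitivities: dict mapping layer name -> estimated accuracy drop
                   if that layer runs on the analog CIM core.
    accuracy_bound: total accuracy loss the system may tolerate.
    Returns a dict mapping each layer to "CIM" or "SA".
    """
    assignment = {}
    budget = accuracy_bound
    # Map the least-sensitive layers to CIM first; the rest stay on the SA.
    for layer, drop in sorted(sensitivities.items(), key=lambda kv: kv[1]):
        if drop <= budget:
            assignment[layer] = "CIM"
            budget -= drop
        else:
            assignment[layer] = "SA"
    return assignment

# Hypothetical per-layer sensitivities (fractions of accuracy lost on CIM).
layers = {"attn.qkv": 0.05, "attn.out": 0.30, "ffn.up": 0.10, "ffn.down": 0.40}
print(allocate_layers(layers, accuracy_bound=0.5))
# -> {'attn.qkv': 'CIM', 'ffn.up': 'CIM', 'attn.out': 'CIM', 'ffn.down': 'SA'}
```

In practice, the per-layer sensitivities would come from the layer-wise analysis mentioned in the abstract (e.g., injecting CIM noise into one layer at a time and measuring end-to-end accuracy), and the allocator would also weigh each layer's speedup on CIM versus SA.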


This item is under embargo until September 13, 2025.