Accelerating Transformers with Systolic Array and Computing-in-Memory
- Miao, Siyuan
- Advisor(s): He, Lei
Abstract
Transformers have achieved sweeping success in natural language processing (NLP). As model sizes grow, Transformers suffer from massive data movement between memory and compute cores and become memory-bound. Computing-in-memory (CIM) processors have emerged to tackle this problem through in-situ computing. Moreover, CIM provides massive computing parallelism, making it promising for Transformer acceleration. However, accelerating Transformers with CIM is challenging because of the mismatch in computing patterns and the low precision of analog computing. This thesis proposes a hybrid dual-core processor that pairs a CIM core with a systolic array (SA) to accelerate Transformers. The SA serves as a general-purpose parallel computing unit that handles matrix multiplications of various patterns, while the CIM core acts as a dedicated accelerator for weight-stationary vector-matrix multiplications. Furthermore, we propose an accuracy-bound workload allocation strategy based on layer-wise accuracy sensitivity analysis, which accounts for the impact of nonideal characteristics in analog computing. We also explore in depth the influence of the interconnect and of the computing-power ratio between the CIM core and the SA. Finally, we perform compiler and hardware co-optimization to determine the optimal system configuration. Experimental results show that our work achieves 290.21×, 9.47×, 4.46×, and 3.48× speedups over CPU, GPU, THU23, and IBM23, respectively.
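The accuracy-bound allocation strategy mentioned above can be pictured as a small greedy heuristic. The Python sketch below is illustrative only, not the thesis's implementation: the layer names, the profiled `acc_drop_on_cim` and `cim_speedup` values, and the assumption that per-layer accuracy drops add up linearly are hypothetical stand-ins for the layer-wise sensitivity analysis described in the abstract.

```python
# A minimal sketch (assumptions noted above) of accuracy-bound workload
# allocation: each layer's sensitivity to analog-CIM nonidealities is
# assumed to have been profiled offline, and layers are greedily mapped
# to CIM (best speed gained per accuracy lost first) until the accuracy
# budget is exhausted; everything else falls back to the digital SA.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    acc_drop_on_cim: float  # hypothetical profiled accuracy loss if run on CIM
    cim_speedup: float      # hypothetical speedup of CIM over SA for this layer

def allocate(layers: list[Layer], acc_budget: float) -> dict[str, str]:
    """Map each layer to 'CIM' or 'SA', keeping the total estimated
    accuracy drop within acc_budget (drops assumed additive)."""
    mapping = {layer.name: "SA" for layer in layers}  # safe default: digital SA
    spent = 0.0
    # Prefer layers that gain the most speed per unit of accuracy lost.
    for layer in sorted(layers, key=lambda l: l.acc_drop_on_cim / l.cim_speedup):
        if spent + layer.acc_drop_on_cim <= acc_budget:
            mapping[layer.name] = "CIM"
            spent += layer.acc_drop_on_cim
    return mapping

if __name__ == "__main__":
    # Toy numbers only; a real flow would profile these per layer.
    layers = [
        Layer("attn_qkv_proj", acc_drop_on_cim=0.05, cim_speedup=6.0),
        Layer("ffn_up_proj",   acc_drop_on_cim=0.10, cim_speedup=8.0),
        Layer("attn_scores",   acc_drop_on_cim=0.40, cim_speedup=1.5),
    ]
    print(allocate(layers, acc_budget=0.2))
    # -> {'attn_qkv_proj': 'CIM', 'ffn_up_proj': 'CIM', 'attn_scores': 'SA'}
```

Note that in the processor described above, dynamic activation-by-activation products (e.g., attention scores) would go to the SA regardless, since the CIM core only handles weight-stationary vector-matrix multiplications; the sensitivity-driven choice applies to the weight-stationary layers.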