Transformers have achieved sweeping success in natural language processing (NLP). As model size grows, Transformers suffer from massive data movement between
memory and computing cores and become memory-bound. Computing-in-memory (CIM)
processors have emerged to tackle this problem through in-situ computing. Moreover, CIM
provides huge computing parallelism, making it promising for Transformer acceleration.
However, accelerating Transformers with CIM is challenging because of mismatched
computing patterns and limited computing precision. This thesis proposes a hybrid dual-core
processor containing a CIM and a systolic array (SA) to accelerate Transformers. The SA
serves as a general-purpose parallel computing unit that handles various patterns of matrix
multiplications, while the CIM works as a dedicated accelerator for weight-stationary vector-matrix
multiplications. Furthermore, we propose an accuracy-bound workload allocation strategy
through layer-wise accuracy sensitivity analysis, considering the impact of nonideal characteristics
in analog computing. We also explore in depth the influence of the interconnection scheme and the
computing-power ratio between the CIM and the SA. Finally, we perform compiler and hardware
co-optimization to determine the optimal system configurations. Experimental results show
that our work achieves 290.21×, 9.47×, 4.46×, and 3.48× speedups compared to a CPU, a GPU,
THU23, and IBM23, respectively.
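The accuracy-bound workload allocation idea can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual algorithm: the greedy rule, the layer names, and the sensitivity numbers are all assumptions. The intuition it captures is that layers least sensitive to analog nonidealities are mapped to the CIM until an accuracy budget is exhausted, and the remaining layers fall back to the SA.

```python
# Hypothetical sketch of accuracy-bound workload allocation between CIM and SA.
# Layer names and per-layer accuracy-drop estimates below are illustrative only.

def allocate_layers(layer_sensitivity, accuracy_bound):
    """Greedily map layers to CIM, least noise-sensitive first,
    until the summed estimated accuracy drop would exceed the bound."""
    order = sorted(layer_sensitivity, key=layer_sensitivity.get)
    mapping, total_drop = {}, 0.0
    for layer in order:
        drop = layer_sensitivity[layer]
        if total_drop + drop <= accuracy_bound:
            mapping[layer] = "CIM"  # weight-stationary vector-matrix multiplication
            total_drop += drop
        else:
            mapping[layer] = "SA"   # general matrix multiplication on the systolic array
    return mapping

# Example: assumed accuracy drop (%) per layer under analog nonidealities.
sensitivity = {"attn_qkv": 0.05, "attn_out": 0.02, "ffn_1": 0.40, "ffn_2": 0.10}
print(allocate_layers(sensitivity, accuracy_bound=0.2))
```

Under these made-up numbers, the three least sensitive layers fit within the 0.2% budget and go to the CIM, while the most sensitive layer is kept on the SA.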