With the rapid development of deep learning, many model variants, such as Transformers, Deep Learning Recommendation Models, and Graph Neural Networks, have been proposed to tackle increasingly difficult machine learning tasks. These emerging deep learning models, while significantly better than prior methods, also require far more hardware resources to train and deploy. To tackle this issue systematically, we approach it from a data-centric perspective and argue that the root cause of the software-hardware imbalance is the data explosion in emerging deep learning models.
To continue the scaling of deep learning applications and bridge the gap between hardware performance and application requirements, this dissertation proposes to leverage data redundancy to reduce model cost and benefit hardware design. We first categorize the data in a deep learning model into three types: the input dataset, model parameters, and computational results. While parameter redundancy has been extensively studied in prior work, redundancy in input data representation and in computational results is rarely discussed. Based on this observation, we introduce four software-hardware co-designs that exploit these two other types of data redundancy and thus improve deep learning efficiency. Specifically, to reduce the cost of intermediate computational results in Transformer models, the first two designs leverage dynamic runtime approximation with a customized GPU kernel and an ASIC design. To relieve the memory and computation burden caused by massive input training data, the third and fourth designs use high-order tensor decomposition with domain-specific knowledge to achieve high-quality, aggressive data compression. Training on the compressed dataset yields comparable model accuracy with far lower hardware resource consumption.
Overall, this dissertation demonstrates opportunities and approaches to tackle data explosion in emerging deep learning models. Our methods cover both dynamically generated data and offline trainable data, both deep learning training and inference, and both general computing platforms (e.g., GPGPU) and customized accelerators.