The emergence of data-intensive applications, such as Deep Neural Networks (DNNs), exacerbates the well-known memory bottleneck in computer systems and demands early attention in the design flow. Electronic System-Level (ESL) design using Transaction Level Modeling (TLM) enables early performance estimation, efficient design space exploration, and gradual refinement. In this dissertation, we present our exploratory modeling framework for hardware-software codesign based on IEEE SystemC TLM, with a particular focus on exposing parallelism and memory contention. We demonstrate the effectiveness of our approach for representative large DNNs such as GoogLeNet and Single Shot MultiBox Detector.
First, we study the impact of communication mechanisms on the available parallelism in TLM models. Specifically, we demonstrate the impact of varying synchronization mechanisms and buffering schemes on the exposed parallelism using different modeling styles of a DNN. We measure the performance of aggressive out-of-order parallel discrete event simulation and analyze the available parallelism in the models. Our study suggests that increased parallel simulation performance indicates better models with higher amounts of parallelism exposed.
Second, we explore the critical aspects of modeling and analyzing timing accuracy with respect to memory contention. A major hurdle in tackling the memory bottleneck is that memory contention is detected late in the design cycle, when detailed timed or cycle-accurate models are developed. A bottleneck detected at such a late stage can severely limit the available design choices or even require costly redesign. To explore new architectures prior to RTL implementation, we propose a novel TLM-2.0 loosely-timed contention-aware (LT-CA) modeling style that offers high-speed simulation close to traditional loosely-timed (LT) models, yet shows the same accuracy for memory contention as low-level approximately-timed (AT) models.
Finally, we further refine the TLM-2.0 AT model by adding a cycle-accurate model of a memory subsystem. This model provides higher timing accuracy for contention analysis and hence a more accurate estimation of performance. We revise our LT-CA memory delay modeling to achieve accuracy comparable to the cycle-accurate AT model of the shared memory subsystem. The high amount of contention on the shared memory suggests the need to move toward new processor architectures with local memories.