The memory requirements of emerging applications, especially machine learning workloads, are outpacing the capacity of traditional memory devices like DRAM. At the same time, heterogeneity in the memory hierarchy is emerging on multiple fronts, both with high-capacity, low-bandwidth devices like Intel Optane Data-Center (DC) Persistent Memory Modules (PMM) and with low-capacity, high-bandwidth devices like High Bandwidth Memory (HBM). A fundamental question introduced by this heterogeneity is: how do we efficiently manage application data to fully exploit the properties of the underlying memory technologies? This work explores techniques and ideas towards answering this question and understanding the performance implications of heterogeneous memory.
First, Intel’s DRAM cache mode for Optane DC is reverse engineered using a suite of microbenchmarks and large-scale machine learning applications. It is discovered that for machine learning training applications with large memory footprints and for large-scale graph analytics, the DRAM cache behaves poorly, with significant access amplification and low bandwidth utilization. There are three reasons for this performance degradation: (1) an inflexible direct-mapped policy leading to conflict misses, (2) poor traffic shaping caused by on-demand accesses and metadata management, and (3) a lack of program semantic insight leading to many unnecessary and slow dirty data writebacks.
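The first of these reasons can be illustrated with a toy model. The following is a minimal sketch of a generic direct-mapped cache, not a model of Intel's actual DRAM cache; the function name, the 64-byte line size, and the four-line cache are assumptions chosen only for illustration. It shows how two addresses that are exactly one cache size apart evict each other on every access, even though the rest of the cache is empty.

```python
def count_misses(addresses, num_lines, line_size=64):
    """Count misses for an address trace in a toy direct-mapped cache.

    Hypothetical helper for illustration: each cache line index can hold
    exactly one tag, so two lines that share an index always conflict.
    """
    tags = [None] * num_lines          # one stored tag per cache line
    misses = 0
    for addr in addresses:
        line = addr // line_size       # which memory line is touched
        index = line % num_lines       # the only slot this line may occupy
        tag = line // num_lines        # distinguishes lines sharing a slot
        if tags[index] != tag:
            misses += 1
            tags[index] = tag          # evict whatever was there before
    return misses


num_lines = 4
line_size = 64
cache_bytes = num_lines * line_size

# Two addresses one cache size apart map to the same slot and thrash.
conflicting = [0, cache_bytes] * 4
# Two adjacent lines map to different slots and coexist.
friendly = [0, line_size] * 4

print(count_misses(conflicting, num_lines, line_size))
print(count_misses(friendly, num_lines, line_size))
```

In this sketch the conflicting trace misses on every access, while the friendly trace misses only on the first touch of each line. A set-associative or software-managed policy would avoid the thrashing in the first case.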
Next, AutoTM, a profile-guided, compiler-based optimization technique that uses Integer Linear Programming to derive optimal tensor placement and movement for machine learning training in heterogeneous memory systems, is presented. The nGraph compiler is modified to implement AutoTM for two different systems: a CPU-based system with a combination of DRAM and Optane DC, and a GPU-based system capable of using both GPU and CPU memory. For DRAM/Optane DC systems, AutoTM outperforms the DRAM cache by as much as 3×; for GPU/CPU systems, it outperforms the transparent cudaMallocManaged by as much as 4×.
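To give a flavor of the placement problem, the sketch below solves a heavily simplified version by exhaustive search. It is not AutoTM's formulation: it uses brute-force enumeration in place of an ILP solver, it ignores tensor lifetimes and data movement entirely, and the tensor names, sizes, and per-tier costs are made-up values assumed for illustration. Each tensor gets one binary decision (DRAM or PMM), total cost is minimized, and the DRAM capacity acts as the constraint.

```python
from itertools import product


def best_placement(tensors, dram_capacity):
    """Exhaustively pick a DRAM/PMM tier for each tensor.

    tensors: list of (name, size, cost_if_in_dram, cost_if_in_pmm).
    Returns (minimum total cost, {name: tier}). Hypothetical helper;
    a real ILP solver would replace this enumeration at scale.
    """
    best_cost, best_assign = None, None
    for choice in product((True, False), repeat=len(tensors)):
        # Capacity constraint: tensors placed in DRAM must fit.
        used = sum(t[1] for t, in_dram in zip(tensors, choice) if in_dram)
        if used > dram_capacity:
            continue
        # Objective: sum of profiled costs under this assignment.
        cost = sum(t[2] if in_dram else t[3]
                   for t, in_dram in zip(tensors, choice))
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_assign = {t[0]: ("DRAM" if in_dram else "PMM")
                           for t, in_dram in zip(tensors, choice)}
    return best_cost, best_assign


# Assumed example data: (name, size, cost in DRAM, cost in PMM).
tensors = [("a", 4, 1.0, 3.0), ("b", 3, 1.0, 5.0), ("c", 2, 1.0, 2.0)]
cost, placement = best_placement(tensors, dram_capacity=5)
print(cost, placement)
```

With these assumed numbers, the search keeps the two tensors that benefit most per byte in DRAM and leaves the largest one in PMM. Enumeration is exponential in the number of tensors, which is why a solver-based formulation is needed for real training graphs.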
The third part of this work generalizes memory management primitives. A generic heterogeneous memory management framework can be broken into three parts: the system (the entity responsible for managing data and metadata), the policy (the entity orchestrating the placement and movement of data), and the abstract runtime (the application or runtime that is actually using the data). The key insight is the modularity of this organization. CachedArrays, a policy/system package implemented in the Julia programming language, is built on this framework. Unlike AutoTM, CachedArrays works for applications with dynamic control flow, and it improves end-to-end convolutional neural network (CNN) training performance by up to 2× over the DRAM cache.
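The three-way split described above can be sketched in a few lines. This is an illustrative toy in Python, not the CachedArrays API (which is written in Julia); the class names `System` and `LRUPolicy`, the `will_use` hint, and the LRU eviction rule are all assumptions made for the example. The point is only that the policy can be swapped without touching the code that tracks where data lives.

```python
from collections import OrderedDict


class System:
    """Tracks which tier each object lives in (hypothetical toy)."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = {}   # name -> size, objects in the fast tier
        self.slow = {}   # name -> size, objects in the slow tier

    def fast_used(self):
        return sum(self.fast.values())

    def move(self, name, to_fast):
        src, dst = (self.slow, self.fast) if to_fast else (self.fast, self.slow)
        dst[name] = src.pop(name)


class LRUPolicy:
    """Decides placement and movement; storage details stay in System."""

    def __init__(self, system):
        self.system = system
        self.order = OrderedDict()  # fast-tier names, least recent first

    def allocate(self, name, size):
        self.system.slow[name] = size  # new objects start in the slow tier

    def will_use(self, name):
        s = self.system
        if name in s.fast:
            self.order.move_to_end(name)   # refresh recency
            return
        size = s.slow[name]
        if size > s.fast_capacity:
            return                         # can never fit; leave it slow
        while s.fast_used() + size > s.fast_capacity:
            victim, _ = self.order.popitem(last=False)
            s.move(victim, to_fast=False)  # evict least recently used
        s.move(name, to_fast=True)
        self.order[name] = True


# The "runtime": application code that only issues hints to the policy.
system = System(fast_capacity=10)
policy = LRUPolicy(system)
for name, size in [("x", 6), ("y", 4), ("z", 5)]:
    policy.allocate(name, size)
for name in ["x", "y", "z", "x"]:
    policy.will_use(name)
print(sorted(system.fast), sorted(system.slow))
```

Replacing `LRUPolicy` with a different class that exposes the same two methods would change placement behavior without any change to `System` or to the runtime loop.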
Finally, to demonstrate the generality of this framework, it is applied to gigabyte-scale embedding tables for large DLRM workloads. A performance analysis of the design space of embedding table lookup and update operations on Xeon CPUs is conducted. This leads to the implementation of CachedEmbeddings, an instance of the generic heterogeneous memory management framework optimized for small, random memory accesses. Using a high-performance DLRM implementation, CachedEmbeddings outperforms the DRAM cache for end-to-end DLRM training by up to 1.45× by using a modular, distribution-dependent policy.