The memory system poses many challenges in computer architecture and system design. One important challenge is the worsening hardware variability caused by nanometer-scale manufacturing difficulties. Variability particularly affects memory circuits and systems, which are essential to all aspects of computing, by degrading their reliability and energy efficiency. To address this challenge, this dissertation proposes Opportunistic Memory Systems in Presence of Hardware Variability. It describes a suite of techniques that opportunistically exploit memory variability for energy savings and cope with memory errors when they inevitably occur.
Part 1 describes three complementary projects that exploit memory variability for improved energy efficiency. The first two, ViPZonE and DPCS, study how to save energy in off-chip DRAM main memory and on-chip SRAM caches, respectively, without significantly impacting performance or manufacturing cost. ViPZonE is a novel extension to the virtual memory subsystem in Linux that leverages power variation-aware physical address zoning for energy savings. To save overall energy, the kernel allocates lower-power physical memory (when available) to tasks that access data frequently. Meanwhile, DPCS is a simple, low-overhead method for Dynamic Power/Capacity Scaling of SRAM-based caches. The key idea is that certain memory cells fail to retain data at a given low supply voltage; when full cache capacity is not needed, the voltage is opportunistically reduced and any failing cache blocks are disabled dynamically. The third project in Part 1 is X-Mem, a new extensible memory characterization tool. It is used in a series of case studies on a cloud server, including one that evaluates the potential benefits of variation-aware DRAM latency tuning.
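The DPCS key idea described above can be illustrated with a minimal sketch. It assumes a hypothetical per-block fault map that records the lowest supply voltage (in millivolts) at which each cache block still retains data, along with a small set of allowed voltage levels. The function names, the fault-map format, and the selection policy are illustrative assumptions, not the dissertation's actual mechanism.

```python
def usable_blocks(min_vdd_mv, vdd_mv):
    """Return indices of blocks that still retain data at supply voltage vdd_mv."""
    return [i for i, v in enumerate(min_vdd_mv) if v <= vdd_mv]


def pick_voltage(min_vdd_mv, levels_mv, blocks_needed):
    """Pick the lowest allowed voltage that leaves enough working blocks.

    Blocks that fail at the chosen voltage are treated as disabled.
    Falls back to the highest level if no lower level suffices.
    """
    for vdd in sorted(levels_mv):
        if len(usable_blocks(min_vdd_mv, vdd)) >= blocks_needed:
            return vdd
    return max(levels_mv)


# Hypothetical fault map for an 8-block cache (values are made up).
fault_map = [600, 700, 650, 900, 600, 750, 800, 600]
levels = [700, 800, 1000]

low_demand = pick_voltage(fault_map, levels, blocks_needed=4)
high_demand = pick_voltage(fault_map, levels, blocks_needed=8)
print(low_demand, usable_blocks(fault_map, low_demand))
print(high_demand, usable_blocks(fault_map, high_demand))
```

In this toy setup, a workload that needs only half of the cache can run at the lowest voltage level with the failing blocks disabled, whereas a workload that needs every block forces the highest level.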
Part 2 of the dissertation focuses on ways to opportunistically cope with memory errors whenever they occur. First, the Performability project studies the impact of corrected errors in memory on the performance of applications. The measurements and models can help improve the availability and performance consistency of cloud server infrastructure. Second, the novel idea of Software-Defined Error-Correcting Codes (SDECC) is proposed. SDECC opportunistically copes with detected-but-uncorrectable errors in main memory by combining concepts from coding theory with an architecture that allows for heuristic recovery. SDECC leverages available side information about the contents of data in memory to effectively increase the strength of ECC without introducing significant hardware overheads. Finally, a methodology is proposed to achieve Virtualization-Free Fault Tolerance (ViFFTo) for embedded scratchpad memories. ViFFTo guards against both hard and soft faults at minimal cost and is suitable for future IoT devices.
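The general idea behind heuristic recovery from a detected-but-uncorrectable error can be sketched on a toy code. The example below assumes an (8,4) extended Hamming (SECDED) code and a hypothetical side-information heuristic that prefers the candidate message most similar to a neighboring data word. The code size, the heuristic, and all function names are illustrative assumptions and do not reproduce the dissertation's design.

```python
from itertools import product


def encode(msg):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d1, d2, d3, d4 = msg
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = (p1, p2, d1, p3, d2, d3, d4)
    overall = sum(word) % 2  # overall parity bit extends Hamming(7,4)
    return word + (overall,)


def distance(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(x != y for x, y in zip(a, b))


def candidate_messages(received):
    """List messages whose codewords lie exactly 2 bit flips from received."""
    return [m for m in product((0, 1), repeat=4)
            if distance(encode(m), received) == 2]


def recover(received, neighbor):
    """Pick the candidate closest to a neighboring word (assumed side info)."""
    cands = candidate_messages(received)
    if not cands:
        return None
    return min(cands, key=lambda m: distance(m, neighbor))


# Inject a double-bit error, which SECDED detects but cannot correct.
original = (1, 0, 1, 1)
received = list(encode(original))
received[0] ^= 1
received[5] ^= 1
received = tuple(received)

print(candidate_messages(received))
print(recover(received, neighbor=original))
```

A double-bit error leaves several equally close candidate codewords, so the code alone cannot decide among them. Side information about likely data contents narrows the choice, although such a heuristic can still select the wrong candidate when the assumed similarity does not hold.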
Together, the six projects in this dissertation comprise a complementary suite of methods for opportunistically exploiting hardware variability for energy savings, while reducing the impact of errors that will inevitably occur. Opportunistic Memory Systems can significantly improve the energy efficiency and reliability of current and future computing systems. There remain several promising directions for future work.