
Scholarly Works (33 results)


Article
Peer Reviewed

Design Automation for Finite State Machine Predictors

Technical Reports (2000)

Finite State Machines (FSMs) are a fundamental building block in computer architecture and are used to control and optimize all types of prediction and speculation. These include branch prediction, confidence estimation, value prediction, memory disambiguation, thread speculation, and power optimization, among others. At the heart of almost all of these techniques is an FSM, such as a two-bit saturating counter, which predicts a sequence given feedback information. In this paper we present a framework for automatically designing FSM predictors. This approach can be used to develop FSM predictors that perform well over a suite of applications, or that are tailored to a specific application or even a specific instruction. We examine the ability to create FSM predictors optimized for a group of applications for branch prediction and for confidence estimation used in value prediction.
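As an illustration of the kind of FSM predictor the abstract mentions, here is a minimal sketch of a two-bit saturating counter used as a branch predictor. It is not taken from the report: the class name, the initial state, and the `accuracy` helper are assumptions chosen for this example.

```python
class TwoBitCounter:
    """Two-bit saturating counter: states 0..3, predicts 'taken' when state >= 2."""

    def __init__(self, state=1):
        # Starting in state 1 ("weakly not taken") is an arbitrary choice for this sketch.
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move toward 3 on a taken branch and toward 0 otherwise, saturating at the ends.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes):
    """Fraction of outcomes predicted correctly when the counter is trained online."""
    counter = TwoBitCounter()
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct / len(outcomes)
```

A single counter like this is a hand-designed, four-state FSM; the framework described in the abstract generates such machines automatically rather than fixing the transition table in advance.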

Pre-2018 CSE ID: CS2000-0656


Article
Peer Reviewed

Dynamic Selection of Compression Formats to Reduce Transfer Delay

Technical Reports (2000)

The computational paradigm of the Internet is such that applications are retrieved from remote sites and processed locally or are transferred for remote execution. Given the gap between processor and network speeds, mechanisms are needed to compensate for transfer time in order to maintain acceptable performance of mobile programs. Compression is used to reduce transfer delay by reducing the number of bytes transferred through the use of compact file encoding. In this paper, we examine two techniques for reducing compression-based transfer delay using Java as our platform for mobile code. We first examine the benefit from Selective Compression, a profile-directed optimization that combines and compresses only class files that are used during execution (as opposed to the entire application). Our results show that this approach reduces transfer delay by 11% to 13% on average across all compression techniques and networks studied. The second technique we examine is dynamic selection of compression formats based upon the underlying network connectivity. We consider compression-based transfer delay as the time required for transfer and decompression of files. We show that the compression format that achieves the least delay varies greatly with the network bandwidth available. Therefore, we propose to store mobile programs at the server in different compression formats. Dynamic Compression Format Selection (DCFS) is then used on the client to predict the compression format that will result in the least delay given the bandwidth predicted to be available when transfer occurs. Our results show that DCFS reduces compression-based transfer delay by 36% on average for the networks and wire-transfer formats studied. When combined with selective compression, we achieve a 47% average reduction in delay (a 60% reduction over the use of jar files).
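The selection rule described above (delay is transfer time plus decompression time, and the best format depends on bandwidth) can be sketched as follows. This is an illustrative model only, not the report's implementation; the function names, format names, and all sizes and timings are hypothetical.

```python
def transfer_delay(compressed_bytes, decompress_seconds, bandwidth_bytes_per_s):
    """Compression-based transfer delay: time to transfer plus time to decompress."""
    return compressed_bytes / bandwidth_bytes_per_s + decompress_seconds


def select_format(formats, bandwidth_bytes_per_s):
    """Return the name of the format with the least predicted delay.

    `formats` maps a format name to (compressed_bytes, decompress_seconds).
    """
    return min(
        formats,
        key=lambda name: transfer_delay(*formats[name], bandwidth_bytes_per_s),
    )


# Hypothetical formats: one decompresses quickly, the other compresses more tightly.
FORMATS = {
    "fast_light": (800_000, 0.1),
    "slow_dense": (400_000, 1.5),
}
```

With these made-up numbers, `select_format(FORMATS, 100_000)` picks `"slow_dense"` (5.5 s versus 8.1 s on the slow link), while `select_format(FORMATS, 10_000_000)` picks `"fast_light"` (0.18 s versus 1.54 s), which mirrors the abstract's observation that the best format varies with available bandwidth.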

Pre-2018 CSE ID: CS2000-0650


Article
Peer Reviewed

Patchable Instruction ROM Architecture

Technical Reports (2001)

Increased system-level integration has meant the movement of many traditionally off-chip components onto a single chip, including a processor, instruction storage, data path, and local memory. The design of these systems is driven by two conflicting goals: the need for reduced area and the need for rapid development times. The two current design options for instruction storage, ROM and Flash, are each highly optimized for one of these two goals but provide little compromise between them. ROM is used for highly area-optimized instruction memory to minimize area per instruction, although this comes at the price of lengthy integration time because of its need to be correct before the chip is sent for fabrication. Flash is an alternative instruction memory that can significantly reduce the time to market by allowing embedded software to be upgraded after fabrication, which means that software test and fabrication can be overlapped. Unfortunately, Flash takes more than twice the area of the equivalent ROM-based storage. In this paper we present the Patchable Instruction ROM as an architecture for instruction storage that can provide the best of both worlds: reduced area and faster time to market. With area efficiency similar to a standard ROM and support for limited post-fabrication software patching, Patchable Instruction ROM provides a new set of design points to consider when building embedded systems. For the programs we examine, we show that our hardware/software technique can achieve an area only 10% larger than ROM with only an 11% inflation in design time over a Flash-based approach.

Pre-2018 CSE ID: CS2001-0678


Article
Peer Reviewed

Using Annotations to Reduce Dynamic Optimization Time

Technical Reports (2000)

Dynamic compilation and optimization are widely used in heterogeneous computing environments and in environments requiring a virtual machine, where an intermediate form of the code is compiled to native code during execution. An important tradeoff exists between the amount of time spent dynamically optimizing the program and the running time of the program. The time it takes to perform dynamic optimizations can cause significant delays during execution and also negate some of the performance gains that result from a faster running program. In this research, we present an annotation framework that substantially reduces compilation overhead of Java programs. Annotations consist of analysis information collected off-line and incorporated into Java programs. The annotations are then used by dynamic compilers to guide optimization. Our annotations reduce compilation overhead incurred at all stages of compilation and optimization as well as enable complex optimizations to be performed dynamically. On average, the annotation optimizations reduce optimized compilation overhead by 78% and enable total time speedups of 7% for the programs examined.

Pre-2018 CSE ID: CS2000-0663


Article
Peer Reviewed

Reducing DRAM Power Using Compiler Assisted Refreshing

Technical Reports (2000)

The embedded market has always been a major source of income to the semiconductor market. As both general-purpose and embedded processors are moving towards mobile markets, different design criteria are becoming more important. The traditionally performance-driven field of processor design now has power issues to deal with. Typically there is a performance requirement, and low-power, low-cost solutions must be found. In this paper we investigate a software and hardware solution for reducing DRAM power. We propose to mark DRAM rows that have data that will not be read again, and then have the memory controller avoid refreshing those rows. To mark the rows with dead data, we propose adding a new instruction, freeNrows, to the instruction set architecture to communicate to the memory controller that N rows starting at the address provided should not be refreshed. If a store ever occurs to a non-refreshed row, then the memory controller will change the status of that row to refresh. For the heap memory, a custom allocation routine will be used to mark DRAM rows as non-refresh when an object is freed from memory. For global memory, compiler analysis can be used to find global data objects (including large arrays) that have part or all of their object dead upon leaving a region of code, and then a freeNrows instruction would be inserted to mark all those DRAM rows as non-refreshed. Our results show that on average 60% of the refreshes issued could be ignored without compromising correctness.
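The refresh-marking protocol described above can be modeled in a few lines. The following is a simplified sketch, not the report's design: it indexes rows directly instead of translating addresses, and the class and method names (`RefreshController`, `free_n_rows`, `store`, `rows_to_refresh`) are assumptions made for illustration.

```python
class RefreshController:
    """Toy model of a memory controller that skips refresh for rows marked dead."""

    def __init__(self, num_rows):
        # Every row starts out needing refresh.
        self.needs_refresh = [True] * num_rows

    def free_n_rows(self, start_row, n):
        # Models the effect of a freeNrows instruction: n rows from start_row are
        # marked non-refresh. Taking a row index rather than an address is a simplification.
        end = min(start_row + n, len(self.needs_refresh))
        for row in range(start_row, end):
            self.needs_refresh[row] = False

    def store(self, row):
        # A store to a non-refreshed row switches it back to refresh status.
        self.needs_refresh[row] = True

    def rows_to_refresh(self):
        # Number of rows the controller would refresh in one refresh pass.
        return sum(self.needs_refresh)
```

For example, with 8 rows, calling `free_n_rows(2, 4)` leaves 4 rows to refresh, and a subsequent `store(3)` brings the count back up to 5.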

Pre-2018 CSE ID: CS2000-0649


Article
Peer Reviewed

Software Profiling for Deterministic Replay Debugging of User Code

Technical Reports (2005)

Significant time is spent by companies trying to reproduce and fix bugs. We recently proposed a hardware logging approach called BugNet to aid debugging by capturing the last few million instructions that occurred right before a bug that results in a crash. A developer can then use this log to deterministically replay the recent portion of execution that led to the crash. We call this Deterministic Replay Debugging. In this paper, we present a software version of BugNet to be used by developers and quality assurance engineers to efficiently track down bugs. Our software approach does not require any hardware support, and the logs can be used to find bugs that result in a crash as well as those that cause wrong answers, instead of only focusing on bugs that cause crashes as in BugNet. The approach only logs the load values that have changed in order to provide deterministic replay across system calls, interrupts, and DMA transfers. In addition, we present implementation details for our deterministic replay debugger. This includes detailed analysis measuring exactly how much execution needs to be logged in order to make sure we have captured the cause of the bug, and quantifying the benefit of using dynamic slicing to aid our deterministic replay debugger.

Pre-2018 CSE ID: CS2005-0839


Article
Peer Reviewed

A Decoupled Predictor-Directed Stream Prefetching Architecture

Technical Reports (2001)

An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of hardware-based data prefetching, stream buffers, has been shown to be particularly effective due to its ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in the past, the applicability of streaming was limited to stride-intensive code. In this paper we propose Predictor-Directed Stream Buffers (PSB), which allow the stream buffer to follow a general address prediction stream instead of a fixed stride. A general address prediction stream complicates the allocation of both stream buffer and memory resources, because the predictions generated will not be as reliable as those of prior sequential next-line and stride-based stream buffer implementations. To address this, we examine using confidence-based techniques to guide the allocation and prioritization of stream buffers and their prefetch requests. Our results show that, on a benchmark suite heavy in pointer-based applications, PSB provides a 23% speedup on average over the best previous stream buffer implementation, and an improvement of 75% over using no prefetching at all.

Pre-2018 CSE ID: CS2001-0694


Article
Peer Reviewed

Structures for Phase Classification

Technical Reports (2003)

Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). Even so, programs tend to have repetitive behavior, where different parts of a program's execution behave in a similar manner. These similar intervals of execution can be grouped into phases, where the intervals in a phase have homogeneous behavior and similar resource requirements. This phase behavior can be exploited by tailoring architecture or compiler optimizations to a given phase, rather than to average or aggregate behavior as is typically done. In this paper, we compare using many different types of information for performing phase classification. The goal is to find the minimal amount of information that must be collected to accurately perform phase classification, and to do this without using architecture performance metrics. We compare using basic blocks, loop branches, procedures, opcode frequencies, register usage, register definitions, memory addresses, and working code and data set sizes. We also examine collecting this information in different data structures, from working set bit vectors to frequency vectors. We compare these different structures in terms of their ability to create homogeneous phases. We then evaluate the performance of using the more promising of these structures to guide SimPoint.

Pre-2018 CSE ID: CS2003-0772


Article
Peer Reviewed

Efficient Design Space Exploration for Customized Processors

Technical Reports (2001)

Customized processors offer the system developer rapidly designed logic specifically constructed to meet the performance and area demands of a given application. Recently, there have been several major projects that automate the process of transforming an optimal processor specification into an efficient layout for manufacturing. Missing from these efforts, however, is an automated approach to constructing the optimal specifications in the first place. In this paper we introduce an efficient, fully automated methodology for guiding the design and optimization of a custom processor. Our approach is to decompose the overall problem of choosing an optimal architecture into a set of sub-problems that are, to first order, independent. For each sub-problem, we create a model that relates performance to area. From this, we build a constraint system that can be solved using linear-integer programming techniques, and arrive at an optimal parameter selection for all architectural components. Using our approach, it takes only a few minutes to explore the entire architecture design space of a custom processor. We show that the expected performance using our model correlates strongly with detailed pipeline simulations, and present results showing design tradeoffs for several different benchmarks.

Pre-2018 CSE ID: CS2001-0688


Article
Peer Reviewed

Comparing Multinomial and K-Means Clustering for SimPoint

Technical Reports (2005)

SimPoint is a technique used to pick which parts of a program's execution to simulate in order to have a complete picture of execution. SimPoint uses data clustering algorithms from machine learning to automatically find repetitive (similar) patterns in a program's execution, and it chooses one sample to represent each unique repetitive behavior. These samples, when taken together, represent an accurate picture of the complete execution of the program. SimPoint is based on the k-means clustering algorithm, and recent work has proposed using a different clustering method based on multinomial models, but only provided a preliminary comparison and analysis. In this work we provide a detailed comparison of using k-means and multinomial clustering for SimPoint. We show that k-means performs better than the recently proposed multinomial clustering approach. We then propose two improvements to the prior multinomial clustering approach, in the areas of feature reduction and the picking of simulation points, which allow multinomial clustering to perform as well as k-means. We then conclude by examining how to potentially combine multinomial clustering with k-means.

Pre-2018 CSE ID: CS2005-0841
