Performance comparisons are ubiquitous in computer science. The proceedings of most conferences are filled with bar charts comparing the performance of one computer system to another: computer architects compare the performance of processors, and compiler writers compare the performance of generated code. Because it is difficult to prove that one computer system is always faster than another for all possible workloads, these performance comparisons are used as predictors: performance is compared on several representative workloads, and the results are used to argue that one computer system is generally faster than another. Unfortunately, there are many scenarios where it is difficult to make a fair performance comparison. This dissertation focuses on two such scenarios.

The first scenario involves simulation in computer architecture. Computer architects typically evaluate new processor designs through slow cycle-level simulation. Because of the poor performance of cycle-level simulators, accelerated simulation methodologies are popular: small samples of a program's behavior are simulated, and the results are extrapolated to predict the results of a whole-program simulation. With these accelerated simulation techniques, however, it is difficult to meaningfully compare performance estimates when multiple compilations of a program are involved. This dissertation will show that simulation samples must be selected consistently across compilations to produce comparable results, and it will present a technique for applying accelerated simulation consistently across compilations.

The second scenario involves dynamic optimization systems. Dynamic optimizers must predict whether their optimizations will actually improve performance before applying them: if an optimization is unlikely to improve performance, or is likely to degrade it, the optimization should not be applied. This dissertation presents a new approach that guides dynamic optimization decisions by performing empirical performance evaluations as programs execute. The performance of differently compiled versions of the same code is measured, and the results of those measurements directly guide optimization decisions. The challenge is that these measurements are collected as the program executes, so individual measurements are not directly comparable: over time, the program may run the code under analysis with different inputs. If a single pair of measurements indicates that one version of the code is faster than another, that version may indeed be faster, or the program may simply have run it on a smaller input. To overcome this challenge, this dissertation presents a statistical technique that analyzes pools of timing data to determine which version is fastest.
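To make the second scenario concrete, the sketch below shows one way pools of timing data could be compared statistically. It is an illustrative placeholder (a bootstrap confidence interval on the difference of mean timings), not the dissertation's actual technique; it assumes the two pools were collected under a comparable mix of inputs, and the version names and timing values are hypothetical.

```python
# Illustrative sketch only: compares two pools of per-invocation timings by
# bootstrapping a confidence interval on the difference of their means.
# This stands in for "a statistical technique that analyzes pools of timing
# data"; it is NOT the dissertation's method, and the data below is invented.
import random

def bootstrap_diff_ci(pool_a, pool_b, iterations=10_000, alpha=0.05):
    """Return a (1 - alpha) confidence interval for mean(pool_a) - mean(pool_b)."""
    diffs = []
    for _ in range(iterations):
        resample_a = [random.choice(pool_a) for _ in pool_a]
        resample_b = [random.choice(pool_b) for _ in pool_b]
        diffs.append(sum(resample_a) / len(resample_a) -
                     sum(resample_b) / len(resample_b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * iterations)]
    hi = diffs[int((1 - alpha / 2) * iterations) - 1]
    return lo, hi

# Hypothetical timing pools (seconds per invocation) for two compilations of
# the same code, gathered while the program ran with varying inputs.
version_a = [0.91, 1.10, 0.88, 1.35, 0.95, 1.02, 0.97, 1.20]
version_b = [1.05, 1.25, 0.99, 1.48, 1.07, 1.18, 1.11, 1.30]

low, high = bootstrap_diff_ci(version_a, version_b)
if high < 0:
    print("version A is faster (95% CI for the mean difference is below zero)")
elif low > 0:
    print("version B is faster (95% CI for the mean difference is above zero)")
else:
    print("no statistically clear winner; keep collecting measurements")
print(f"95% CI for mean(A) - mean(B): [{low:.3f}, {high:.3f}]")
```

A nonparametric test over the same pools could play the same role; the essential point is only that the optimization decision is drawn from distributions of measurements rather than from a single pair of timings.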