There is a growing gap between the peak speed of parallel computing systems and the actual delivered performance for scientific applications. In general this gap is caused by inadequate architectural support for the requirements of modern scientific applications, as commercial applications and the much larger market they represent, have driven the evolution of computer architectures. This gap has raised the importance of developing better benchmarking methodologies to characterize and to understand the performance requirements of scientific applications, to communicate them efficiently to influence the design of future computer architectures. This improved understanding of the performance behavior of scientific applications will allow improved performance predictions, development of adequate benchmarks for identification of hardware and application features that work well or poorly together, and a more systematic performance evaluation in procurement situations. The Berkeley Institute for Performance Studies has developed a three-level approach to evaluating the design of high end machines and the software that runs on them: 1) A suite of representative applications; 2) A set of application kernels; and 3) Benchmarks to measure key system parameters. The three levels yield different type of information, all of which are useful in evaluating systems, and enable NSF and DOE centers to select computer architectures more suited for scientific applications. The analysis will further allow the centers to engage vendors in discussion of strategies to alleviate the present architectural bottlenecks using quantitative information. These may include small hardware changes or larger ones that may be out interest to non-scientific workloads. Providing quantitative models to the vendors allows them to assess the benefits of technology alternatives using their own internal cost-models in the broader marketplace, ideally facilitating the development of future computer architectures more suited for scientific computations. The three levels also come with vastly different investments: the benchmarking efforts require significant rewriting to effectively use a given architecture, which is much more difficult on full applications than on smaller benchmarks.