Significant advances in genome sequencing over the past decade have produced a flood of genomic data that poses enormous computational challenges and requires new bioinformatics approaches. As the cost of sequencing has decreased and genomics has become an increasingly important tool for health and the environment, genomic data has grown exponentially, often requiring parallel computing on high-performance computing (HPC) systems. However, genomic applications are often characterized by irregular and unstructured computation and data layout, making them a challenging target for distributed-memory parallelism.
In this dissertation, we show that it is possible to productively write highly parallel code for irregular genomic computation using the appropriate abstraction. Genomic algorithms are often based on graph analysis and processing. For individual graph algorithms, it has previously been shown that graphs can be viewed as sparse matrices and their computations recast as a series of matrix operations. Here, we take this idea further by demonstrating its applicability and challenges for a data- and computationally intensive end-to-end application in genomics: de novo long-read genome assembly, in which an unknown genome is reconstructed from redundant and erroneous DNA sequences that are orders of magnitude shorter than the genome itself. Our main contribution is the design and development of a set of scalable distributed and parallel algorithms for de novo long-read genome assembly that can run on hundreds of nodes of an HPC system, reducing the runtime for mammalian genomes from days on a single processor to less than 20 minutes on a supercomputer. Our algorithms are presented as the Extreme-Scale Long-Read Berkeley Assembler (ELBA) pipeline, which encompasses the major phases of the overlap-layout-consensus paradigm that is most popular for long-read sequencing data. In ELBA, we view assembly through the lens of sparse linear algebra, where the core data structure is a sparse matrix. This dissertation paves the way for a highly productive paradigm for writing massively parallel codes for irregular and unstructured real-world computation.
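To make the graphs-as-sparse-matrices idea concrete, consider a minimal sketch (not taken from ELBA; the four-vertex toy graph and SciPy-based formulation are illustrative assumptions) in which each level of a breadth-first traversal is expressed as one sparse matrix-vector product:

```python
# Sketch: one BFS level = one sparse matrix-vector product (SpMV).
# The 4-vertex example graph is hypothetical, for illustration only.
import numpy as np
from scipy.sparse import csr_matrix

# Directed edge i -> j is stored as A[j, i], so multiplying A by a
# frontier vector advances the frontier by one hop.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
rows = [j for (i, j) in edges]
cols = [i for (i, j) in edges]
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(4, 4))

frontier = np.zeros(4)
frontier[0] = 1.0                 # start the traversal at vertex 0
visited = frontier.copy()

while frontier.any():
    frontier = A @ frontier       # one BFS level as an SpMV
    frontier[visited > 0] = 0.0   # mask out already-visited vertices
    visited += frontier

print(np.nonzero(visited)[0])     # vertices reachable from vertex 0
```

The same pattern generalizes: distributing the sparse matrix across processors turns a hard-to-partition graph traversal into well-studied parallel sparse linear algebra, which is the abstraction ELBA exploits at scale.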
ELBA is built for HPC systems with high-speed networks and batch scheduling. However, we recognize that not every research community has access to government or institutional supercomputing facilities with the necessary scale (e.g., hundreds of nodes) and hardware characteristics (e.g., a low-latency network) to realize the full potential of massively parallel algorithms such as those we have developed in this work. Thus, we believe that a long-term goal of HPC research is to democratize large-scale computing for science, not only through highly productive programming but also through widely accessible large-scale resources and systems. As a first step in demonstrating the applicability of the ideas presented in this dissertation to a cloud computing environment, we perform a benchmarking exercise comparing HPC and cloud systems. Our study shows that today's cloud systems can compete with traditional HPC systems, at least at moderate scales, owing to significant advances in networking technologies.