Tumors develop after the accumulation of passenger and deleterious driver mutations, including single nucleotide mutations and large scale copy number alterations (CNAs). With the advent of next generation sequencing, large consortia, such as The Cancer Genome Atlas and the International Cancer Genome Consortium, have sequenced thousands of tumor and healthy blood (matched normal) samples from cancer patients. These bulk sequencing studies have yielded an unprecedented amount of information on the molecular level, including showing recurrent molecular signatures across tissue types, as well as patient specific aberrations [1]. While this information can be utilized in clinical settings, the averaging effect of bulk sequencing can obscure rare mutations, which, if not taken into account in treatment, can lead to future relapses. Furthermore, because we cannot ethically obtain longitudinal samples from treatment naive solid tumors to observe how tumors grow and change over time, reconstructing tumor evolutionary phylogenies from samples taken at a single time point is an active area of research.
By treating a collection of tumor cells as individuals in a population, single cell sequencing (SCS) can reveal finer levels of detail of rare events in a tumor population, and allows the application of well studied population genetics methods for studying evolution. However, SCS techniques are very noisy due to low amounts of starting DNA. This noise can manifest as low and uneven coverage across the genome and as allelic dropout, making point mutation calling particularly difficult [2]. Although cell dissociation and isolation remain challenges, we aim to computationally address problems and biases specifically introduced by whole genome amplification (WGA), a necessary step because of the small amount of starting material in each cell. Despite these challenges, the number of reads falling in each window along the genome (read depth) can be used to study copy number events.
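As a minimal illustration of the read depth definition above, the following sketch counts reads per fixed-size genomic window. The function name, the use of read start positions, the window size, and the toy coordinates are all assumptions made for illustration; this is not the preprocessing pipeline used in this work.

```python
def windowed_read_depth(read_starts, window_size, genome_length):
    """Count reads whose start position falls in each fixed-size window.

    Hypothetical helper for illustration: positions are 0-based, and a read
    is assigned to exactly one window by its start coordinate.
    """
    n_windows = -(-genome_length // window_size)  # ceiling division
    depth = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:  # ignore reads outside the region
            depth[pos // window_size] += 1
    return depth


# Toy data (made up): seven reads on a 100 bp region, 25 bp windows.
toy_read_starts = [5, 12, 18, 40, 41, 47, 95]
toy_depth = windowed_read_depth(toy_read_starts, window_size=25, genome_length=100)
print(toy_depth)  # [3, 3, 0, 1]
```

In this toy example, the empty third window only reflects the made-up coordinates. In real single cell data, a drop in read depth across consecutive windows could reflect either a copy number loss or uneven amplification, which is why the noise must be modeled explicitly.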
Here, we present three methods to estimate copy number profiles and cell similarity. In Chapter 2, we present a novel method, SCONCE [3], that accounts for single cell whole genome sequencing noise and calls copy number events in single cells. SCONCE is based on a Hidden Markov Model that combines a continuous-time Markov process, which models the evolutionary history of a tumor, with a discrete process along the length of the genome, which estimates copy number alterations from changes in read depth. We show that SCONCE outperforms competing methods across a wide range of simulated and published real datasets [4,5].
In Chapter 3, we present SCONCE2 [6], an extension of SCONCE that jointly calls copy number events across multiple cells and estimates cell similarity. SCONCE2 uses pairs of cells to model evolutionary relationships and estimate joint copy number profiles. By summarizing these joint copy number profiles across multiple cell pairs, SCONCE2 more accurately detects breakpoints and copy number events. Furthermore, SCONCE2 introduces a novel cell similarity metric based on pairwise tree branch lengths, which can be used to estimate tumor phylogenies using neighbor-joining. Using a combination of public data [4,5] and simulations, we show that SCONCE2 calls copy number profiles, detects breakpoints, and estimates pairwise cell similarity more accurately than other methods, leading to tumor phylogeny estimates with less error.
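To make the neighbor-joining step concrete, the sketch below applies the textbook neighbor-joining algorithm to a small pairwise distance matrix. It is a generic illustration, not the SCONCE2 implementation: the cell names and distances are made up, and how pairwise branch lengths are converted into the entries of such a matrix is assumed rather than taken from this work.

```python
def neighbor_joining(names, dist):
    """Textbook neighbor-joining on a symmetric distance matrix.

    Returns the unrooted tree as a list of (node, node, branch_length) edges.
    """
    # Copy into a dict of dicts so the caller's matrix is left untouched.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] is the total distance from a to all current nodes.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair minimizing Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"internal{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new internal node to every remaining node.
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (a, b):
                d[u][k] = d[k][u] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    # Connect the final two nodes with a single edge.
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Toy, made-up pairwise distances between five hypothetical cells.
toy_names = ["cellA", "cellB", "cellC", "cellD", "cellE"]
toy_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for node_a, node_b, length in neighbor_joining(toy_names, toy_dist):
    print(node_a, node_b, length)
```

Because neighbor-joining needs only a matrix of pairwise distances, any pairwise cell similarity measure can in principle be supplied as input, so the accuracy of the resulting tree depends directly on the accuracy of those pairwise estimates.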
Finally, in Chapter 4, we present SCONCEmut, a further extension of SCONCE and SCONCE2 that utilizes genotype likelihoods. Although calling point mutations in single cell sequencing remains difficult due to noisy and low read depth, estimates of the number and types of shared and independent mutations between cells can be incorporated into branch length estimates. Using a range of simulated and real data, we explore a model that uses genotype likelihoods to jointly estimate mutation counts, copy number profiles, and tree branch lengths.
In this dissertation, we show SCONCE, SCONCE2, and SCONCEmut can be used to accurately call copy number events and study evolutionary relationships in single cell whole genome tumor sequencing data. When applied to additional datasets, investigators can gain further insight into and understanding of tumor evolution, potentially leading to more effective cancer detection and treatment protocols.