Scholarly Works (8 results)

Article
Peer Reviewed

Multi-locus match probability in a finite population: a fundamental difference between the Moran and Wright–Fisher models

UC Berkeley Previously Published Works (2009)

Motivation

A fundamental problem in population genetics, which is also of importance to forensic science, is to compute the match probability (MP) that two individuals randomly chosen from a population have identical alleles at a collection of loci. At present, 11-13 unlinked autosomal microsatellite loci are typed for forensic use. In a finite population, the genealogical relationships of individuals can create statistical non-independence of alleles at unlinked loci. However, the so-called product rule, which is used in courts in the USA, computes the MP for multiple unlinked loci by assuming statistical independence, multiplying the one-locus MPs at those loci. Analytically testing the accuracy of the product rule for more than five loci has hitherto remained an open problem.
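
The product rule described above can be illustrated with a minimal sketch. It assumes haplotypic matching (one allele per locus per individual) and takes the one-locus MP to be the sum of squared allele frequencies, a simplification that ignores the finite-population effects the article studies. The function names and the allele frequencies are hypothetical.

```python
from math import prod

def one_locus_match_probability(allele_freqs):
    """Probability that two independently drawn alleles at one locus are identical."""
    return sum(p * p for p in allele_freqs)

def product_rule_match_probability(loci):
    """Multiply one-locus match probabilities, assuming independence across loci."""
    return prod(one_locus_match_probability(freqs) for freqs in loci)

# Hypothetical allele frequencies at three unlinked loci (each list sums to 1).
example_loci = [
    [0.5, 0.5],
    [0.25, 0.25, 0.5],
    [0.1, 0.2, 0.3, 0.4],
]
product_rule_mp = product_rule_match_probability(example_loci)
```

The independence assumption enters only through the multiplication in `product_rule_match_probability`; the article's question is how far the true multi-locus MP in a finite population departs from this product.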

Results

In this article, we adopt a flexible graphical framework to compute multi-locus MPs analytically. We consider two standard models of random mating, namely the Wright-Fisher (WF) and Moran models. We succeed in computing haplotypic MPs for up to 10 loci in the WF model, and up to 13 loci in the Moran model. For a finite population and a large number of loci, we show that the MPs predicted by the product rule are highly sensitive to mutation rates in the range of interest, while the true MPs computed using our graphical framework are not. Furthermore, we show that the WF and Moran models may produce drastically different MPs for a finite population, and that this difference grows with the number of loci and mutation rates. Although the two models converge to the same coalescent or diffusion limit, in which the population size approaches infinity, we demonstrate that, when multiple loci are considered, the rate of convergence in the Moran model is significantly slower than that in the WF model.

Availability

A C++ implementation of the algorithms discussed in this article is available at http://www.cs.berkeley.edu/~yss/software.html.

Article
Peer Reviewed

Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data

UC Berkeley Previously Published Works (2014)

The sample frequency spectrum (SFS) is a widely-used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has been recently shown that very different population demographies can actually generate the same SFS for arbitrarily large sample sizes. Although in principle this nonidentifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not so biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise-defined where each piece belongs to some family of biologically-motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and also on the type of population size function in each piece. In the cases of piecewise-constant, piecewise-exponential and piecewise-generalized-exponential models, which are often assumed in population genomic inferences, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the "folded" SFS, which is often used when there is ambiguity as to which allelic type is ancestral. Our results are proved using a generalization of Descartes' rule of signs for polynomials to the Laplace transform of piecewise continuous functions.
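
As a small illustration of the summary statistic discussed above, the following sketch tabulates an unfolded SFS from per-site derived-allele counts and then folds it by pooling the counts for i and n - i. The function names and the input data are hypothetical, and the sketch assumes every site is segregating (counts between 1 and n - 1).

```python
from collections import Counter

def sample_frequency_spectrum(derived_counts, n):
    """Return [xi_1, ..., xi_{n-1}], where xi_i counts sites with i derived alleles."""
    tally = Counter(derived_counts)
    return [tally.get(i, 0) for i in range(1, n)]

def fold_spectrum(sfs):
    """Fold an unfolded SFS by pooling entries i and n - i (minor-allele counts)."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        # When i == n - i the entry is its own mirror image and is counted once.
        folded.append(sfs[i - 1] if i == j else sfs[i - 1] + sfs[j - 1])
    return folded

# Hypothetical data: derived-allele counts at 8 segregating sites, n = 6 sequences.
counts = [1, 1, 1, 2, 2, 3, 5, 5]
sfs = sample_frequency_spectrum(counts, 6)
folded = fold_spectrum(sfs)
```

Folding discards the ancestral/derived labelling, which is why the folded spectrum is the natural choice when the ancestral allele is ambiguous.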

Thesis
Peer Reviewed

Statistical, algorithmic, and robustness aspects of population demographic inference from genomic variation data

Bhaskar, Anand
Advisor(s): Song, Yun S

UC Berkeley Electronic Theses and Dissertations (2013)

The recent availability of large-sample high-throughput sequencing data has given us an unprecedented opportunity to very finely resolve the details of historical demographic processes that have shaped the genomes of modern human populations. Such understanding of population demography is important for several applications — to avoid false positives in genome-wide association studies; to calibrate null models of neutral genome evolution in order to find regions under selection; to study the impact of bottlenecks and small founder populations on genetic mutational load; to reconstruct large-scale historical human migration and admixture events; and so on.

In this dissertation, we consider some statistical, algorithmic and robustness aspects of demographic inference from genomic variation data. In particular, we study the problem of determining the historical effective size of a population from the sample frequency spectrum (SFS), which measures the distribution of allele frequencies in a sample of sequences drawn from the population.

From the statistical or information-theoretic perspective, it is known that this inverse problem does not have a unique solution in general, no matter how large the sample size. For any population allele frequency distribution, there exist infinitely many population size functions that are consistent with this distribution. While such a non-identifiability result might appear to pose a serious problem to statistical inference algorithms, we show that the situation is not so bad in practice. In particular, we prove that if the true population size function is piecewise-defined with each piece belonging to some family of biologically-motivated functions, then the SFS of a finite sample of sequences uniquely determines the underlying demography. We obtain a general bound on the sample size sufficient for identifiability; this bound depends on the number of pieces in the demographic model and on the family of functions for each piece. We also give concrete instantiations of this bound for piecewise-constant and piecewise-exponential models that are commonly used in demographic inference analyses.

From the algorithmic perspective, we build on analytic results for the expected SFS of a time-varying population size function and develop an efficient likelihood-based algorithm to infer piecewise-exponential population size histories from large sample allele frequency data. By considering very large samples, our method can resolve details of the population history from the very recent past that are not otherwise accessible using smaller samples.

The third aspect of this dissertation is concerned with understanding the robustness of widely used evolutionary models to violations of model assumptions. Continuous-time evolutionary models like Kingman's coalescent and its dual diffusion process are derived from discrete models of random mating by assuming that the sample size being analyzed is much smaller than the population size. However, the very large sample datasets being produced due to advances in high-throughput sequencing technologies are approaching the limits of this assumption. To investigate this issue, we develop exact algorithms for computation under the discrete-time Wright-Fisher model and use these algorithms to study the distortions in several genealogical quantities arising due to the coalescent approximation. Our findings indicate that for several demographic models inferred from large-scale sequence data, there can be substantial genealogical deviations introduced by the coalescent approximation that might influence the results of inference studies.

Article
Peer Reviewed

A novel spectral method for inferring general diploid selection from time series genetic data

UC Berkeley Previously Published Works (2014)

The increased availability of time series genetic variation data from experimental evolution studies and ancient DNA samples has created new opportunities to identify genomic regions under selective pressure and to estimate their associated fitness parameters. However, it is a challenging problem to compute the likelihood of non-neutral models for the population allele frequency dynamics, given the observed temporal DNA data. Here, we develop a novel spectral algorithm to analytically and efficiently integrate over all possible frequency trajectories between consecutive time points. This advance circumvents the limitations of existing methods which require fine-tuning the discretization of the population allele frequency space when numerically approximating requisite integrals. Furthermore, our method is flexible enough to handle general diploid models of selection where the heterozygote and homozygote fitness parameters can take any values, while previous methods focused on only a few restricted models of selection. We demonstrate the utility of our method on simulated data and also apply it to analyze ancient DNA data from genetic loci associated with coat coloration in horses. In contrast to previous studies, our exploration of the full fitness parameter space reveals that a heterozygote-advantage form of balancing selection may have been acting on these loci.
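
To make the notion of general diploid selection concrete, here is a minimal sketch of the textbook deterministic allele-frequency recursion with arbitrary genotype fitnesses. It is not the spectral method of the article and it ignores genetic drift; the function names and fitness values are hypothetical.

```python
def next_allele_frequency(p, w_AA, w_Aa, w_aa):
    """One generation of deterministic diploid selection on the frequency p of allele A."""
    q = 1.0 - p
    mean_fitness = p * p * w_AA + 2.0 * p * q * w_Aa + q * q * w_aa
    return (p * p * w_AA + p * q * w_Aa) / mean_fitness

def iterate_selection(p0, w_AA, w_Aa, w_aa, generations):
    """Return the frequency trajectory [p_0, p_1, ..., p_generations]."""
    trajectory = [p0]
    for _ in range(generations):
        trajectory.append(next_allele_frequency(trajectory[-1], w_AA, w_Aa, w_aa))
    return trajectory

# Hypothetical heterozygote-advantage fitnesses: the heterozygote is fittest.
w_AA, w_Aa, w_aa = 0.9, 1.0, 0.8
trajectory = iterate_selection(0.05, w_AA, w_Aa, w_aa, 500)

# Interior equilibrium of this recursion under heterozygote advantage.
p_equilibrium = (w_Aa - w_aa) / (2.0 * w_Aa - w_AA - w_aa)
```

Under heterozygote advantage the recursion settles at an intermediate frequency rather than fixing either allele, which is the qualitative signature of balancing selection mentioned in the abstract.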

Article
Peer Reviewed

Geometry of the Sample Frequency Spectrum and the Perils of Demographic Inference

UC Berkeley Previously Published Works (2018)

The sample frequency spectrum (SFS), which describes the distribution of mutant alleles in a sample of DNA sequences, is a widely used summary statistic in population genetics. The expected SFS has a strong dependence on the historical population demography and this property is exploited by popular statistical methods to infer complex demographic histories from DNA sequence data. Most, if not all, of these inference methods exhibit pathological behavior, however. Specifically, they often display runaway behavior in optimization, where the inferred population sizes and epoch durations can degenerate to zero or diverge to infinity, and show undesirable sensitivity to perturbations in the data. The goal of this article is to provide theoretical insights into why such problems arise. To this end, we characterize the geometry of the expected SFS for piecewise-constant demographies and use our results to show that the aforementioned pathological behavior of popular inference methods is intrinsic to the geometry of the expected SFS. We provide explicit descriptions and visualizations for a toy model, and generalize our intuition to arbitrary sample sizes using tools from convex and algebraic geometry. We also develop a universal characterization result which shows that the expected SFS of a sample of size n under an arbitrary population history can be recapitulated by a piecewise-constant demography with only [Formula: see text] epochs, where [Formula: see text] is between [Formula: see text] and [Formula: see text]. The set of expected SFS for piecewise-constant demographies with fewer than [Formula: see text] epochs is open and nonconvex, which causes the above phenomena for inference from data.

Article
Peer Reviewed

Distortion of genealogical properties when the sample is very large

UC Berkeley Previously Published Works (2014)

Study sample sizes in human genetics are growing rapidly, and in due course it will become routine to analyze samples with hundreds of thousands, if not millions, of individuals. In addition to posing computational challenges, such large sample sizes call for carefully reexamining the theoretical foundation underlying commonly used analytical tools. Here, we study the accuracy of the coalescent, a central model for studying the ancestry of a sample of individuals. The coalescent arises as a limit of a large class of random mating models, and it is an accurate approximation to the original model provided that the population size is sufficiently larger than the sample size. We develop a method for performing exact computation in the discrete-time Wright-Fisher (DTWF) model and compare several key genealogical quantities of interest with the coalescent predictions. For recently inferred demographic scenarios, we find that there are a significant number of multiple- and simultaneous-merger events under the DTWF model, which are absent in the coalescent by construction. Furthermore, for large sample sizes, there are noticeable differences in the expected number of rare variants between the coalescent and the DTWF model. To balance the trade-off between accuracy and computational efficiency, we propose a hybrid algorithm that uses the DTWF model for the recent past and the coalescent for the more distant past. Our results demonstrate that the hybrid method with only a handful of generations of the DTWF model leads to a frequency spectrum that is quite close to the prediction of the full DTWF model.
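
The gap between the two models can be seen in a single generation. The sketch below assumes a haploid Wright-Fisher population of constant size N (a simplification of the demographic scenarios in the article) and compares the exact probability that n sampled lineages have n distinct parents with the coalescent approximation exp(-C(n,2)/N). It also computes the one-generation probability of any event other than "no merger" or "exactly one pairwise merger". All function names are hypothetical.

```python
from math import comb, exp

def wf_no_coalescence(n, N):
    """Exact probability that n lineages pick n distinct parents among N."""
    p = 1.0
    for i in range(1, n):
        p *= 1.0 - i / N
    return p

def wf_single_pairwise_merger(n, N):
    """Exactly one pair shares a parent; the remaining lineages have distinct parents."""
    p = comb(n, 2) / N
    for i in range(1, n - 1):
        p *= 1.0 - i / N
    return p

def wf_multiple_or_simultaneous(n, N):
    """Probability of a multiple merger or of simultaneous mergers in one generation."""
    return 1.0 - wf_no_coalescence(n, N) - wf_single_pairwise_merger(n, N)

def coalescent_no_coalescence(n, N):
    """Coalescent probability of no merger during 1/N coalescent time units."""
    return exp(-comb(n, 2) / N)

# Hypothetical sizes: the sample is a sizable fraction of the population.
n, N = 1000, 20000
exact = wf_no_coalescence(n, N)
approx = coalescent_no_coalescence(n, N)
complex_merger = wf_multiple_or_simultaneous(n, N)
```

When n is much smaller than N, `wf_multiple_or_simultaneous` is negligible and the two no-merger probabilities nearly agree; as n grows relative to N, the events that the coalescent excludes by construction become common.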

Article
Peer Reviewed

Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data

UC Berkeley Previously Published Works (2015)

With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions.

Article
Peer Reviewed

Genetic profiles of 103,106 individuals in the Taiwan Biobank provide insights into the health and history of Han Chinese

UC San Francisco Previously Published Works (2021)

Personalized medical care focuses on prediction of disease risk and response to medications. To build the risk models, access to both large-scale genomic resources and human genetic studies is required. The Taiwan Biobank (TWB) has generated high-coverage, whole-genome sequencing data from 1492 individuals and genome-wide SNP data from 103,106 individuals of Han Chinese ancestry using custom SNP arrays. Principal components analysis of the genotyping data showed that the full range of Han Chinese genetic variation was found in the cohort. The arrays also include thousands of known functional variants, allowing for simultaneous ascertainment of Mendelian disease-causing mutations and variants that affect drug metabolism. We found that 21.2% of the population are mutation carriers of autosomal recessive diseases, 3.1% have mutations in cancer-predisposing genes, and 87.3% carry variants that affect drug response. We highlight how TWB data provide insight into both population history and disease burden, while showing how widespread genetic testing can be used to improve clinical care.
