The identification of individuals with recent hybrid ancestry (between populations or species) has been a goal of naturalists for centuries. Since the 1960s, codominant genetic markers have been used with statistical and computational methods to identify F1 hybrids and backcrosses. Existing hybrid inference methods assume that alleles at different loci undergo independent assortment (that is, the loci are unlinked or in population linkage equilibrium). Genomic datasets include thousands of markers that are located on the same chromosome and are in population linkage disequilibrium, which violates this assumption. Existing methods may therefore be viewed as composite likelihoods when applied to genomic datasets, and their performance in identifying hybrid ancestry (which is a model-choice problem) is unknown. Here we develop a new program, Mongrail, that implements a full-likelihood Bayesian hybrid inference method that explicitly models linkage and recombination, generating posterior probabilities for the different F1 hybrid, F2 hybrid, and backcross genealogical classes. We use simulations to compare the statistical performance of Mongrail with that of an existing composite-likelihood method (NewHybrids) and apply the method to genome sequence data from hybridizing species of barred and spotted owls.
Chapter 1 reviews the different types of hybrid inference methods in the literature from the 1960s to the present. The review traces the gradual development of these methods alongside advances in sequencing technologies. We discuss how the assumption of independence among loci (made by most existing methods) adversely affects the analysis of current genomic datasets.
In Chapter 2 we propose a diplotype-based hybridization model for two sympatric diploid populations under a two-generation pedigree. We present a novel way of calculating the exact likelihood of the SNP data under a model with recombination by using the physical distances between the markers. Calculating the likelihoods requires phased data and known population haplotype frequencies, but we also present alternatives for when these quantities are unknown. We obtain point estimates of the haplotype frequencies for the two populations from individuals who are unlikely to be hybrids. For a putative hybrid individual without phase information, we calculate the likelihood by summing over all compatible diplotypes.
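As a rough illustration of the last two ideas (point estimates of haplotype frequencies, and summing over compatible diplotypes when phase is unknown), the following Python sketch may help. It is not Mongrail's implementation. It covers only the pure-population and F1 classes, in which each of the individual's two haplotypes is drawn intact from a single population, so no recombination term is needed; the F2 and backcross classes, which require the recombination model, are not covered. The function names, the 0/1/2 genotype coding, and the pseudocount-smoothed frequency estimate are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def estimate_haplotype_freqs(haplotypes, n_sites, pseudocount=1.0):
    """Point estimate of haplotype frequencies from reference haplotypes.

    Assumption for this sketch: counts are smoothed with a pseudocount
    over all 2**n_sites possible haplotypes, so it only suits small n_sites.
    """
    counts = Counter(haplotypes)
    all_haps = list(product((0, 1), repeat=n_sites))
    total = len(haplotypes) + pseudocount * len(all_haps)
    return {h: (counts.get(h, 0) + pseudocount) / total for h in all_haps}


def compatible_diplotypes(genotype):
    """Yield every ordered haplotype pair consistent with an unphased genotype.

    The genotype is a sequence of alternate-allele counts (0, 1, or 2),
    one per SNP (a hypothetical coding chosen for this example).
    """
    per_site = []
    for g in genotype:
        if g == 0:
            per_site.append([(0, 0)])
        elif g == 2:
            per_site.append([(1, 1)])
        elif g == 1:
            per_site.append([(0, 1), (1, 0)])  # heterozygote: both phases
        else:
            raise ValueError(f"genotype code must be 0, 1, or 2, got {g!r}")
    for combo in product(*per_site):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        yield h1, h2


def unphased_likelihood(genotype, freqs_first, freqs_second):
    """P(genotype) when one haplotype comes from each frequency table.

    Passing the same table twice gives a pure-population individual;
    passing the two population tables gives an F1 hybrid.
    """
    return sum(
        freqs_first[h1] * freqs_second[h2]
        for h1, h2 in compatible_diplotypes(genotype)
    )


# Toy reference panels of individuals assumed not to be hybrids.
pop_a = [(0, 0), (0, 0), (0, 0), (0, 1)]
pop_b = [(1, 1), (1, 1), (1, 0), (1, 1)]
freqs_a = estimate_haplotype_freqs(pop_a, n_sites=2)
freqs_b = estimate_haplotype_freqs(pop_b, n_sites=2)

genotype = (1, 1)  # heterozygous at both SNPs, phase unknown
lik_pure_a = unphased_likelihood(genotype, freqs_a, freqs_a)
lik_pure_b = unphased_likelihood(genotype, freqs_b, freqs_b)
lik_f1 = unphased_likelihood(genotype, freqs_a, freqs_b)
```

With these toy panels, the doubly heterozygous genotype is better explained by the F1 class than by either pure class, because the common haplotypes in the two populations carry opposite alleles at both SNPs. Note that the number of compatible diplotypes doubles with each heterozygous site, so direct enumeration of this kind is only practical for short haplotype blocks.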
Chapter 3 describes the two major simulation study designs that were used to compare the two inference methods (NewHybrids and Mongrail) on datasets with linked markers. The first design (Comprehensive Simulation) involved generating a diverse set of haplotype frequency distributions, whereas the second design (Coalescent Simulation) used a structured coalescent model with recombination, allowing the statistical performance of the two methods to be evaluated under biologically realistic conditions. Chapter 4 presents a detailed summary of the simulation results for both study designs. We find that, in general, Mongrail is more effective than NewHybrids at distinguishing hybrids and backcrosses under both designs. One of the most noteworthy findings is that the number of chromosomes and the map length of each chromosome contribute more to power (to infer the correct genealogical class) than the number of markers. This outcome is advantageous because increasing the number of markers is more computationally challenging than increasing the number of chromosomes or the map length.
Chapter 5 presents the application of our method to a genomic dataset on spotted owls, barred owls, and their hybrids. We give a brief background on the dataset and describe the methods we used to analyze it. Mongrail was able to infer the genealogical classes for all putative hybrids with high posterior probability.
Finally, Chapter 6 briefly summarizes the findings from the two simulation study designs and the empirical analysis. We conclude with a discussion of the strengths and weaknesses of our method and directions for future research.