Journal club of one: ”Versatile simulations of admixture and accurate local ancestry inference with mixnmatch and ancestryinfer”

Admixture is the future of every sub-field of genetics, just in case you didn’t know. Both in the wild and domestic animals, populations or even species sometimes cross. This causes different patterns of relatedness than in well-mixed populations. Often we want to estimate ”local ancestry”, that is: what source population a piece of chromosome in an individual originates from. It is one of those genetics problems that is made harder by the absence of any way to observe it directly.

This recent paper (Schumer et al 2020; preprint version, which I read, here) presents a method for simulating admixed sequence data, and a method for inferring local ancestry from it. It does something I like, namely to pair analysis with fake-data simulation to check methods.

The simulation method is a built from four different simulators:

1. macs (Chen, Majoram & Wall 2009), which creates polymorphism data under neutral evolution from a given population history. They use macs to generate starting chromosomes from two ancestral populations.

2. Seq-Gen (Rambaut & Grassly 1997). Chromosomes from macs are strings of 0s and 1s representing the state at biallelic markers. If you want DNA-level realism, with base composition, nucleotide substitution models and so on, you need something else. I don’t really follow how they do this. You can tell from the source code that they use the local trees that macs spits out, which Seq-Gen can then simulate nucleotides from. As they put it, the resulting sequence ”lacks other complexities of real genome sequences such as repetitive elements and local variation in base composition”, but it is a step up from ”0000110100”.

3. SELAM (Corbett-Detig & Jones 2016), which simulates admixture between populations with population history and possibly selection. Here, SELAM‘s role is to simulate the actual recombination and interbreeding to create the patterns of local ancestry, that they will then fill with the sequences they generated before.

4. wgsim, which simulates short reads from a sequence. At this point, mixnmatch has turned a set of population genetic parameters into fasta files. That is pretty cool.

On the one hand, building on tried and true tools seems to be the right thing to do, less wheel-reinventing. It’s great that the phylogenetic simulator Seq-Gen from 1997 can be used in a paper published in 2020. On the other hand, looking at the dependencies for running mixnmatch made me a little pale: seven different bioinformatics or population genetics softwares (not including the dependencies you need to compile them), R, Perl and Python plus Biopython. Computational genetics is an adventure of software installation.

They use the simulator to test the performance of a hidden Markov model for inferring local ancestry (Corbett-Detig & Nielsen 2017) with different population histories and settings, and then apply it to swordtail fish data. In particular, one needs to set thresholds for picking ”ancestry informative” (i.e. sufficiently differentiated) markers between the ancestral populations, and that depends on population history and diversity.

In passing, they use the estimate the swordtail recombination landscape:

We used the locations of observed ancestry transitions in 139 F2 hybrids that we generated between X. birchmanni and X. malinche … to estimate the recombination rate in 5 Mb windows. … We compared inferred recombination rates in this F2 map to a linkage disequilibrium based recombination map for X. birchmanni that we had previously generated (Schumer et al., 2018). As expected, we observed a strong correlation in estimated recombination rate between the linkage disequilibrium based and crossover maps (R=0.82, Figure 4, Supporting Information 8). Simulations suggest that the observed correlation is consistent with the two recombination maps being indistinguishable, given the low resolution of the F2 map (Supporting Information 8).