Journal club of one: ”Versatile simulations of admixture and accurate local ancestry inference with mixnmatch and ancestryinfer”

Admixture is the future of every sub-field of genetics, just in case you didn’t know. Both in the wild and domestic animals, populations or even species sometimes cross. This causes different patterns of relatedness than in well-mixed populations. Often we want to estimate ”local ancestry”, that is: what source population a piece of chromosome in an individual originates from. It is one of those genetics problems that is made harder by the absence of any way to observe it directly.

This recent paper (Schumer et al 2020; preprint version, which I read, here) presents a method for simulating admixed sequence data, and a method for inferring local ancestry from it. It does something I like, namely to pair analysis with fake-data simulation to check methods.

The simulation method is built from four different simulators:

1. macs (Chen, Marjoram & Wall 2009), which creates polymorphism data under neutral evolution from a given population history. They use macs to generate starting chromosomes from two ancestral populations.

2. Seq-Gen (Rambaut & Grassly 1997). Chromosomes from macs are strings of 0s and 1s representing the state at biallelic markers. If you want DNA-level realism, with base composition, nucleotide substitution models and so on, you need something else. I don’t really follow how they do this. You can tell from the source code that they use the local trees that macs spits out, which Seq-Gen can then simulate nucleotides from. As they put it, the resulting sequence ”lacks other complexities of real genome sequences such as repetitive elements and local variation in base composition”, but it is a step up from ”0000110100”.

3. SELAM (Corbett-Detig & Jones 2016), which simulates admixture between populations with population history and possibly selection. Here, SELAM’s role is to simulate the actual recombination and interbreeding to create the patterns of local ancestry, which they will then fill with the sequences they generated before.

4. wgsim, which simulates short reads from a sequence. At this point, mixnmatch has turned a set of population genetic parameters into fasta files. That is pretty cool.
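To make the "fill ancestry tracts with sequence" step concrete, here is a toy sketch in Python. It is not how mixnmatch or SELAM work internally: `random_tracts` is a hypothetical stand-in that just alternates ancestry between random breakpoints, and `fill_mosaic` copies each tract from a made-up haplotype of the matching source population.

```python
import random


def random_tracts(length, n_breaks, rng):
    """Hypothetical stand-in for an admixture simulator:
    alternate ancestry between randomly placed breakpoints."""
    breaks = sorted(rng.sample(range(1, length), n_breaks))
    edges = [0] + breaks + [length]
    pop = rng.choice("AB")
    tracts = []
    for start, end in zip(edges, edges[1:]):
        tracts.append((start, end, pop))
        pop = "B" if pop == "A" else "A"
    return tracts


def fill_mosaic(tracts, haplotypes):
    """Copy each ancestry tract from the haplotype of its source population."""
    return "".join(haplotypes[pop][start:end] for start, end, pop in tracts)


rng = random.Random(1)
# Toy sequences, one per source population (real ones would come from macs + Seq-Gen)
haplotypes = {"A": "A" * 20, "B": "C" * 20}
tracts = random_tracts(20, 3, rng)
mosaic = fill_mosaic(tracts, haplotypes)
print(tracts)
print(mosaic)
```

The nice property is that the tract list is the true local ancestry, kept on the side, so anything inferred from the mosaic sequence can be checked against it.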

On the one hand, building on tried and true tools seems to be the right thing to do, less wheel-reinventing. It’s great that the phylogenetic simulator Seq-Gen from 1997 can be used in a paper published in 2020. On the other hand, looking at the dependencies for running mixnmatch made me a little pale: seven different bioinformatics or population genetics software packages (not including the dependencies you need to compile them), R, Perl and Python plus Biopython. Computational genetics is an adventure of software installation.

They use the simulator to test the performance of a hidden Markov model for inferring local ancestry (Corbett-Detig & Nielsen 2017) with different population histories and settings, and then apply it to swordtail fish data. In particular, one needs to set thresholds for picking ”ancestry informative” (i.e. sufficiently differentiated) markers between the ancestral populations, and that depends on population history and diversity.
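As a rough illustration of the two ingredients, marker filtering and an HMM, here is a minimal sketch. It is a deliberate simplification and not the Corbett-Detig & Nielsen model: it assumes a haploid chromosome, known alleles rather than read counts, two source populations, and a constant switch probability between adjacent markers. The function names, the `min_diff` threshold and the `error` parameter are all made up for the example.

```python
import math


def pick_informative(freq_a, freq_b, min_diff):
    """Indices of markers whose allele frequencies differ by at least min_diff."""
    return [i for i, (pa, pb) in enumerate(zip(freq_a, freq_b))
            if abs(pa - pb) >= min_diff]


def viterbi_ancestry(alleles, freq_a, freq_b, switch_prob=0.01, error=0.01):
    """Most likely ancestry path ('A' or 'B' per marker) for a haploid chromosome.

    alleles: 0/1 per marker; freq_a, freq_b: frequency of allele 1 in each source.
    """
    def emit(allele, freq):
        # Probability of seeing allele 1, allowing for a small error rate
        p1 = freq * (1 - error) + (1 - freq) * error
        return math.log(p1 if allele == 1 else 1 - p1)

    states = ("A", "B")
    freqs = {"A": freq_a, "B": freq_b}
    stay, switch = math.log(1 - switch_prob), math.log(switch_prob)

    score = {s: math.log(0.5) + emit(alleles[0], freqs[s][0]) for s in states}
    back = []
    for t in range(1, len(alleles)):
        new_score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda p: score[p] + (stay if p == s else switch))
            new_score[s] = (score[best] + (stay if best == s else switch)
                            + emit(alleles[t], freqs[s][t]))
            pointers[s] = best
        score = new_score
        back.append(pointers)

    # Trace back from the best final state
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


freq_a = [0.95, 0.9, 0.95, 0.5, 0.9, 0.95, 0.9]
freq_b = [0.05, 0.1, 0.05, 0.5, 0.1, 0.05, 0.1]
alleles = [1, 1, 1, 0, 0, 0, 0]

kept = pick_informative(freq_a, freq_b, min_diff=0.5)
path = viterbi_ancestry([alleles[i] for i in kept],
                        [freq_a[i] for i in kept],
                        [freq_b[i] for i in kept])
print(kept, path)
```

The threshold is where population history comes in: with closely related source populations, few markers pass a strict `min_diff`, and with a lenient one the emissions carry little information per marker.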

In passing, they use the method to estimate the swordtail recombination landscape:

We used the locations of observed ancestry transitions in 139 F2 hybrids that we generated between X. birchmanni and X. malinche … to estimate the recombination rate in 5 Mb windows. … We compared inferred recombination rates in this F2 map to a linkage disequilibrium based recombination map for X. birchmanni that we had previously generated (Schumer et al., 2018). As expected, we observed a strong correlation in estimated recombination rate between the linkage disequilibrium based and crossover maps (R=0.82, Figure 4, Supporting Information 8). Simulations suggest that the observed correlation is consistent with the two recombination maps being indistinguishable, given the low resolution of the F2 map (Supporting Information 8).
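The crossover-map idea in that passage can be sketched in a few lines. This is my reading, not the authors' code: count ancestry transitions per window, divide by the number of meioses observed, and convert to centimorgans per megabase. I assume each F2 individual contributes two meioses (one from each F1 parent), so 139 F2s would give 278. The function name and arguments are made up.

```python
def recombination_rate_cm_per_mb(transition_positions, chrom_length,
                                 n_meioses, window=5_000_000):
    """Crossover-based recombination rate (cM/Mb) in fixed windows.

    transition_positions: 0-based positions of ancestry transitions,
    pooled over all individuals, on one chromosome.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in transition_positions:
        counts[pos // window] += 1

    rates = []
    for i, count in enumerate(counts):
        start = i * window
        end = min(start + window, chrom_length)
        window_mb = (end - start) / 1e6
        # 1 Morgan = one expected crossover per meiosis, so
        # crossovers per meiosis times 100 gives centimorgans
        rates.append(100.0 * count / n_meioses / window_mb)
    return rates


# Toy example: three transitions observed across four meioses on a 10 Mb chromosome
print(recombination_rate_cm_per_mb([1_000_000, 2_000_000, 7_000_000],
                                   chrom_length=10_000_000, n_meioses=4))
```

With only a few hundred meioses, each 5 Mb window holds a small number of crossovers, which is why the resolution of such a map is low.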

Journal club of one: ”Genomic predictions for crossbred dairy cattle”

A lot of dairy cattle are crossbred, but genomic evaluation is often done within breed. What about the crossbred individuals? This paper (VanRaden et al. 2020) describes the US Council on Dairy Cattle Breeding’s crossbred genomic prediction that started in 2019.

In short, the method goes like this: They describe each crossbred individual in terms of its ”genomic breed composition”, get predictions for each of them based on models from all the breeds separately, and then combine the results in proportion to the genomic breed composition. The paper describes how they estimate the genomic breed composition, and how they evaluated accuracy by predicting held-out new data from older data.

The genomic breed composition is a delightfully elegant hack: They treat ”how much breed X is this animal” as a series of traits and run a genomic evaluation on them. The training set: individuals from sets of reference breeds with their trait value set to 100% for the breed they belong to and 0% for other breeds. ”Marker effects for GBC [genomic breed composition] were then estimated using the same software as for all other traits.” Neat. After some adjustment, they can be interpreted as breed percentages, called ”base breed representation”.
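Here is a toy version of the trick, to show how little is needed. The paper uses their regular genomic evaluation software; I swap in plain ridge regression on centred genotypes as a stand-in, with made-up genotypes for two hypothetical breeds and an arbitrary `ridge` value. Everything below is illustration, not their model.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fit_breed_trait(genotypes, breed_percent, ridge=1.0):
    """Ridge regression of a 0/100 'breed trait' on centred allele counts."""
    n, m = len(genotypes), len(genotypes[0])
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    trait_mean = sum(breed_percent) / n
    x = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]
    y = [v - trait_mean for v in breed_percent]
    xtx = [[sum(x[i][j] * x[i][k] for i in range(n)) + (ridge if j == k else 0.0)
            for k in range(m)] for j in range(m)]
    xty = [sum(x[i][j] * y[i] for i in range(n)) for j in range(m)]
    return col_means, trait_mean, solve(xtx, xty)


def predict(model, genotype):
    """Predicted breed percentage from marker effects and one genotype."""
    col_means, trait_mean, effects = model
    return trait_mean + sum(e * (g - mu)
                            for e, g, mu in zip(effects, genotype, col_means))


# Made-up reference animals: three from breed A, three from breed B (allele counts 0/1/2)
genotypes = [[2, 2, 0, 0], [2, 1, 0, 1], [1, 2, 0, 0],
             [0, 0, 2, 2], [0, 1, 2, 1], [1, 0, 2, 2]]
percent_a = [100, 100, 100, 0, 0, 0]  # the "trait": how much breed A is this animal

model = fit_breed_trait(genotypes, percent_a)
for animal in ([2, 2, 0, 0], [1, 1, 1, 1], [0, 0, 2, 2]):
    print(animal, round(predict(model, animal), 1))
```

Raw predictions like these are not guaranteed to stay between 0 and 100 or to sum to 100 over breeds, which is presumably why some adjustment is needed before they can be read as percentages.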

As they already run genomic evaluations for each breed, they can take these marker effects and the animal’s genotypes, and get one estimate for each breed. Then they combine them, weighting by the base breed representation.
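The weighting step is simple enough to write out. In this sketch, `blend` is the weighted sum described above. `to_bbr` is only my guess at the flavour of the adjustment (clamp to 0 to 100, rescale to sum to 100) and should not be read as the paper's actual procedure. The breed codes and numbers are made up.

```python
def to_bbr(raw_gbc):
    """Assumed adjustment: clamp raw breed composition to 0-100, rescale to sum to 100."""
    clamped = {breed: min(max(value, 0.0), 100.0) for breed, value in raw_gbc.items()}
    total = sum(clamped.values())
    if total == 0:
        raise ValueError("no breed has a positive composition estimate")
    return {breed: 100.0 * value / total for breed, value in clamped.items()}


def blend(breed_predictions, bbr):
    """Combine per-breed predictions, weighting each by its breed percentage."""
    return sum(bbr[breed] / 100.0 * breed_predictions[breed] for breed in bbr)


# Made-up raw composition and per-breed predictions for one crossbred animal
bbr = to_bbr({"HO": 55.0, "JE": 48.0, "BS": -3.0})
predictions = {"HO": 200.0, "JE": 100.0, "BS": 50.0}
print(bbr)
print(blend(predictions, bbr))
```

A breed with zero weight drops out entirely, so a purebred animal just gets its own breed's prediction back.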

Does it work? Yes, in the sense that it provides genomic estimates for animals that otherwise wouldn’t have any, and that it beats parent average estimates.

Accuracy of GPTA was higher than that of [parent average] for crossbred cows using truncated data from 2012 to predict later phenotypes in 2016 for all traits except productive life. Separate regressions for the 3 BBR categories of crossbreds suggest that the methods perform equally well at 50% BBR, 75% BBR, and 90% BBR.

They mention in passing comparing these estimates to estimates from a common set of marker effects for all breeds, but there is no detail about that model or how it compared in accuracy.

The discussion starts with this sentence:

More breeders now genotype their whole herds and may expect evaluations for all genotyped animals in the future.

That sounds like a reasonable expectation, doesn’t it? Before, all they could do with crossbred genotypes was to throw them away. There are lots of other things that might be possible with crossbred evaluation in the future (pulling in crossbred data into the evaluation itself, accounting for ancestry in different parts of the genome, estimating breed-of-origin of alleles, looking at dominance etc etc).

My favourite result in the paper is Table 8, which shows:

Example BBR for animals from different breeding systems are shown in Table 8. The HO cow from a 1964 control line had 1960s genetics from a University of Minnesota experimental selection project and a relatively low relationship to the current HO population because of changes in breed allele frequencies over the past half-century. The Danish JE cow has alleles that differ somewhat from the North American JE population. Other examples in the table show various breed crosses, and the example for an animal from a breed with no reference population shows that genetic contributions from some other breed may be evenly distributed among the included breeds so that BBR percentages sum to 100. These examples illustrate that GBC can be very effective at detecting significant percentages of DNA contributed by another breed.

Literature

VanRaden, P. M., et al. ”Genomic predictions for crossbred dairy cattle.” Journal of Dairy Science 103.2 (2020): 1620-1631.

Paper: ”Mixed ancestry and admixture in Kauai’s feral chickens: invasion of domestic genes into ancient Red Junglefowl reservoirs”

We have a new paper almost out (now in early view) in Molecular Ecology about the chickens on the Pacific island Kauai. These chickens are pretty famous for being everywhere on the island. Where do they come from? If you use your favourite search engine you’ll find an explanation with two possible origins: ancient wild birds brought over by the Polynesians and escaped domestic chickens. This post on Kauaiblog is great:

Hawaii’s official State bird is the Hawaiian Goose, or Nene, but on Kauai, everyone jokes that the “official” birds of the Garden Island are feral chickens, especially the wild roosters.

Wikepedia says the “mua” or red jungle fowl were brought to Kauai by the Polynesians as a source of food, thriving on an island where they have no real predators. /…/
Most locals agree that wild chickens proliferated after Hurricane Iniki ripped across Kauai in 1992, destroying chicken coops and releasing domesticated hens, and well as roosters being bred for cockfighting. Now these brilliantly feathered fowl inhabit every part of this tropical paradise, crowing at all hours of the day and night to the delight or dismay of tourists and locals alike.

In this paper, we look at phenotypes and genetics and find that this dual origin explanation is probably true.


(Chickens on Kauai. This is not from our paper, but by Jeff Trimble (cc:by-sa-nc) published on Flickr. There are so many pretty chicken pictures there!)

Dom, Eben, and Pamela went to Kauai to photograph, record, and collect DNA from the chickens. (I stayed at home and did sequence bioinformatics.) The Kauai chickens look and sound like a mixture of wild and domestic chickens. Some of them have the typical Junglefowl plumage, and others have flecks of white. Their crows vary in the length of the characteristic fourth syllable. Also, some of them have yellow legs, a trait that domestic chickens seem to have gotten not from the Red but from the Grey Junglefowl.

We looked at DNA sequences by massively parallel (SOLiD) sequencing of 23 individuals. We find mitochondrial sequences that fall in two haplogroups: E and D. The presence of the D haplogroup, which is the dominating one in ancient DNA sequences from the Pacific, means that there is a Pacific component to their ancestry. The E group, on the other hand, occurs in domestic chickens. It also shows up in some ancient DNA samples from the Pacific, but not from Kauai (and there is a scientific debate about these sequences). The nuclear genome analysis is pretty inconclusive. I think what we would need is some samples of possible domestic source populations (Where did the escapee chickens come from? Are there other traditional domestic sources?) and a better sampling of Red Junglefowl to make better sense of it.

When we take the plumage, vocalisation and mitochondrial DNA together, it looks like this is a feral admixed population of either Red Junglefowl or traditional Pacific chickens mixed with domestics. A very interesting population indeed.

Kenneth Chang wrote about the paper in the New York Times; the article includes quotes from Eben and Dom.

E Gering, M Johnsson, P Willis, T Getty, D Wright (2015) Mixed ancestry and admixture in Kauai’s feral chickens: invasion of domestic genes into ancient Red Junglefowl reservoirs. Molecular Ecology