Simulating shared segments between relatives

A few months ago I saw this nice figure from Amy Williams of the number of DNA segments that are expected to be shared between relatives. I thought it would be fun to simulate segment sharing with AlphaSimR.

Because DNA comes in chromosomes that don’t break up and recombine that much, the shared DNA between relatives tends to come in long chunks — segments that are identical by descent. The distribution of segment lengths can sometimes be used to tell apart relationships that would otherwise give the same average (e.g., Yengo et al. 2019, Qiao et al. 2021).

But let’s not do anything sophisticated. Instead, we take three very simple pedigrees — anyone who’s taken introductory genetics will recognize these ones — and look at relationships between full-sibs, half-sibs and cousins. We’ll also look at the inbred offspring of matings between full-sibs, half-sibs and cousins to see that the proportion that they share between their two copies of the genome lines up with the expected inbreeding.

There won’t be any direct comparison to the values that Williams’ simulation, because it simulated more distant relationships than this, starting with cousins and then moving further away. This is probably more interesting, especially for human genealogical genetics.

The code is on GitHub if you wants to follow along.

The pedigrees

Here are the three pedigrees, drawn with the kinship2 package:

A pedigree, here, is really a table of individuals, where each column tells us their identifier, their parents, and optionally their sex, like this:

id, mother, father, sex
1, NA, NA, M
2, NA, NA, F
3, NA, NA, M
4, 2, 1, F
5, 2, 1, M
6, NA, NA, F
7, 4, 3, M
8, 6, 5, F
9, 8, 7, F

We can use GeneticsPed to check the relatedness and inbreeding if we don’t trust that I’ve entered the pedigrees right.

library(GeneticsPed)
library(purrr)
library(readr)

ped_fullsib <- read_csv("pedigrees/inbreeding_fullsib.txt")
ped_halfsib <- read_csv("pedigrees/inbreeding_halfsib.txt")
ped_cousin <- read_csv("pedigrees/inbreeding_cousin.txt")

inbreeding_ped <- function(ped) {

    inbreeding(Pedigree(ped))

}

print(map(list(ped_fullsib, ped_halfsib, ped_cousin), inbreeding_ped))
[[1]]
   1    2    3    4    5 
0.00 0.00 0.00 0.00 0.25 

[[2]]
    1     2     3     4     5     6 
0.000 0.000 0.000 0.000 0.000 0.125 

[[3]]
     1      2      3      4      5      6      7      8      9 
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625

Comparing haplotypes

We need some functions to compare haplotypes and individuals:

library(AlphaSimR)
library(dplyr)
library(purrr)
library(tibble)

## Find shared segments between two haplotypes expressed as vectors
## map is a vector of marker positions

compare_haplotypes <- function(h1, h2, map) {
    sharing <- h1 == h2

    runs <- rle(sharing)
    end <- cumsum(runs$lengths)
    start <- c(1, end[-length(end)] + 1)

    segments <- tibble(start = start,
                       end = end,
                       start_pos = map[start],
                       end_pos = map[end],
                       segment_length = end_pos - start_pos,
                       value = runs$values)

    segments[segments$value,]
}

We will have haplotypes of the variants that go together on a chromosome, and we want to find segments that are shared between them. We do this with a logical vector that tests each variant for equality, and then use the rle to turn this into run-length encoding. We extract the start and end position of the runs and then keep only the runs of equality.

Building on that function, we want to find the shared segments on a chromosome between two individuals. That is, we make all the pairwise comparisons between the haplotypes they carry and combine them.

## Find shared segments between two individuals (expressed as
## matrices of haplotypes) for one chromosome

compare_individuals_chr <- function(ind1, ind2, map) {

    h1_1 <- as.vector(ind1[1,])
    h1_2 <- as.vector(ind1[2,])

    h2_1 <- as.vector(ind2[1,])
    h2_2 <- as.vector(ind2[2,])

    sharing1 <- compare_haplotypes(h1_1, h2_1, map)
    sharing2 <- compare_haplotypes(h1_1, h2_2, map)
    sharing3 <- compare_haplotypes(h1_2, h2_1, map)
    sharing4 <- compare_haplotypes(h1_2, h2_2, map)

    bind_rows(sharing1, sharing2, sharing3, sharing4)
}

Finally, we use that function to compare individuals along all the chromosomes.

This function takes in a population and simulation parameter object from AlphaSimR, and two target individuals to be compared.

We use AlphaSimR‘s pullIbdHaplo function to extract tracked founder haplotypes (see below) and then loop over chromosomes to apply the above comparison functions.

## Find shared segments between two target individuals in a
## population

compare_individuals <- function(pop,
                                target_individuals,
                                simparam) {

    n_chr <- simparam$nChr

    ind1_ix <- paste(target_individuals[1], c("_1", "_2"), sep = "")
    ind2_ix <- paste(target_individuals[2], c("_1", "_2"), sep = "")

    ibd <- pullIbdHaplo(pop,
                        simParam = simparam)

    map <- simparam$genMap
    loci_per_chr <- map_dbl(map, length)

    chr_ends <- cumsum(loci_per_chr)
    chr_starts <- c(1, chr_ends[-n_chr] + 1)

    results <- vector(mode = "list",
                      length = n_chr)

    for (chr_ix in 1:n_chr) {

        ind1 <- ibd[ind1_ix, chr_starts[chr_ix]:chr_ends[chr_ix]]
        ind2 <- ibd[ind2_ix, chr_starts[chr_ix]:chr_ends[chr_ix]]

        results[[chr_ix]] <- compare_individuals_chr(ind1, ind2, map[[chr_ix]])
        results[[chr_ix]]$chr <- chr_ix
    }

    bind_rows(results)
}

(You might think it would be more elegant, when looping over chromosomes, to pull out the identity-by-descent data for each chromosome at a time. This won’t work on version 1.0.4 though, because of a problem with pullIbdHaplo which has been fixed in the development version.)

We use an analogous function to compare the haplotypes carried by one individual. See the details on GitHub if you’re interested.

Building the simulation

We are ready to run our simulation: This code creates a few founder individuals that will initiate the pedigree, and sets up a basic simulation. The key simulation parameter is to set setTrackRec(TRUE) to turn on tracking of recombinations and founder haplotypes.

source("R/simulation_functions.R")

## Set up simulation

founders <- runMacs(nInd = 10,
                    nChr = 25)

simparam <- SimParam$new(founders)

simparam$setTrackRec(TRUE)

founderpop <- newPop(founders,
                     simParam = simparam)

To simulate a pedigree, we use pedigreeCross, a built-in function to simulate a given pedigree, and then apply our comparison functions to the resulting simulated population.

## Run the simulation for a pedigree one replicate

simulate_pedigree <- function(ped,
                              target_individuals,
                              focal_individual,
                              founderpop,
                              simparam) {
    pop <- pedigreeCross(founderPop = founderpop,
                         id = ped$id,
                         mother = ped$mother,
                         father = ped$father,
                         simParam = simparam)
    shared_parents <- compare_individuals(pop,
                                          target_individuals,
                                          simparam)
    shared_inbred <- compare_self(pop,
                                  focal_individual,
                                  simparam)
    list(population = pop,
         shared_segments_parents = shared_parents,
         shared_segments_self_inbred = shared_inbred)
}

Results

First we can check how large proportion of the genome of our inbred individuals is shared between their two haplotypes, averaged over 100 replicates. That is, how much of the genome is homozygous identical by descent — what is their genomic inbreeding? It lines up with the expectation form pedigree: 0.25 for the half-sib pedigree, close to 0.125 for the full-sib pedigree and close to 0.0625 for the cousin pedigree. The proportion shared by the parents is, as it should, about double that.

      case inbred_self_sharing parent_sharing
  full-sib        0.25 (0.052)    0.5 (0.041)
  half-sib        0.13 (0.038)   0.25 (0.029)
    cousin       0.064 (0.027)   0.13 (0.022)

Table of the mean proportion of genome shared between the two genome copies in inbred individuals and between their parents. Standard deviations in parentheses.

This is a nice consistency check, but not really what we wanted. The point of explicitly simulating chromosomes and recombinations is to look at segments, not just total sharing.

With a little counting and summarisation, we can plot the distributions of segment lengths. The horizontal axis is the length of the segments expressed in centimorgan. The vertical axis is the number of shared
segments of this length or longer. Each line is a replicate.

If we look at the summaries (table below), full-sibs share on average 74 segments greater than 1 cM in length, half-sibs 37, and cousins 29.

In real data, short segments might be harder to detect, but because we’re using simulated fake data, we don’t have to worry about phasing errors or false positive sharing.

If we look only at long segments (> 20 cM), full-sibs share on average 46 segments, half-sibs 23, and cousins 13. (Also, similar to Williams’ simulations, none of the cousins simulated here had less than five long segments shared.)

  case     `1 cM`   `10 cM`  `20 cM`  `30 cM`  `40 cM`  
  full-sib 74 (5.2) 60 (4.2) 46 (3.6) 34 (4)   24 (3.8) 
  half-sib 37 (3.4) 30 (3.1) 23 (3.3) 17 (2.8) 13 (2.6) 
  cousin   29 (3.8) 20 (3.3) 13 (3.2) 7.6 (2)  4.3 (1.8)

Table of the mean number of shared segments of different minimum length. Standard deviations in parentheses.

We an also look at the average length of the segments shared, and note that while full-sibs and half-sibs differ in the number of segments, and total segment length shared (above), the length of individual segments is about the same:

  case     mean_length_sd
  full-sib 0.33 (0.032)  
  half-sib 0.34 (0.042)  
  cousin   0.21 (0.03)

Table of the mean length shared segments. Standard deviations in parentheses.

Limitations

Williams’ simulation, using the ped-sim tool, had a more detailed model of recombination in the human genome, with different interference parameters for each chromosome, sex-specific recombination and so on. In that way, it is much more realistic.
We’re not modelling any one genome in particular, but a very generic genome. Each chromosome is 100 cM long for example; one can imagine that a genome with many short chromosomes would give a different distribution. This can be changed, though; the chromosome size is the easiest, if we just pick a species.

Literature

Yengo, L., Wray, N. R., & Visscher, P. M. (2019). Extreme inbreeding in a European ancestry sample from the contemporary UK population. Nature communications, 10(1), 1-11.

Qiao, Y., Sannerud, J. G., Basu-Roy, S., Hayward, C., & Williams, A. L. (2021). Distinguishing pedigree relationships via multi-way identity by descent sharing and sex-specific genetic maps. The American Journal of Human Genetics, 108(1), 68-83.

Estimating recent population history from linkage disequilibrium with GONE and SNeP

In this post, we will look at running two programs that infer population history — understood as changes in linkage disequilibrium over time — from genotype data. The post will chronicle running them on some simulated data; it will be light on theory, and light on methods evaluation.

Linkage disequilibrium, i.e. correlation between alleles at different genetic variants, breaks down over time when alleles are shuffled by recombination. The efficiency of that process depends on the distance between the variants (because variants close to each other on the same chromosome will recombine less often) and the population size (because more individuals means more recombinations). Those relationships mean that the strength of linkage disequilibrium at a particular distance between variants is related to the population size at a particular time. (Roughly, and assuming a lot.) There are several methods that make use of the relationship between effective population size, recombination rate and linkage disequilibrium to estimate population history.

The programs

The two programs we’ll look at are SNeP and GONE. They both first calculate different statistics of pairwise linkage disequilibrium between markers. SNeP groups pairs of markers into groups based on the distance between them, and estimates the effective population size for each group and how many generations ago each group represents. GONE goes further: it uses a formula for the expected linkage disequilibrium from a sequence of effective population sizes and a genetic algorithm to find such a sequence that fits the observed linkage disequilibrium at different distances.

Paper about GONE: Santiago, E., Novo, I., Pardiñas, A. F., Saura, M., Wang, J., & Caballero, A. (2020). Recent demographic history inferred by high-resolution analysis of linkage disequilibrium. Molecular Biology and Evolution, 37(12), 3642-3653.

Paper about SNeP: Barbato, M., Orozco-terWengel, P., Tapio, M., & Bruford, M. W. (2015). SNeP: a tool to estimate trends in recent effective population size trajectories using genome-wide SNP data. Frontiers in genetics, 6, 109.

These methods are suited for estimating recent population history in single closed populations. There are other methods, e.g. the Pairwise Markovian coealescent and methods based on Approximate Bayesian Computation, that try to reach further back in time or deal with connected populations.

(Humorously, barring one capitalisation difference, GONE shares it’s name with an unrelated program related to effective population sizes, GONe … There are not enough linkage disequilibrium puns to go around, I guess.)

Some fake data

First, let us generate some fake data to run the programs on. We will use the Markovian coalescent simulator MaCS inside AlphaSimR. That is, we won’t really make use of any feature of AlphaSimR except that it’s a convenient way to run MaCS.

There is a GitHub repo if you want to follow along.

We simulate a constant population, a population that decreased in size relatively recently, a population that increased in size recently, and a population that decreased in longer ago. The latter should be outside of what these methods can comfortably estimate. Finally, let’s also include a population that has migration from an other (unseen) population. Again, that should be a case these methods struggle with.

Simulated true population histories. Note that the horizontal axis is generations ago, meaning that if you read left to right, it runs backwards in time. This is typical when showing population histories like this, but can be confusing. Also not the different scales on the horizontal axis.

library(AlphaSimR)
library(purrr)
library(tibble)

## Population histories

recent_decrease <- tibble(generations = c(1, 50, 100, 150),
                          Ne = c(1000, 1500, 2000, 3000))

recent_increase <- tibble(generations = c(1, 50, 100, 150),
                          Ne = c(3000, 2000, 1500, 1000))

ancient_decrease <- tibble(generations = recent_decrease$generations + 500,
                           Ne = recent_decrease$Ne)

We can feed these population histories (almost) directly into AlphaSimR’s runMacs2 function. The migration case is a little bit more work because we will to modify the command, but AlphaSimR still helps us. MaCS takes a command line written using the same arguments as the paradigmatic ms program. The runMacs2 function we used above generates the MaCS command line for us; we can ask it to just return the command for us to modify. The split argument tells us that we want two populations that split 100 generations ago.

runMacs2(nInd = 100,
         Ne = recent_decrease$Ne[1],
         histGen = recent_decrease$generations[-1],
         histNe = recent_decrease$Ne[-1],
         split = 100,
         returnCommand = TRUE)

The resulting command looks like this:

"1e+08 -t 1e-04 -r 4e-05 -I 2 100 100  -eN 0.0125 1.5 -eN 0.025 2 -eN 0.0375 3 -ej 0.025001 2 1"

The first part is the number of basepairs on the chromosome, -t flag is for the population mutation rate \theta = 4 N_e \mu, -r for the recombination rate (also multiplied by four times the effective population size). The -eN arguments change the population size, and the -ej argument is for splitting and joining populations.

We can check that these numbers make sense: The population mutation rate of 10-4 is the typical per nucleotide mutation rate of 2.5 × 10-8 multiplied by four times the final population size of 1000. The scaled recombination rate of 4 × 10-5 is typical per nucleotide recombination rate of 10-8 multiplied by the same.

The population size changes (-eN arguments) are of the format scaled time followed by a scaling of the final population size. Times are scaled by four times the final population size, again. This means that 0.0125 followed by 1.5 refers to that 4 × 0.0125 × 1000 = 50 generations ago population size was 1.5 times the final population size, namely 1500. And so on.

-I (I assume for ”island” as in the ”island model”) sets up the two populations, each with 100 individuals and a migration rate between them. We modify this by setting it to 200 for the first population (because we want diploid individuals, so we need double the number of chromosomes) and 0 for the other; that is, this ghost population will not be observed, only used to influence the first one by migration. Then we set the third parameter, that AlphaSimR has not used: the migration rate. Again, this is expressed as four times the final population size, so for a migration rate of 0.05 we put 200.

Now we can run all cases to generate samples of 100 individuals.

migration_command <- "1e+08 -t 1e-04 -r 4e-05 -I 2 200 0 200  -eN 0.0125 1.5 -eN 0.025 2 -eN 0.0375 3 -ej 0.025001 2 1"

pops <- list(pop_constant = runMacs2(nInd = 100,
                                     nChr = 5,
                                     histNe = NULL,
                                     histGen = NULL,
                                     Ne = 1000),
             
             pop_recent = runMacs2(nInd = 100,
                                   nChr = 5,
                                   Ne = recent_decrease$Ne[1],
                                   histGen = recent_decrease$generations[-1],
                                   histNe = recent_decrease$Ne[-1]),
             
             pop_increase = runMacs2(nInd = 100,
                                     nChr = 5,
                                     Ne = recent_increase$Ne[1],
                                     histGen = recent_increase$generations[-1],
                                     histNe = recent_increase$Ne[-1]),
             
             pop_ancient = runMacs2(nInd = 100,
                                    nChr = 5,
                                    Ne = ancient_decrease$Ne[1],
                                    histGen = ancient_decrease$generations[-1],
                                    histNe = ancient_decrease$Ne[-1]),
             
             pop_migration = runMacs(nInd = 100,
                                     nChr = 5,
                                     manualCommand = migration_command,
                                     manualGenLen = 1))

Both GONE and SNeP work with text files in Plink ped/map format. Look in the repository if you want to see the not too exciting details of how we extract 10 000 markers and save them to Plink format together with their positions.

Running GONE

GONE source code and binaries are found in their GitHub repository, which we can clone or simply download from the web. As I’m running on Linux, we will use the binaries in the Linux subfolder. Before doing anything else, we will have to set the permissions for each of the binaries stored in the PROGRAMMES directory, with chmod u+x to make them executable.

GONE consists of a set of binaries and a bash script that runs them in order. As the as the script assumes it’s always running from the directory where you put GONE and always writes the output files into the same directory, the easiest way to handle it is to duplicate the entire GONE directory for every run. This also duplicates the INPUT_PARAMETERS_FILE file that one would modify to change settings. Then, invoking GONE with default settings would be as simple as opening a terminal, moving to the working directory and running the script, feeding it the base name of the Plink file:

./script_GONE.sh pop_constant

Thus, we write a prep script like this that copies the entire folder, puts the data into it, and then runs GONE:

#!/bin/bash

## Gone needs all the content of the operating system-specific subfolder to be copied
## into a working directory to run from. Therefore we create the "gone" directory
## and copy in the Linux version of the software from the tools directory.

mkdir gone

cd gone

cp -r ../tools/GONE/Linux/* .

## Loop over all the cases and invoke the GONE runscript. Again, because GONE
## needs the data to be in the same directory, we copy the data files into the
## working directory.

for CASE in pop_constant pop_recent pop_ancient pop_migration pop_increase; do

  cp ../simulation/${CASE}.* .
  
  ./script_GONE.sh ${CASE}
  
done

GONE puts its output files (named with the prefixes Output_Ne_, Output_d2_ and OUTPUT_ followed by the base name of the input files) in the same directory. The most interesting is Output_Ne_ which contains the estimates and the all caps OUTPUT_ file that contains logging information about the run.

Estimates look like this:

Ne averages over 40 independent estimates.
Generation      Geometric_mean
1       1616.29
2       1616.29
3       1616.29
4       1616.29
5       1291.22
6       1221.75
7       1194.16
8       1157.95
...

And the log looks like this:

TOTAL NUMBER OF SNPs
10000


HARDY-WEINBERG DEVIATION
-0.009012       Hardy-Weinberg deviation (sample)
-0.003987       Hardy-Weinberg deviation (population)


CHROMOSOME 1
 NIND(real sample)=100
 NSNP=2000
 NSNP_calculations=2000
 NSNP_+2alleles=0
 NSNP_zeroes=0
 NSNP_monomorphic=0
 NIND_corrected=100.000000
 freq_MAF=0.005000
 F_dev_HW (sample)=-0.009017
 F_dev_HW (pop)=-0.003992
 Genetic distances available in map file
...

I will now discuss a couple of issues I ran into. Note, this should not be construed as any kind of criticism of the programs or their authors. Everyone is responsible for their own inability to properly read manuals; I just leave them here in case they are helpful to someone else.

  • If you forget to set the permissions of the binaries, the error message will look like this:/
DIVIDE .ped AND .map FILES IN CHROMOSOMES
./script_GONE.sh: line 96: ./PROGRAMMES/MANAGE_CHROMOSOMES2: Permission denied
RUNNING ANALYSIS OF CHROMOSOMES ...
cp: cannot stat 'chromosome*': No such file or directory
bash: ./PROGRAMMES/LD_SNP_REAL3: Permission denied
...
  • Whereas Plink allows various different kinds of allele symbols in .ped files, GONE does not like allele codes that don’t look like A, C, G or T. The error message for other symbols looks like this:
DIVIDE .ped AND .map FILES IN CHROMOSOMES
RUNNING ANALYSIS OF CHROMOSOMES ...
CHROMOSOME ANALYSES took 0 seconds
Running GONE
 Format error in file outfileLD
 Format error in file outfileLD
 Format error in file outfileLD
 Format error in file outfileLD
 Format error in file outfileLD
...

Running SNeP

SNeP is only available as binaries on its Sourceforge page. Again, I’m using Linux binaries, so I downloaded the Linux binary from there and put it into a tools folder. The binary can be run from any directory, controlling the settings with command line flags. This would run SNeP on one of our simulated datasets, using the Haldane mapping function and correction of linkage disequilibrium values for sample size; these seem to be reasonable defaults:

./SNeP1.1 -ped simulation/pop_constant.ped -out snep/pop_constant -samplesize 2 -haldane

Thus, we write a run script like this:

#!/bin/bash

## As opposed to GONE, SNeP comes as one binary that can run from any directory. We still create
## a working directory to keep the output files in.

mkdir snep

## We loop over all cases, reading the data from the "simulation" directory,
## and directing the output to the "snep" directory. The settings are to
## correct r-squared for sample size using the factor 2, and to use the Haldane
## mapping function. We direct the output to a text file for logging purposes.

for CASE in pop_constant pop_recent pop_ancient pop_migration pop_increase; do

  ./tools/snep/SNeP1.1 \
    -ped simulation/${CASE}.ped \
    -out snep/${CASE} \
    -samplesize 2 \
    -haldane > snep/${CASE}_out.txt
  
done

SNeP creates two ouptut files with the given prefix, one with the extension .NeAll with the estimates a log file with the suffix SNeP.log file. Above, we also save the standard output to another log file.

Estimates look like this:

GenAgo	Ne	dist	r2	r2SD	items
13	593	3750544	0.0165248	0.0241242	37756
15	628	3272690	0.017411	0.0256172	34416
18	661	2843495	0.0184924	0.0281098	30785
20	681	2460406	0.0200596	0.0310288	27618
24	721	2117017	0.0214468	0.0313662	24898
...

Issues I ran into:

  • There are two versions of SNeP on Sourceforge, version 1.1 and version 11.1. According to the readme, SNeP requires ”GCC 4.8.2 or newer”, which I guess is a way to say that it needs a recent enough version of GLIBC, the runtime library that includes the C++ standard library. SNeP 1.11 appears to depend on GLIBC 2.29, and because I have 2.27, I had to use SNeP version 1.1 instead. It might be possible that it doesn’t require the new version of glibc, and that one could build from source on my system — but the source is not available, so who knows; this is a problem with distributing software as binaries only.
  • You cannot specify just a file name for your output. It needs to be a directory followed by a file name; that is, ”snep_constant” is not okay, but ”./snep_constant” is. The error looks like this:
/tools/snep/SNeP1.1 -ped constant.ped -out snep_constant
*************************************
*                SNeP               *
*                v1.1               *
*       barbatom@cardiff.ac.uk      *
*************************************

Sat Dec 25 13:09:38 2021

The -out path provided gives an error, aborting.

The moment you’ve been waiting for

Let’s read the results and look at the estimates!

Estimates from GONE, with default settings, and SNeP, with reasonable setting used in the SNeP paper, applied to simulated data from different scenarios. Grey dots are estimates, and black lines the true simulated population history. The estimates go on for a while, but as per the GONE paper’s recommendations, we should not pay attention to estimates further back in time where these methods are not accurate. That is, we should probably concentrate on the first 100 generations or so.

Again, this isn’t a systematic methods evaluation, so this shouldn’t be taken as a criticism of the programs or methods. But we can note that in these circumstances, the estimates capture some features of the simulated population histories but gets other features quite wrong. GONE estimates a recent decrease in the scenario with a recent decrease, but not the further decreases before, and a recent increase when there was a recent increase, but overestimates its size by a few thousand. In the migration scenario, GONE shows the kind of artefact the authors tell us to expect, namely a drastic population size drop. Unexpectedly, though, it estimates a similar drastic drop in the scenario with constant population size. SNeP captures the recent decrease, but underestimates its size. In fact, SNeP estimates a similar shape in all scenarios, both for increased, decreased and constant populations.

The plotting code looks something like this (see GitHub for the details): we create the file names, read all the output files into the same data frame with map_dfr, label them by what case they belong to by joining with the data frame of labels (with inner_join from dplyr) and make a plot with ggplot2. The true_descriptions data frame contains the true population histories used to simulate the data.

library(ggplot2)
library(readr)

cases <- tibble(case = c("pop_constant",
                         "pop_recent",
                         "pop_ancient",
                         "pop_migration",
                         "pop_increase"),
                description = c("Constant",
                                "Recent decrease",
                                "Ancient decrease",
                                "Recent decrease with migration",
                                "Recent increase"))

snep_file_names <- paste("snep/", cases$case, ".NeAll", sep = "")
names(snep_file_names) <- cases$case

snep <- map_dfr(snep_file_names, read_tsv, .id = "case")


snep_descriptions <- inner_join(snep, cases)
snep_descriptions$description <- factor(snep_descriptions$description,
                                        levels = cases$description)

## Make both a plot of the entire range of estimates, and a plot of the
## first 200 generations, which is the region where estimates are expected
## to be of higher quality
plot_snep_unconstrained <- ggplot() +
  geom_point(aes(x = GenAgo, y = Ne),
             data = snep_descriptions,
             colour = "grey") +
  facet_wrap(~ description,
             scale = "free_y",
             ncol = 2) +
  geom_segment(aes(x = start,
                   y = Ne,
                   xend = end,
                   yend = Ne),
               data = true_descriptions) +
  theme_bw() +
  theme(panel.grid = element_blank(),
        strip.background = element_blank()) +
  xlab("Generations ago")

plot_snep <- plot_snep_unconstrained +
  coord_cartesian(xlim = c(0, 200), ylim = c(0, 3000))

När kartan inte stämmer med terrängen gäller terrängen

When the results of a method don’t agree with the parameters of simulated data, the problem can either lie with the method or with the simulated data. And in this case, coalescent simulation is known to have problems with linkage disequilibrium. Here is a figure (Fig A, of appendix 2) of Nelson et al. (2020) who study the problems with coalescent simulations over long regions (such as the ones we simulated here).

The problem occurs for variants that are far apart (e.g. at around 100 = 1 expected recombinations between loci), where there should still be linkage disequilibrium, whereas the coalescent simulator (in this figure, ”msprime (Hudson)”) gives too low linkage disequilibrium. When we try to estimate effective population size late in population history, the methods rely on linkage equilibrium between markers far apart, and if they have too low linkage disequilibrium, the estimated population size should be too large. This does not appear to be what is going on here, but there might be more subtle problems with the simulated linkage disequilibrium that fools these methods; we could try something like Nelson et al.’s hybrid method or a forward simulation instead.

Literature

Barbato, M., Orozco-terWengel, P., Tapio, M., & Bruford, M. W. (2015). SNeP: a tool to estimate trends in recent effective population size trajectories using genome-wide SNP data. Frontiers in genetics, 6, 109.

Nelson, D., Kelleher, J., Ragsdale, A. P., Moreau, C., McVean, G., & Gravel, S. (2020). Accounting for long-range correlations in genome-wide simulations of large cohorts. PLoS genetics, 16(5), e1008619.

Santiago, E., Novo, I., Pardiñas, A. F., Saura, M., Wang, J., & Caballero, A. (2020). Recent demographic history inferred by high-resolution analysis of linkage disequilibrium. Molecular Biology and Evolution, 37(12), 3642-3653.

Journal club of one: ”A unifying concept of animal breeding programmes”

The developers of the MoBPS breeding programme simulators have published three papers about it over the last years: one about the MoBPS R package (Pook et al. 2020), one about their web server MoBPSweb (Pook et al. 2021), and one that discusses the logic of the specification for breeding programmes they use in the web interface (Simianer et al. 2021). The idea is that this specification can be used to describe breeding programmes in a precise and interoperable manner. The latter — about breeding programme specification — reads as if the authors had a jolly good time thinking about what breeding and breeding programmes are; at least, I had such feelings while reading it. It is accompanied by an editorial by Simianer (2021) talking about what he thinks are the important research directions for animal breeding research. Using simulation to aid breeding programme design is one of them.

Defining and specifying breeding programmes

After defining a breeding programme in a narrow genetic sense as a process achieving genetic change (more about that later), Simianer et al. (2021) go on to define a specification of such a breeding programme — or more precisely, of a model of a breeding programme. They use the word ”definition” for both things, but they really talk about two different things: defining what ”a breeding programme” is and specifying a particular model of a particular breeding programme.

Here, I think it is helpful to think of Guest & Martin’s (2020) distinction, borrowed from psychology, between a specification of a model and an implementation of a model. A specification is a description of a model based in natural language or math. Breeding programme modelling often takes the shape of software packages, where the software implementation is the model and a specification is quite vague. Simianer et al. (2021) can be seen as a step towards a way of writing specifications, on a higher level in Guest & Martin’s hierarchy, of what such a breeding programme model should achieve.

They argue that such a specification (”formal description”) needs to be comprehensive, unambiguous and reproducible. They claim that this can be achieved with two parts: the breeding environment and the breeding structure.

The breeding environment includes:

  • founder populations,
  • quantitative genetic parameters for traits,
  • genetic architectures for traits,
  • economic values for traits in breeding goal,
  • description of genomic information,
  • description of breeding value estimation methods.

Thus, the ”formal” specification depends on a lot of information that is either unknowable in practice (genetic architecture), estimated with error (genetic parameters), hard to describe other than qualitatively (founder population) and dependent on particular software implementations and procedures (breeding value estimation). This illustrates the need to distinguish the map from the territory — to the extent that the specification is exact, it describes a model of a breeding programme, not a real breeding programme.

The other part of their specification is the graph-based model of breeding structure. I think this is their key new idea. The breeding structure, in their specification, consists of nodes that represent groups of elementary objects (I would say ”populations”) and edges that represent transformations that create new populations (such as selection or mating) or are a shorthand for repeating groups of edges and nodes.

The elementary objects could be individuals, but they also give the example of gametes and genes (I presume they mean in the sense of alleles) as possible elementary objects. One could also imagine groups of genetically identical individuals (”genotypes” in a plant breeding sense). Nodes contain a given number of individuals, and can also have a duration.

Edges are directed, and correspond to processes such as ageing, selection, reproduction, splitting or merging populations. They will carry attributes related to the transformation. Edges can have a time associated with them that it takes for the transformation to happen (e.g. for animals to gestate or grow to a particular age). Here is an example from Pook et al. (2021) of a breeding structure graph for dairy cattle:

If we ignore the red edges for now, we can follow the flow of reproduction (yellow edges) and selection (green edges): The part on the left is what is going on in the breeding company: cows (BC-Cows) reproduce with selected bulls (BC-SelectedBulls), and their offspring become the next generation of breeding company cows and bulls (BC-NextCows and BC-NextBulls). On the right is the operation of a farm, where semen from the breeding company is used to inseminate cows and heifers (heifer, cow-L1, cow-L2, cow-L3) to produce calfs (calf-h, calf-L1, calf-L2, calf-L3). Each cycle each of these groups, through a selection operation, give rise to the next group (heifers becoming cows, first lactation cows becoming second lactation cows etc).

Breeding loops vs breeding graphs vs breeding forms

Except for the edges that specify breeding operations, there is also a special meta-edge type, the repeat edge, that is used to simplify breeding graphs with repeated operations.

A useful edge class to describe breeding programmes that are composed of several breeding cycles is ”repeat.” It can be used to copy resulting nodes from one breeding cycle into the nodes of origin of the next cycle, assuming that exactly the same breeding activities are to be repeated in each cycle. The “repeat” edge has the attribute “number of repeats” which allows to determine the desired number of cycles.

In the MoBPSweb paper (Pook et al. 2020), they describe how it is implemented in MoBPS: Given a breeding specification in MoBPSweb JSON format, the simulator will generate a directed graph by copying the nodes on the breeding cycle as many times as is specified by the repeat number. In this way, repeat edges are eliminated to make the breeding graph acyclic.

The conversion of the breeding scheme itself is done by first detecting if the breeding scheme has any “Repeat” edges (Simianer et al. 2020), which are used to indicate that a given part of the breeding programme is carried out multiple times (breeding cycles). If that is the case, it will subsequently check which nodes can be generated without the use of any repeat. Next, all repeats that can be executed based on the already available nodes are executed by generating copies of all nodes between the node of origin and the target node of the repeat (including the node of origin and excluding the target node). Nodes generated via repeat are serial-numbered via “_1,” “_2” etc. to indicate the repeat number. This procedure is repeated until all repeat edges are resolved, leading to a breeding programme without any repeats remaining.

There are at least three ways to specify the breeding structures for breeding programme simulations: these breeding graphs, breeding loops (or more generally, specifying breeding in a programming language) and breeding forms (when you get a pre-defined breeding structure and are allowed fill in the numbers).

If I’m going to compare the graph specification to what I’m more familiar with, this is how you would create a breeding structure in AlphaSimR:

library(AlphaSimR)

## Breeding environment

founderpop <- runMacs(nInd = 100,
                      nChr = 20)
simparam <- SimParam$new(founderpop)
simparam$setSexes("yes_sys")
simparam$addTraitA(nQtlPerChr = 100)
simparam$setVarE(h2 = 0.3)

## Breeding structure

n_time_steps <- 10

populations <- vector(mode = "list",
                      length = n_time_steps + 1)

populations[[1]] <- newPop(founderpop,
                           simParam = simparam)

for (gen_ix in 2:(n_time_steps + 1)) {

  ## Breeding cycle happens here

}

In the AlphaSimR script, the action typically happens within the loop. You apply different functions on population objects to make your selection, move individuals between parts of the breeding programme, create offspring etc. That is, in the MoBPS breeding structure, populations are nodes and actions are edges. In AlphaSimR, populations are objects and operations are functions. In order to not have to copy paste your breeding code, you use the control flow structures of R to make a loop (or some functional equivalent). In MoBPS graph structure, in order to not have to create every node and edge manually, you use the Repeat edge.

Breeding graphs with many repeat edges with different times attached to them have the potential to be complicated, and the same is true of breeding loops. I would have to use both of them more to have an opinion about what is more or less intuitive.

Now that we’ve described what they do in the paper, let’s look at some complications.

Formal specifications only apply to idealised breeding programmes

The authors claim that their concept provides a formal breeding programme specification (in their words, ”formal description”) that can be fully understood and implemented by breeders. It seems like the specification fails to live up to this ambition, and it appears doubtful whether any type of specification can. This is because they do not distinguish between specifying a model of a breeding programme and specifying a real implementation of a breeding programme.

First, as mentioned above, the ”breeding environment” as described by them, contains information that can never be specified for any real population, such as the genetic architecture of complex traits.

Second, their breeding structure is described in terms of fixed numbers, which will never be precise due to mortality, conception rates, logistics and practical concerns. They note such fluctuations in population size as a limitation in the Discussion. To some extent, random mortality, reproductive success etc an be modelled by putting random distributions on various parameters. (I am not sure how easy this is to do in the MoBPS framework; it is certainly possible.) However, this adds uncertainty about what these hyperparameter should be and whether they are realistic.

Such complications would just be nit-picking if the authors had not suggested that their specification can be used to communicate breeding programmes between breeders and between breeders and authorities, such as when a breeding programme is seeking approval. They acknowledge that the authorities, for example in the EU, want more detailed information that are beyond the scope of their specification.

And the concept is not very formal in the first place

Despite the claimed formality, every class of object in the breeding structure is left open, with many possible actions and many possible attributes that are never fully defined.

It is somewhat ambiguous what is to be the ”formal” specification — it cannot be the description in the paper as it is not very formal or complete; it shouldn’t be the implementation in MoBPS and MoBPSweb, as the concept is claimed to be universal; maybe it is the JSON specification of the breeding structure and background as described in the MoBPSweb paper (Pook et al. 2020). The latter seems the best candidate for a well-defined formal way to specify breeding programme models, but then again, the JSON format appears not to have a published specification, and appears to contain implementation-specific details relating to MoBPS.

This also matters to the suggested use of the specification to communicate real breeding programme designs. What, precisely, is it that will be communicated? Are breed societies and authorities expected to communicate with breeding graphs, JSON files, or with verbal descriptions using their terms (e.g. ”breeding environment”, ”breeding structure”, their node names and parameters for breeding activities)?

There is almost never a need for a definition

As I mentioned before, the paper starts by asking what a breeding programme is. They refer to different descriptions of breeding programme design from textbooks, and a legal definition from EU regulation 2016/1012; article 2, paragraph 26, which goes:

‘breeding programme’ means a set of systematic actions, including recording, selection, breeding and exchange of breeding animals and their germinal products, designed and implemented to preserve or enhance desired phenotypic and/or genotypic characteristics in the target breeding population.

There seems to be some agreement that a breeding programme, in addition to being the management of reproduction of a domestic animal population, also is systematic and goal-directed activity. Despite these descriptions of breeding programmes and their shared similarity, the authors argue that there is no clear formal definition of what a breeding programme is, and that this would be useful to delineate and specify breeding programmes.

They define a breeding programme as an organised process that aims to change the genetic composition in a desired direction, from one group of individuals to a group of individuals at a later time. The breeding programme comprises those individuals and activities that contribute to this process. For example, crossbred individuals in a multiplier part of a terminal crossbreeding programme would be included to the extent that they contribute information to the breeding of nucleus animals.

We define a breeding programme as a structured, man-driven process in time that starts with a group of individuals X at time t_1 and leads to a group of individuals Y at time t_2 > t_1. The objective of a breeding programme is to transform the genetic characteristics of group X to group Y in a desired direction, and a breeding programme is characterized by the fact that the implemented actions aim at achieving this transformation.

They actually do not elaborate on what it means that a genetic change has direction, but since they want the definition to apply both to farm animal and conservation breeding programmes, the genetic goals could be formulated both in terms of changes in genetic values for traits and in terms of genetic relationships.

Under many circumstances, this is a reasonable proxy also for an economic target: The breeding structures and interventions considered in theoretical breeding programme designs can often be evaluated in terms of their effect on the response to selection, and if the response improves, so will the economic benefit. However, this definition seems a little unnecessary and narrow. If you wanted to, say, add a terminal crossbreeding step to the simulation and evaluate the performance in terms of the total profitability of the crossbreeding programme (that is, something that is outside of the breeding programme in the sense of the above definition), nothing is stopping you, and the idea is not in principle outside of the scope of animal breeding.

Finally, an interesting remark about efficiency

When discussing the possibility of using their concept to communicate breeding programmes to authorities when seeking approval the authors argue that authorities should not take efficiency of the breeding programme into account when they evaluate breeding programmes for approval. They state this point forcefully without explaining their reasoning:

It should be noted, though, that an evaluation of the efficiency of breeding programme is not, and should never be, a precondition for the approval of a breeding programme by the authorities.

This raises the question: why not? There might be both environmental, economical and animal ethical reasons to consider not approving breeding programmes that can be shown to make inefficient use of resources. Maybe such evaluation would be impractical — breeding programme analysis and simulation might have to be put on a firmer scientific grounding and be made more reproducible and transparent before we trust it to make such decisions — but efficiency does seem like an appropriate thing to want in a breeding scheme, also from the perspective of society and the authorities.

I am not advocating any particular new regulation for breeding programmes here, but I wonder where the ”should never” came from. This reads like a comment added to appease a reviewer — the passage is missing from the preprint version.

Literature

Pook, T., Schlather, M., & Simianer, H. (2020a). MoBPS-modular breeding program simulator. G3: Genes, Genomes, Genetics, 10(6), 1915-1918. https://academic.oup.com/g3journal/article/10/6/1915/6026363

Pook, T., Büttgen, L., Ganesan, A., Ha, N. T., & Simianer, H. (2021). MoBPSweb: A web-based framework to simulate and compare breeding programs. G3, 11(2), jkab023. https://academic.oup.com/g3journal/article/11/2/jkab023/6128572

Simianer, H., Büttgen, L., Ganesan, A., Ha, N. T., & Pook, T. (2021). A unifying concept of animal breeding programmes. Journal of Animal Breeding and Genetics, 138 (2), 137-150. https://onlinelibrary.wiley.com/doi/full/10.1111/jbg.12534

Simianer, H. (2021), Harvest Moon: Some personal thoughts on past and future directions in animal breeding research. J Anim Breed Genet, 138: 135-136. https://doi.org/10.1111/jbg.12538

Guest, O., & Martin, A. E. (2020). How computational modeling can force theory building in psychological science. Perspectives on Psychological Science, 1745691620970585. https://journals.sagepub.com/doi/full/10.1177/1745691620970585

Twin lambs with different fathers

I just learned that in sheep, lambs from the same litter pretty often have different fathers, if the ewe has mated with different males. Berry et al. (2020) looked at sheep flocks on Irland that used more than one ram, and:

Of the 539 pairs of twins included in the analysis, 160 (i.e. 30%) were sired by two different rams. Of the 137 sets of triplets included in the analysis, 73 (i.e. 53%) were sired by more than one ram. Of the nine sets of quadruplets, eight were sired by two rams with the remaining litter being mono‐paternal. The overall incidence of heteropaternal superfecundation among litters was therefore 35%. Given that the incidence of multiple births in these flocks was 65%, heteropaternal superfecundation is expected to be relatively common in sheep; this is especially true as all but two of the litter‐mates were polyzygotic.

They figured this out by looking at individuals genotyped on SNP chips with tens of thousands of SNPs, with both lambs and the potential parents genotyped, so there can’t be much uncertainty in the assignment. You don’t need that many genotyped markers to get a confident assignment, and they don’t have that many rams to choose from.

Time for some Mendelian inheritance

Let’s simulate a situation like this: We set up a population and a marker panel for genotyping, split them into ewes and rams, and make some lambs.

library(AlphaSimR)

founderpop <- runMacs(nInd = 105,
                      nChr = 10,
                      segSites = 100)

simparam <- SimParam$new(founderpop)
simparam$setGender("no")

simparam$addSnpChip(nSnpPerChr = 100)

parents <- newPop(founderpop,
                  simParam = simparam)

ewes <- parents[1:100]
rams <- parents[101:105]

lambs <- randCross2(females = ewes,
                    males = rams,
                    nCrosses = 100,
                    nProgeny = 2,
                    simParam = simparam)

Now, if we have the genotypes of a lamb and its mother, how do we know the father? In this paper, they use exclusion methods: They compared the genotypes from the offspring with the parents and used inheritance rules to exclude rams that can't be the father because if they were, the offspring couldn't have the genotypes it had. Such breaking of regular inheritance patterns would be a "Mendelian inconsistency". This is the simplest kind of parentage assignment; fancier methods will calculate the probabilities of different genotypes, and allow you to reconstruct unknown relationships.

We can do this in two ways:

  • ignore the ewe’s genotypes and look for opposite homozygotes between lamb and ram, which are impossible regardless of the mother’s genotype
  • use both the ewe’s and ram’s genotypes to look what lamb genotypes are possible from a cross between them; this adds a few more cases where we can exclude a ram even if the lamb is heterozygous

To do the first, we count the number of opposite homozygous markers. In this genotype coding, 0 and 2 are homozygotes, and 1 is a heterozygous marker.

opposite_homozygotes <- function(ram,
                                 lamb) {
    sum(lamb == 0 & ram == 2) +
        sum(lamb == 2 & ram == 0)

}

When we include the ewe's genotype, there are a few more possible cases. We could enumerate all of them, but here is some R code to generate them. We first get all possible gametes from each parent, we combine the gametes in all possible combinations, and that gives us the possible lamb genotypes at that marker. If the lamb does, in fact, not have any of those genotypes, we declare the marker inconsistent. Repeat for all markers.

## Generate the possible gametes from a genotype

possible_gametes <- function(genotype) {

    if (genotype == 0) {
        gametes <- 0
    } else if (genotype == 1) {
        gametes <- c(0, 1)
    } else if (genotype == 2) {
        gametes <- 1
    }

    gametes
}

## Generate the possible genotypes for an offspring from
## parent possible gametes

possible_genotypes <- function(father_gametes,
                               mother_gametes) {

    possible_combinations <- expand.grid(father_gametes, mother_gametes)
    resulting_genotypes <- rowSums(possible_combinations)
    unique(resulting_genotypes)
}

## Check offspring genotypes for consistency with parent genotypes

mendelian_inconsistency <- function(ewe,
                                    ram,
                                    lamb) {

    n_markers <- length(ewe)
    inconsistent <- logical(n_markers)

    for (marker_ix in 1:n_markers) {

        possible_lamb_genotypes <-
          possible_genotypes(possible_gametes(ewe[marker_ix]),
                             possible_gametes(ram[marker_ix]))

        inconsistent[marker_ix] <-
          !lamb[marker_ix] %in% possible_lamb_genotypes
    }

    sum(inconsistent)
}

(These functions assume that we have genotypes in vectors. The full code that extracts this information from the simulated data and repeats for all markers is on Gitbhub.)

Here is the outcome for a set of random lambs. The red dots point out the true fathers: because we have perfect genotype data simulated without errors, the true father always has 100% consistent markers.

If we compare how many markers are found inconsistent with the two methods, we get a pattern like this graph. Including the ewe’s genotypes lets us discover a lot more inconsistent markers, but in this case, with plentiful and error-free markers, it doesn’t make a difference.

Thresholds and errors

If I have any complaint with the paper, it’s that the parentage analysis isn’t really described in the methods. This is what it says:

Parentage testing using simple exclusion‐based approaches is determined by the proportion of opposing homozygotes in putative sire–offspring pairs.

/…/

Maternal verification was undertaken using the exclusion method (Double et al . 1997) comparing the genotype of the dam with that of her putative progeny and only validated dam–offspring pairs were retained. Genotypes of the mature rams in the flock were compared with all lambs born in that flock using the exclusion method.

(The reference is related to exclusion methods, but it’s describing how to calculate exclusion probabilities in a certain circumstance. That is, it’s part of a methodological conversation about exclusion methods, but doesn’t actually describe what they did.)

I don’t doubt that they did it well. Still, it would be interesting to know the details, because in the absence of perfect genotype data, they must have had some thresholds for error and some criterion for deciding which ram was right, even if it seemed obvious.

Literature

Berry, D. P., et al. ”Heteropaternal superfecundation frequently occurs in multiple‐bearing mob‐mated sheep.” Animal Genetics (2020).

Using R: 10 years with R

Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day:

library(qtl)

cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")

R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again.

The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!

Using R: Animal model with hglm and Stan (with Cholesky trick)

A few weeks ago I posted about fitting the quantitative genetic animal model with MCMCglmm and R-INLA. Since then, I listened to a talk by Lars Rönnegård, one of the creators of the hglm package, and this paper was published in GSE about animal models in Stan.

hglm

The hglm package fits hierarchical generalised linear models. That includes the animal model with pedigree or genomic relatedness. Hierarchical generalised linear models also allow you to model the dispersion of random effects, which lets you do tricks like variance QTL mapping (Rönnegård & Valdar 2011), breeding values for variances (Rönnegård et al. 2010) or genomic prediction models with predictors of marker variance (Mouresan, Selle & Rönnegård 2019). But let’s not get ahead of ourselves. How do we fit an animal model?

Here is the matrix formulation of the animal model that we skim through in every paper. It’s in this post because we will use the design matrix interface to hglm, which needs us to give it these matrices (this is not a paper, so we’re not legally obliged to include it):

\mathbf{y} = \mu + \mathbf{X} \mathbf{b} + \mathbf{Z} \mathbf{a} + \mathbf{e}

The terms are the the trait value, intercept, fixed coefficients and their design matrix, genetic coefficients and their design matrix, and the residual. The design matrix Z will contain one row and column for each individual, with a 1 to indicate its position in the phenotype table and pedigree and the rest zeros. If we sort our files, it’s an identity matrix.

The trick with the genetic coefficients is that they’re correlated, with a specific known correlation structure that we know from the pedigree (or in genomic models, from markers). It turns out (Lee, Nelder & Pawitan 2017, chapter 8) that you can change the Z matrix around so that it lets you fit the model with an identity covariance matrix, while still accounting for the correlations between relatives. You replace the random effects for relatedness with some transformed random effects that capture the same structure. One way to do this is with Cholesky decomposition.

\mathbf{Z_{fudged}} = \mathbf{Z_0} \mathbf{L}

As an example of what the Cholesky decomposition does, here is slice of the additive relationship matrix of 100 simulated individuals (the last generation of one replicate of these simulations) and the resulting matrix from Cholesky decomposition.

So instead of

\mathbf{a} \sim N(0, \mathbf{A} \sigma)

We can fit

\mathbf{a_{fudged}} \sim N(0, \mathbf{I} \sigma)

This lets us fit the animal model with hglm, by putting in a modified Z matrix.

Assuming we have data frames with a pedigree and a phenotype (like, again, from these simulations):

library(AGHmatrix)
library(hglm)

A  <- Amatrix(ped)

Z0  <- diag(1000)
L <- t(chol(A))
Z  <- Z0 %*% L
X <- model.matrix(~1, pheno)

model <- hglm(y = pheno$pheno,
              X = X,
              Z = Z,
              conv = 1e-8)

est_h2  <- model$varRanef / (model$varRanef + model$varFix)

(I found the recommendation to decrease the convergence criterion from the default for animal models in a YouTube video by Xia Chen.)

Stan

When we turn to Stan, we will meet the Cholesky trick again. Stan is a software for Markov Chain Monte Carlo, built to fit hierarchical linear models, and related high-dimensional models, more effectively than other sampling strategies (like Gibbs). rstan is a helpful package for running Stan from within R.

Nishio & Arakawa (2019) recently published a Stan script to fit an animal model, comparing Stan to a Gibbs sampler (and a related MCMC sampler that they also didn’t publish the code for). If we look into their Stan model code, they also do a Cholesky decomposition to be able to use an identity matrix for the variance.

First, they decompose the additive relationship matrix that the program takes in:

transformed data{
  matrix[K,K] LA;
  LA = cholesky_decompose(A);
}

And then, they express the model like this:

vector[N] mu;
vector[K] a;
a_decompose ~ normal(0, 1);
a = sigma_G * (LA * a_decompose);
mu = X * b + Z * a;
Y ~ normal(mu, sigma_R);

We can add this line to the generated quantities block of the Stan program to get heritability estimates directly:

real h2;
h2 = sigma_U / (sigma_U + sigma_E)

Here, we’ve saved their model to a stan file, and now we can run it from R:

pheno$scaled_pheno <- as.vector(scale(pheno$pheno))

model_stan <- stan(file = "nishio_arakawa.stan",
                   data = list(Y = pheno$scaled_pheno,
                               X = X,
                               A = A,
                               Z = Z0,
                               J = 1,
                               K = 1000,
                               N = 1000))

est_h2_stan <- summary(model_stan, pars = "h2")$summary

Important note that I always forget: It's important to scale your traits before you run this model. If not, the priors might be all wrong.

The last line pulls out the summary for the heritability parameter (that we added above). This gives us an estimate and an interval.

The paper also contains this entertaining passage about performance, which reads as if it was a response to a comment, actual or anticipated:

R language is highly extensible and provides a myriad of statistical and graphical techniques. However, R language has poor computation time compared to Fortran, which is especially well suited to numeric computation and scientific computing. In the present study, we developed the programs for GS and HMC in R but did not examine computation time; instead, we focused on examining the performance of estimating genetic parameters and breeding values.

Yes, two of their samplers (Gibbs and HMC) were written in R, but the one they end up advocating (and the one used above), is in Stan. Stan code gets translated into C++ and then compiled to machine code.

Stan with brms

If rstan lets us run Stan code from R and examine the output, brms lets us write down models in relatively straightforward R syntax. It’s like the MCMCglmm of the Stan world. We can fit an animal model with brms too, by directly plugging in the relationship matrix:

model_brms <- brm(scaled_pheno ~ 1 + (1|animal),
                  data = pheno,
                  family = gaussian(),
                  cov_ranef = list(animal = A),
                  chains = 4,
                  cores = 1,
                  iter = 2000)

Then, we can pull out the posterior samples for the variability, here expressed as standard errors, compute the heritability and then get the estimates (and interval, if we want):

posterior_brms <- posterior_samples(model_brms,
                                    pars = c("sd_animal", "sigma"))

h2_brms  <- posterior_brms[,1]^2 /
    (posterior_brms[,1]^2 + posterior_brms[,2]^2)

est_h2_brms <- mean(h2_brms)

(Code is on GitHub: both for the graphs above, and the models.)

Using R: Animal model with simulated data

Last week’s post just happened to use MCMCglmm as an example of an R package that can get confused by tibble-style data frames. To make that example, I simulated some pedigree and trait data. Just for fun, let’s look at the simulation code, and use MCMCglmm and AnimalINLA to get heritability estimates.

First, here is some AlphaSimR code that creates a small random mating population, and collects trait and pedigree:

library(AlphaSimR)

## Founder population
FOUNDERPOP <- runMacs(nInd = 100,
                      nChr = 20,
                      inbred = FALSE,
                      species = "GENERIC")

## Simulation parameters 
SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100,
                   mean = 100,
                   var = 10)
SIMPARAM$setGender("yes_sys")
SIMPARAM$setVarE(h2 = 0.3)
 
## Random mating for 9 more generations
generations <- vector(mode = "list", length = 10) 
generations[[1]] <- newPop(FOUNDERPOP,
                           simParam = SIMPARAM)


for (gen in 2:10) {

    generations[[gen]] <- randCross(generations[[gen - 1]],
                                    nCrosses = 10,
                                    nProgeny = 10,
                                    simParam = SIMPARAM)

}

## Put them all together
combined <- Reduce(c, generations)


## Extract phentoypes
pheno <- data.frame(animal = combined@id,
                    pheno = combined@pheno[,1])

## Extract pedigree
ped <- data.frame(id = combined@id,
                  dam = combined@mother,
                  sire =combined@father)
ped$dam[ped$dam == 0] <- NA
ped$sire[ped$sire == 0] <- NA

## Write out the files
write.csv(pheno,
          file = "sim_pheno.csv",
          row.names = FALSE,
          quote = FALSE)

write.csv(ped,
          file = "sim_ped.csv",
          row.names = FALSE,
          quote = FALSE)

In turn, we:

  1. Set up a founder population with a AlphaSimR’s generic livestock-like population history, and 20 chromosomes.
  2. Choose simulation parameters: we have an organism with separate sexes, a quantitative trait with an additive polygenic architecture, and we want an environmental variance to give us a heritability of 0.3.
  3. We store away the founders as the first generation, then run a loop to give us nine additional generations of random mating.
  4. Combine the resulting generations into one population.
  5. Extract phenotypes and pedigree into their own data frames.
  6. Optionally, save the latter data frames to files (for the last post).

Now that we have some data, we can fit a quantitative genetic pedigree model (”animal model”) to estimate genetic parameters. We’re going to try two methods to fit it: Markov Chain Monte Carlo and (the unfortunately named) Integrated Nested Laplace Approximation. MCMC explores the posterior distribution by sampling; I’m not sure where I heard it described as ”exploring a mountain by random teleportation”. INLA makes approximations to the posterior that can be integrated numerically; I guess it’s more like building a sculpture of the mountain.

First, a Gaussian animal model in MCMCglmm:

library(MCMCglmm)

## Gamma priors for variances
prior_gamma <- list(R = list(V = 1, nu = 1),
                    G = list(G1 = list(V = 1, nu = 1)))
    
## Fit the model
model_mcmc  <- MCMCglmm(scaled ~ 1,
                        random = ~ animal,
                        family = "gaussian",
                        prior = prior_gamma,
                        pedigree = ped,
                        data = pheno,
                        nitt = 100000,
                        burnin = 10000,
                        thin = 10)

## Calculate heritability for heritability from variance components
h2_mcmc_object  <- model_mcmc$VCV[, "animal"] /
    (model_mcmc$VCV[, "animal"] + model_mcmc$VCV[, "units"])
 
## Summarise results from that posterior
h2_mcmc  <- data.frame(mean = mean(h2_mcmc_object),
                       lower = quantile(h2_mcmc_object, 0.025),
                       upper = quantile(h2_mcmc_object, 0.975),
                       method = "MCMC",
                       stringsAsFactors = FALSE)

And here is a similar animal model in AnimalINLA:

library(AnimalINLA)

## Format pedigree to AnimalINLA's tastes
ped_inla <- ped
ped_inla$id  <- as.numeric(ped_inla$id)
ped_inla$dam  <- as.numeric(ped_inla$dam)
ped_inla$dam[is.na(ped_inla$dam)] <- 0
ped_inla$sire  <- as.numeric(ped_inla$sire)
ped_inla$sire[is.na(ped_inla$sire)] <- 0
    
## Turn to relationship matrix
A_inv <- compute.Ainverse(ped_inla)
    
## Fit the model
model_inla  <- animal.inla(response = scaled,
                           genetic = "animal",
                           Ainverse = A_inv,
                           type.data = "gaussian",
                           data = pheno,
                           verbose = TRUE)

## Pull out summaries from the model object
summary_inla  <- summary(model_inla)

## Summarise results
h2_inla  <- data.frame(mean = summary_inla$summary.hyperparam["Heritability", "mean"],
                       lower = summary_inla$summary.hyperparam["Heritability", "0.025quant"],
                       upper = summary_inla$summary.hyperparam["Heritability", "0.975quant"],
                       method = "INLA",
                       stringsAsFactors = FALSE)

If we wrap this all in a loop, we can see how the estimation methods do on replicate data (full script on GitHub). Here are estimates and intervals from ten replicates (black dots show the actual heritability in the first generation):

As you can see, the MCMC and INLA estimates agree pretty well and mostly hit the mark. In the one replicate dataset where they falter, they falter together.

‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

A few weeks ago, I gave a talk at the Edinburgh R users group EdinbR on the RAGE paper. Since this is an R meetup, the talk concentrated on the mechanics of genetic data simulation and with the paper as a case study. I showed off some of what Chris Gaynor’s AlphaSimR can do, and how we built on that to make the specifics of this simulation study. The slides are on the EdinbR Github.

Genetic simulation is useful for all kinds of things. Sure, they’re only as good as the theory that underpins them, but the willingness to try things out in simulations is one of the things I always liked about breeding research.

This is my description of the logic of genetic simulation: we think of the genome as a large table of genotypes, drawn from some distribution of allele frequencies.

To make an utterly minimal simulation, we could draw allele frequencies from some distribution (like a Beta distribution), and then draw the genotypes from a binomial distribution. Done!

However, there is a ton of nuance we would like to have: chromosomes, linkage between variants, sexes, mating, selection …

AlphaSimR addresses all of this, and allows you to throw individuals and populations around to build pretty complicated designs. Here is the small example simulation I used in the talk.


library(AlphaSimR)
library(ggplot2)

## Generate founder chromsomes

FOUNDERPOP <- runMacs(nInd = 1000,
                      nChr = 10,
                      segSites = 5000,
                      inbred = FALSE,
                      species = "GENERIC")


## Simulation parameters

SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100,
                   mean = 100,
                   var = 10)
SIMPARAM$addSnpChip(nSnpPerChr = 1000)
SIMPARAM$setGender("yes_sys")


## Founding population

pop <- newPop(FOUNDERPOP,
              simParam = SIMPARAM)

pop <- setPheno(pop,
                varE = 20,
                simParam = SIMPARAM)


## Breeding

print("Breeding")
breeding <- vector(length = 11, mode = "list")
breeding[[1]] <- pop

for (i in 2:11) {
    print(i)
    sires <- selectInd(pop = breeding[[i - 1]],
                       gender = "M",
                       nInd = 25,
                       trait = 1,
                       use = "pheno",
                       simParam = SIMPARAM)

    dams <- selectInd(pop = breeding[[i - 1]],
                      nInd = 500,
                      gender = "F",
                      trait = 1,
                      use = "pheno",
                      simParam = SIMPARAM)

    breeding[[i]] <- randCross2(males = sires,
                                females = dams,
                                nCrosses = 500,
                                nProgeny = 10,
                                simParam = SIMPARAM)
    breeding[[i]] <- setPheno(breeding[[i]],
                              varE = 20,
                              simParam = SIMPARAM)
}



## Look at genetic gain and shift in causative variant allele frequency

mean_g <- unlist(lapply(breeding, meanG))
sd_g <- sqrt(unlist(lapply(breeding, varG)))

plot_gain <- qplot(x = 1:11,
                   y = mean_g,
                   ymin = mean_g - sd_g,
                   ymax = mean_g + sd_g,
                   geom = "pointrange",
                   main = "Genetic mean and standard deviation",
                   xlab = "Generation", ylab = "Genetic mean")

start_geno <- pullQtlGeno(breeding[[1]], simParam = SIMPARAM)
start_freq <- colSums(start_geno)/(2 * nrow(start_geno))

end_geno <- pullQtlGeno(breeding[[11]], simParam = SIMPARAM)
end_freq <- colSums(end_geno)/(2 * nrow(end_geno))

plot_freq_before <- qplot(start_freq, main = "Causative variant frequency before") 
plot_freq_after <- qplot(end_freq, main = "Causative variant frequency after") 

This code builds a small livestock population, breeds it for ten generations, and looks at the resulting selection response in the form of a shift of the genetic mean, and the changes in the underlying distribution of causative variants. Here are the resulting plots:

What single step does with relationship

We had a journal club about the single step GBLUP method for genomic evaluation a few weeks ago. In this post, we’ll make a few graphs of how the single step method models relatedness between individuals.

Imagine you want to use genomic selection in a breeding program that already has a bunch of historical pedigree and trait information. You could use some so-called multistep evaluation that uses one model for the classical pedigree + trait quantitative genetics and one model for the genotype + trait genomic evaluation, and then mix the predictions from them together. Or you could use the single-step method, which combines pedigree, genotypes and traits into one model. It does this by combining the relationship estimates from pedigree and genotypes. That matrix can then go into your mixed model.

We’ll illustrate this with a tiny simulated population: five generations of 100 individuals per generation, where ten random pairings produce the next generation, with families of ten individuals. (The R code is on Github and uses AlphaSimR for simulation and AGHmatrix for matrices). Here is a heatmap of the pedigree-based additive relationship matrix for the population:

What do we see? In the lower left corner are the founders, and not knowing anything about their heritage, the matrix has them down as unrelated. The squares of high relatedness along the diagonal are the families in each generation. As we go upwards and to the right, relationship is building up.

Now, imagine the last generation of the population also has been genotyped with a SNP chip. Here is a heatmap of their genomic relationship matrix:

Genomic relationship is more detailed. We can still discern the ten families within the last generation, but no longer are all the siblings equally related to each other and to their ancestors. The genotyping helps track segregation within families, pointing out to us when relatives are more or less related than the average that we get from the pedigree.

Enter the single-step relationship matrix. The idea is to put in the genomic relationships for the genotyped individuals into the big pedigree-based relationship matrix, and then adjust the rest of the matrix to propagate that extra information we now have from the genotyped individuals to their ungenotyped relatives. Here is the resulting heatmap:

You can find the matrix equations in Legarra, Aguilar & Misztal (2009). The matrix, called H, is broken down into four partitions called H11, H12, H21, and H22. H22 is the part that pertains to the genotyped animals, and it’s equal to the genomic relationship matrix G (after some rescaling). The others are transformations of G and the corresponding parts of the additive relationship matrix, spreading G onto A.

To show what is going on, here is the difference between the additive relationship matrix and the single-step relationship matrix, with lines delineating the genotyped animals and breaking the matrix into the four partitions:

What do we see? In the top right corner, we have a lot of difference, where the genomic relationship matrix has been plugged in. Then, fading as we go from top to bottom and from right to left, we see the influence of the genomic relationship on relatives, diminishing the further we get from the genotyped individuals.

Literature

Legarra, Andres, I. Aguilar, and I. Misztal. ”A relationship matrix including full pedigree and genomic information.” Journal of dairy science 92.9 (2009): 4656-4663.