Simulating shared segments between relatives

A few months ago I saw this nice figure from Amy Williams of the number of DNA segments that are expected to be shared between relatives. I thought it would be fun to simulate segment sharing with AlphaSimR.

Because DNA comes in chromosomes that don’t break up and recombine that much, the shared DNA between relatives tends to come in long chunks — segments that are identical by descent. The distribution of segment lengths can sometimes be used to tell apart relationships that would otherwise give the same average (e.g., Yengo et al. 2019, Qiao et al. 2021).

But let’s not do anything sophisticated. Instead, we take three very simple pedigrees — anyone who’s taken introductory genetics will recognize these ones — and look at relationships between full-sibs, half-sibs and cousins. We’ll also look at the inbred offspring of matings between full-sibs, half-sibs and cousins to see that the proportion that they share between their two copies of the genome lines up with the expected inbreeding.

There won’t be any direct comparison to the values that Williams’ simulation, because it simulated more distant relationships than this, starting with cousins and then moving further away. This is probably more interesting, especially for human genealogical genetics.

The code is on GitHub if you wants to follow along.

The pedigrees

Here are the three pedigrees, drawn with the kinship2 package:

A pedigree, here, is really a table of individuals, where each column tells us their identifier, their parents, and optionally their sex, like this:

id, mother, father, sex
1, NA, NA, M
2, NA, NA, F
3, NA, NA, M
4, 2, 1, F
5, 2, 1, M
6, NA, NA, F
7, 4, 3, M
8, 6, 5, F
9, 8, 7, F

We can use GeneticsPed to check the relatedness and inbreeding if we don’t trust that I’ve entered the pedigrees right.

library(GeneticsPed)
library(purrr)
library(readr)

ped_fullsib <- read_csv("pedigrees/inbreeding_fullsib.txt")
ped_halfsib <- read_csv("pedigrees/inbreeding_halfsib.txt")
ped_cousin <- read_csv("pedigrees/inbreeding_cousin.txt")

inbreeding_ped <- function(ped) {

    inbreeding(Pedigree(ped))

}

print(map(list(ped_fullsib, ped_halfsib, ped_cousin), inbreeding_ped))
[[1]]
   1    2    3    4    5 
0.00 0.00 0.00 0.00 0.25 

[[2]]
    1     2     3     4     5     6 
0.000 0.000 0.000 0.000 0.000 0.125 

[[3]]
     1      2      3      4      5      6      7      8      9 
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625

Comparing haplotypes

We need some functions to compare haplotypes and individuals:

library(AlphaSimR)
library(dplyr)
library(purrr)
library(tibble)

## Find shared segments between two haplotypes expressed as vectors
## map is a vector of marker positions

compare_haplotypes <- function(h1, h2, map) {
    sharing <- h1 == h2

    runs <- rle(sharing)
    end <- cumsum(runs$lengths)
    start <- c(1, end[-length(end)] + 1)

    segments <- tibble(start = start,
                       end = end,
                       start_pos = map[start],
                       end_pos = map[end],
                       segment_length = end_pos - start_pos,
                       value = runs$values)

    segments[segments$value,]
}

We will have haplotypes of the variants that go together on a chromosome, and we want to find segments that are shared between them. We do this with a logical vector that tests each variant for equality, and then use the rle to turn this into run-length encoding. We extract the start and end position of the runs and then keep only the runs of equality.

Building on that function, we want to find the shared segments on a chromosome between two individuals. That is, we make all the pairwise comparisons between the haplotypes they carry and combine them.

## Find shared segments between two individuals (expressed as
## matrices of haplotypes) for one chromosome

compare_individuals_chr <- function(ind1, ind2, map) {

    h1_1 <- as.vector(ind1[1,])
    h1_2 <- as.vector(ind1[2,])

    h2_1 <- as.vector(ind2[1,])
    h2_2 <- as.vector(ind2[2,])

    sharing1 <- compare_haplotypes(h1_1, h2_1, map)
    sharing2 <- compare_haplotypes(h1_1, h2_2, map)
    sharing3 <- compare_haplotypes(h1_2, h2_1, map)
    sharing4 <- compare_haplotypes(h1_2, h2_2, map)

    bind_rows(sharing1, sharing2, sharing3, sharing4)
}

Finally, we use that function to compare individuals along all the chromosomes.

This function takes in a population and simulation parameter object from AlphaSimR, and two target individuals to be compared.

We use AlphaSimR‘s pullIbdHaplo function to extract tracked founder haplotypes (see below) and then loop over chromosomes to apply the above comparison functions.

## Find shared segments between two target individuals in a
## population

compare_individuals <- function(pop,
                                target_individuals,
                                simparam) {

    n_chr <- simparam$nChr

    ind1_ix <- paste(target_individuals[1], c("_1", "_2"), sep = "")
    ind2_ix <- paste(target_individuals[2], c("_1", "_2"), sep = "")

    ibd <- pullIbdHaplo(pop,
                        simParam = simparam)

    map <- simparam$genMap
    loci_per_chr <- map_dbl(map, length)

    chr_ends <- cumsum(loci_per_chr)
    chr_starts <- c(1, chr_ends[-n_chr] + 1)

    results <- vector(mode = "list",
                      length = n_chr)

    for (chr_ix in 1:n_chr) {

        ind1 <- ibd[ind1_ix, chr_starts[chr_ix]:chr_ends[chr_ix]]
        ind2 <- ibd[ind2_ix, chr_starts[chr_ix]:chr_ends[chr_ix]]

        results[[chr_ix]] <- compare_individuals_chr(ind1, ind2, map[[chr_ix]])
        results[[chr_ix]]$chr <- chr_ix
    }

    bind_rows(results)
}

(You might think it would be more elegant, when looping over chromosomes, to pull out the identity-by-descent data for each chromosome at a time. This won’t work on version 1.0.4 though, because of a problem with pullIbdHaplo which has been fixed in the development version.)

We use an analogous function to compare the haplotypes carried by one individual. See the details on GitHub if you’re interested.

Building the simulation

We are ready to run our simulation: This code creates a few founder individuals that will initiate the pedigree, and sets up a basic simulation. The key simulation parameter is to set setTrackRec(TRUE) to turn on tracking of recombinations and founder haplotypes.

source("R/simulation_functions.R")

## Set up simulation

founders <- runMacs(nInd = 10,
                    nChr = 25)

simparam <- SimParam$new(founders)

simparam$setTrackRec(TRUE)

founderpop <- newPop(founders,
                     simParam = simparam)

To simulate a pedigree, we use pedigreeCross, a built-in function to simulate a given pedigree, and then apply our comparison functions to the resulting simulated population.

## Run the simulation for a pedigree one replicate

simulate_pedigree <- function(ped,
                              target_individuals,
                              focal_individual,
                              founderpop,
                              simparam) {
    pop <- pedigreeCross(founderPop = founderpop,
                         id = ped$id,
                         mother = ped$mother,
                         father = ped$father,
                         simParam = simparam)
    shared_parents <- compare_individuals(pop,
                                          target_individuals,
                                          simparam)
    shared_inbred <- compare_self(pop,
                                  focal_individual,
                                  simparam)
    list(population = pop,
         shared_segments_parents = shared_parents,
         shared_segments_self_inbred = shared_inbred)
}

Results

First we can check how large proportion of the genome of our inbred individuals is shared between their two haplotypes, averaged over 100 replicates. That is, how much of the genome is homozygous identical by descent — what is their genomic inbreeding? It lines up with the expectation form pedigree: 0.25 for the half-sib pedigree, close to 0.125 for the full-sib pedigree and close to 0.0625 for the cousin pedigree. The proportion shared by the parents is, as it should, about double that.

      case inbred_self_sharing parent_sharing
  full-sib        0.25 (0.052)    0.5 (0.041)
  half-sib        0.13 (0.038)   0.25 (0.029)
    cousin       0.064 (0.027)   0.13 (0.022)

Table of the mean proportion of genome shared between the two genome copies in inbred individuals and between their parents. Standard deviations in parentheses.

This is a nice consistency check, but not really what we wanted. The point of explicitly simulating chromosomes and recombinations is to look at segments, not just total sharing.

With a little counting and summarisation, we can plot the distributions of segment lengths. The horizontal axis is the length of the segments expressed in centimorgan. The vertical axis is the number of shared
segments of this length or longer. Each line is a replicate.

If we look at the summaries (table below), full-sibs share on average 74 segments greater than 1 cM in length, half-sibs 37, and cousins 29.

In real data, short segments might be harder to detect, but because we’re using simulated fake data, we don’t have to worry about phasing errors or false positive sharing.

If we look only at long segments (> 20 cM), full-sibs share on average 46 segments, half-sibs 23, and cousins 13. (Also, similar to Williams’ simulations, none of the cousins simulated here had less than five long segments shared.)

  case     `1 cM`   `10 cM`  `20 cM`  `30 cM`  `40 cM`  
  full-sib 74 (5.2) 60 (4.2) 46 (3.6) 34 (4)   24 (3.8) 
  half-sib 37 (3.4) 30 (3.1) 23 (3.3) 17 (2.8) 13 (2.6) 
  cousin   29 (3.8) 20 (3.3) 13 (3.2) 7.6 (2)  4.3 (1.8)

Table of the mean number of shared segments of different minimum length. Standard deviations in parentheses.

We an also look at the average length of the segments shared, and note that while full-sibs and half-sibs differ in the number of segments, and total segment length shared (above), the length of individual segments is about the same:

  case     mean_length_sd
  full-sib 0.33 (0.032)  
  half-sib 0.34 (0.042)  
  cousin   0.21 (0.03)

Table of the mean length shared segments. Standard deviations in parentheses.

Limitations

Williams’ simulation, using the ped-sim tool, had a more detailed model of recombination in the human genome, with different interference parameters for each chromosome, sex-specific recombination and so on. In that way, it is much more realistic.
We’re not modelling any one genome in particular, but a very generic genome. Each chromosome is 100 cM long for example; one can imagine that a genome with many short chromosomes would give a different distribution. This can be changed, though; the chromosome size is the easiest, if we just pick a species.

Literature

Yengo, L., Wray, N. R., & Visscher, P. M. (2019). Extreme inbreeding in a European ancestry sample from the contemporary UK population. Nature communications, 10(1), 1-11.

Qiao, Y., Sannerud, J. G., Basu-Roy, S., Hayward, C., & Williams, A. L. (2021). Distinguishing pedigree relationships via multi-way identity by descent sharing and sex-specific genetic maps. The American Journal of Human Genetics, 108(1), 68-83.

What single step does with relationship

We had a journal club about the single step GBLUP method for genomic evaluation a few weeks ago. In this post, we’ll make a few graphs of how the single step method models relatedness between individuals.

Imagine you want to use genomic selection in a breeding program that already has a bunch of historical pedigree and trait information. You could use some so-called multistep evaluation that uses one model for the classical pedigree + trait quantitative genetics and one model for the genotype + trait genomic evaluation, and then mix the predictions from them together. Or you could use the single-step method, which combines pedigree, genotypes and traits into one model. It does this by combining the relationship estimates from pedigree and genotypes. That matrix can then go into your mixed model.

We’ll illustrate this with a tiny simulated population: five generations of 100 individuals per generation, where ten random pairings produce the next generation, with families of ten individuals. (The R code is on Github and uses AlphaSimR for simulation and AGHmatrix for matrices). Here is a heatmap of the pedigree-based additive relationship matrix for the population:

What do we see? In the lower left corner are the founders, and not knowing anything about their heritage, the matrix has them down as unrelated. The squares of high relatedness along the diagonal are the families in each generation. As we go upwards and to the right, relationship is building up.

Now, imagine the last generation of the population also has been genotyped with a SNP chip. Here is a heatmap of their genomic relationship matrix:

Genomic relationship is more detailed. We can still discern the ten families within the last generation, but no longer are all the siblings equally related to each other and to their ancestors. The genotyping helps track segregation within families, pointing out to us when relatives are more or less related than the average that we get from the pedigree.

Enter the single-step relationship matrix. The idea is to put in the genomic relationships for the genotyped individuals into the big pedigree-based relationship matrix, and then adjust the rest of the matrix to propagate that extra information we now have from the genotyped individuals to their ungenotyped relatives. Here is the resulting heatmap:

You can find the matrix equations in Legarra, Aguilar & Misztal (2009). The matrix, called H, is broken down into four partitions called H11, H12, H21, and H22. H22 is the part that pertains to the genotyped animals, and it’s equal to the genomic relationship matrix G (after some rescaling). The others are transformations of G and the corresponding parts of the additive relationship matrix, spreading G onto A.

To show what is going on, here is the difference between the additive relationship matrix and the single-step relationship matrix, with lines delineating the genotyped animals and breaking the matrix into the four partitions:

What do we see? In the top right corner, we have a lot of difference, where the genomic relationship matrix has been plugged in. Then, fading as we go from top to bottom and from right to left, we see the influence of the genomic relationship on relatives, diminishing the further we get from the genotyped individuals.

Literature

Legarra, Andres, I. Aguilar, and I. Misztal. ”A relationship matrix including full pedigree and genomic information.” Journal of dairy science 92.9 (2009): 4656-4663.