Showing a difference in mean between two groups, take 2

A couple of years ago, I wrote about the paradoxical difficulty of visualising a difference in means between two groups, while showing both the data and some uncertainty interval. I still feel like many ills in science come from our inability to interpret a simple comparison of means. Anything with more than two groups or a predictor that isn’t categorical makes things worse, of course. It doesn’t take much to overwhelm the intuition.

My suggestion at the time was something like this — either a panel that shows the data an another panel with coefficients and uncertainty intervals; or a plot that shows the with lines that represent the magnitude of the difference at the upper and lower limit of the uncertainty interval.

Alternative 1 (left), with separate panels for data and coefficient estimates, and alternative 2 (right), with confidence limits for the difference shown as vertical lines. For details, see the old post about these graphs.

Here is the fake data and linear model we will plot. If you want to follow along, the whole code is on GitHub. Group 0 should have a mean of 4, and the difference between groups should be two, and as the graphs above show, our linear model is not too far off from these numbers.

library(broom)

data <- data.frame(group = rep(0:1, 20))
data$response <- 4 + data$group * 2 + rnorm(20)

model <- lm(response ~ factor(group), data = data)
result <- tidy(model)

Since the last post, a colleague has told me about the Gardner-Altman plot. In a paper arguing that confidence intervals should be used to show the main result of studies, rather than p-values, Gardner & Altman (1986) introduced plots for simultaneously showing confidence intervals and data.

Their Figure 1 shows (hypothetical) systolic blood pressure data for a group of diabetic and non-diabetic people. The left panel is a dot plot for each group. The right panel is a one-dimensional plot (with a different scale than the right panel; zero is centred on the mean of one of the groups), showing the difference between the groups and a confidence interval as a point with error bars.

There are functions for making this kind of plot (and several more complex alternatives for paired comparisons and analyses of variance) in the R package dabestr from Ho et al. (2019). An example with our fake data looks like this:

Alternative 3: Gardner-Altman plot with bootstrap confidence interval.

library(dabestr)

bootstrap <- dabest(data,
                    group,
                    response,
                    idx = c("0", "1"),
                    paired = FALSE)

bootstrap_diff <- mean_diff(bootstrap)

plot(bootstrap_diff)

While this plot is neat, I think it is a little too busy — I’m not sure that the double horizontal lines between the panels or the shaded density for the bootstrap confidence interval add much. I’d also like to use other inference methods than bootstrapping. I like how the scale of the right panel has the same unit as the left panel, but is offset so the zero is at the mean of one of the groups.

Here is my attempt at making a minimalistic version:

Alternative 4: Simplified Garner-Altman plot.

This piece of code first makes the left panel of data using ggbeeswarm (which I think looks nicer than the jittering I used in the first two alternatives), then the right panel with the estimate and approximate confidence intervals of plus/minus two standard errors of the mean), adjusts the scale, and combines the panels with patchwork.

library(ggbeeswarm)
library(ggplot2
library(patchwork)

ymin <- min(data$response)
ymax <- max(data$response)

plot_points_ga <- ggplot() +
  geom_quasirandom(aes(x = factor(group), y = response),
                   data = data) +
  xlab("Group") +
  ylab("Response") +
  theme_bw() +
  scale_y_continuous(limits = c(ymin, ymax))

height_of_plot <- ymax-ymin

group0_fraction <- (coef(model)[1] - ymin)/height_of_plot

diff_min <- - height_of_plot * group0_fraction

diff_max <- (1 - group0_fraction) * height_of_plot

plot_difference_ga <- ggplot() +
  geom_pointrange(aes(x = term, y = estimate,
                      ymin = estimate - 2 * std.error,
                      ymax = estimate + 2 * std.error),
                  data = result[2,]) +
  scale_y_continuous(limits = c(diff_min, diff_max)) +
  ylab("Difference") +
  xlab("Comparison") +
  theme_bw()

(plot_points_ga | plot_difference_ga) +
    plot_layout(widths = c(0.75, 0.25))

Literature

Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal

Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2019). Moving beyond P values: data analysis with estimation graphics. Nature methods.

A genetic mapping animation in R

Cullen Roth posted a beautiful animation of quantitative trait locus mapping on Twitter. It is pretty amazing. I wanted to try to make something similar in R with gganimate. It’s not going to be as beautiful as Roth’s animation, but it will use the same main idea of showing both a test statistic along the genome, and the underlying genotypes and trait values. For example, Roth’s animation has an inset scatterplot that appears above the peak after it’s been reached; to do that I think we would have to go a bit lower-level than gganimate and place our plots ourselves.

First, we’ll look at a locus associated with body weight in chickens (with data from Henriksen et al. 2016), and then a simulated example. We will use ggplot2 with gganimate and a magick trick for combining the two animations. Here are some pertinent snippets of the code; as usual, find the whole thing on Github.

LOD curve

We will use R/qtl for the linkage mapping. We start by loading the data file (Supplementary Dataset from Henriksen et al. 2016). A couple of individuals have missing covariates, so we won’t be able to use them. This piece of code first reads the cross file, and then removes the two offending rows.

library(qtl)

## Read cross file
cross <- read.cross(format = "csv",
                    file = "41598_2016_BFsrep34031_MOESM83_ESM.csv")

cross <- subset(cross, ind = c("-34336", "-34233"))

For nice plotting, let’s restrict ourselves to fully informative markers (that is, the ones that tell the two founder lines of the cross apart). There are some partially informative ones in the dataset too, and R/qtl can get some information out of them thanks to genotype probability calculations with its Hidden Markov Model. They don’t make for nice scatterplots though. This piece of code extracts the genotypes and identifies informative markers as the ones that only have genotypes codes 1, 2 or 3 (homozygote, heterozygote and other homozygote), but not 5 and 6, which are used for partially informative markers.

## Get informative markers and combine with phenotypes for plotting

geno <- as.data.frame(pull.geno(cross,
                                chr = 1))

geno_values <- lapply(geno, unique)
informative <- unlist(lapply(geno_values,
    function(g) all(g %in% c(1:3, NA))))

geno_informative <- geno[informative]

Now for the actual scan. We run a single QTL scan with covariates (sex, batch that the chickens were reared in, and principal components of genotypes), and pull out the logarithm of the odds (LOD) across chromosome 1. This piece of code first prepares a design matrix of the covariates, and then runs a scan of chromosome 1.

## Prepare covariates
pheno <- pull.pheno(cross)

covar <- model.matrix(~ sex_number + batch + PC1 + PC2 + PC3 + PC4 + 
                        PC5 + PC6 + PC7 + PC8 + PC9 + PC10,
                      pheno,
                      na.action = na.exclude)[,-1]

scan <- scanone(cross = cross,
                pheno.col = "weight_212_days",
                method = "hk",
                chr = 1,
                addcovar = covar)

Here is the LOD curve along chromosome 1 that want to animate. The peak is the biggest-effect growth locus in this intercross, known as ”growth1”.

With gganimate, animating the points is as easy as adding a transition layer. This piece of code first makes a list of some formatting for our graphics, then extracts the LOD scores from the scan object, and makes the plot. By setting cumulative in transition_manual the animation will add one data point at the time, while keeping the old ones.

library(ggplot2)
library(gganimate)

formatting <- list(theme_bw(base_size = 16),
                   theme(panel.grid = element_blank(),
                         strip.background = element_blank(),
                         legend.position = "none"),
                   scale_colour_manual(values =
                         c("red", "purple", "blue")))

lod <- as.data.frame(scan)
lod <- lod[informative,]
lod$marker_number <- 1:nrow(lod)

plot_lod <- qplot(x = pos,
                  y = lod,
                  data = lod,
                  geom = c("point", "line")) +
  ylab("Logarithm of odds") +
  xlab("Position") +
  formatting +
  transition_manual(marker_number,
                    cumulative = TRUE)

Plot of the underlying data

We also want a scatterplot of the data. Here what a jittered scatterplot will look like at the peak. The horizontal axes are genotypes (one homozygote, heterozygote in the middle, the other homozygote) and the vertical axis is the body mass in grams. We’ve separated the sexes into small multiples. Whether to give both sexes the same vertical axis or not is a judgement call. The hens weigh a lot less than the roosters, which means that it’s harder to see patterns among them when put on the same axis as the roosters. On the other hand, if we give the sexes different axes, it will hide that difference.

This piece of code builds a combined data frame with informative genotypes and body mass. Then, it makes the above plot for each marker into an animation.

library(tidyr)

## Combined genotypes and weight
geno_informative$id <- pheno$id
geno_informative$w212 <- pheno$weight_212_days
geno_informative$sex <- pheno$sex_number

melted <- pivot_longer(geno_informative,
                       -c("id", "w212", "sex"))

melted <- na.exclude(melted)

## Add marker numbers
marker_numbers <- data.frame(name = rownames(scan),
                             marker_number = 1:nrow(scan),
                             stringsAsFactors = FALSE)

melted <- inner_join(melted, marker_numbers)

## Recode sex to words
melted$sex_char <- ifelse(melted$sex == 1, "male", "female")

plot_scatter <- qplot(x = value,
                     geom = "jitter",
                     y = w212,
                     colour = factor(value),
                     data = melted) +
  facet_wrap(~ factor(sex_char),
             ncol = 1) +
  xlab("Genotype") +
  ylab("Body mass") +
  formatting +
  transition_manual(marker_number)

Combining the animations

And here is the final animation:

To put the pieces together, we use this magick trick (posted by Matt Crump). That is, animate the plots, one frame for each marker, and then use the R interface for ImageMagick to put them together and write them out.

gif_lod <- animate(plot_lod,
                   fps = 2,
                   width = 320,
                   height = 320,
                   nframes = sum(informative))

gif_scatter <- animate(plot_scatter,
                       fps = 2,
                       width = 320,
                       height = 320,
                       nframes = sum(informative))


## Magick trick from Matt Crump

mgif_lod <- image_read(gif_lod)
mgif_scatter <- image_read(gif_scatter)

new_gif <- image_append(c(mgif_lod[1], mgif_scatter[1]))
for(i in 2:sum(informative)){
  combined <- image_append(c(mgif_lod[i], mgif_scatter[i]))
  new_gif <- c(new_gif, combined)
}

image_write(new_gif, path = "out.gif", format = "gif")

Literature

Henriksen, Rie, et al. ”The domesticated brain: genetics of brain mass and brain structure in an avian species.” Scientific reports 6.1 (2016): 1-9.

A model of polygenic adaptation in an infinite population

How do allele frequencies change in response to selection? Answers to that question include ”it depends”, ”we don’t know”, ”sometimes a lot, sometimes a little”, and ”according to a nonlinear differential equation that actually doesn’t look too horrendous if you squint a little”. Let’s look at a model of the polygenic adaptation of an infinitely large population under stabilising selection after a shift in optimum. This model has been developed by different researchers over the years (reviewed in Jain & Stephan 2017).

Here is the big equation for allele frequency change at one locus:

\dot{p}_i = -s \gamma_i p_i q_i (c_1 - z') - \frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i) + \mu (q_i - p_i )

That wasn’t so bad, was it? These are the symbols:

  • the subscript i indexes the loci,
  • \dot{p} is the change in allele frequency per time,
  • \gamma_i is the effect of the locus on the trait (twice the effect of the positive allele to be precise),
  • p_i is the frequency of the positive allele,
  • q_i the frequency of the negative allele,
  • s is the strength of selection,
  • c_1 is the phenotypic mean of the population; it just depends on the effects and allele frequencies
  • \mu is the mutation rate.

This breaks down into three terms that we will look at in order.

The directional selection term

-s \gamma_i p_i q_i (c_1 - z')

is the term that describes change due to directional selection.

Apart from the allele frequencies, it depends on the strength of directional selection s, the effect of the locus on the trait \gamma_i and how far away the population is from the new optimum (c_1 - z'). Stronger selection, larger effect or greater distance to the optimum means more allele frequency change.

It is negative because it describes the change in the allele with a positive effect on the trait, so if the mean phenotype is above the optimum, we would expect the allele frequency to decrease, and indeed: when

(c_1 - z') < 0

this term becomes negative.

If you neglect the other two terms and keep this one, you get Jain & Stephan's "directional selection model", which describes behaviour of allele frequencies in the early phase before the population has gotten close to the new optimum. This approximation does much of the heavy lifting in their analysis.

The stabilising selection term

-\frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i)

is the term that describes change due to stabilising selection. Apart from allele frequencies, it depends on the square of the effect of the locus on the trait. That means that, regardless of the sign of the effect, it penalises large changes. This appears to make sense, because stabilising selection strives to preserve traits at the optimum. The cubic influence of allele frequency is, frankly, not intuitive to me.

The mutation term

Finally,

\mu (q_i - p_i )

is the term that describes change due to new mutations. It depends on the allele frequencies, i.e. how of the alleles there are around that can mutate into the other alleles, and the mutation rate. To me, this is the one term one could sit down and write down, without much head-scratching.

Walking in allele frequency space

Jain & Stephan (2017) show a couple of examples of allele frequency change after the optimum shift. Let us try to draw similar figures. (Jain & Stephan don’t give the exact parameters for their figures, they just show one case with effects below their threshold value and one with effects above.)

First, here is the above equation in R code:

pheno_mean <- function(p, gamma) {
  sum(gamma * (2 * p - 1))
}

allele_frequency_change <- function(s, gamma, p, z_prime, mu) {
  -s * gamma * p * (1 - p) * (pheno_mean(p, gamma) - z_prime) +
    - s * gamma^2 * 0.5 * p * (1 - p) * (1 - p - p) +
    mu * (1 - p - p)
}

With this (and some extra packaging; code on Github), we can now plot allele frequency trajectories such as this one, which starts at some arbitrary point and approaches an optimum:

Animation of alleles at two loci approaching an equilibrium. Here, we have two loci with starting frequencies 0.2 and 0.1 and effect size 1 and 0.01, and the optimum is at 0. The mutation rate is 10-4 and the strength of selection is 1. Animation made with gganimate.

Resting in allele frequency space

The model describes a shift from one optimum to another, so we want want to start at equilibrium. Therefore, we need to know what the allele frequencies are at equilibrium, so we solve for 0 allele frequency change in the above equation. The first term will be zero, because

(c_1 - z') = 0

when the mean phenotype is at the optimum. So, we can throw away that term, and factor the rest equation into:

(1 - 2p) (-\frac{s \gamma ^2}{2} p(1-p) + \mu) = 0

Therefore, one root is p = 1/2. Depending on your constitution, this may or may not be intuitive to you. Imagine that you have all the loci, each with a positive and negative allele with the same effect, balanced so that half the population has one and the other half has the other. Then, there is this quadratic equation that gives two other equilibria:

\mu - \frac{s\gamma^2}{2}p(1-p) = 0
\implies p = \frac{1}{2} (1 \pm \sqrt{1 - 8 \frac{\mu}{s \gamma ^2}})

These points correspond to mutation–selection balance with one or the other allele closer to being lost. Jain & Stephan (2017) show a figure of the three equilibria that looks like a semicircle (from the quadratic equation, presumably) attached to a horizontal line at 0.5 (their Figure 1). Given this information, we can start our loci out at equilibrium frequencies. Before we set them off, we need to attend to the effect size.

How big is a big effect? Hur långt är ett snöre?

In this model, there are big and small effects with qualitatively different behaviours. The cutoff is at:

\hat{\gamma} = \sqrt{ \frac{8 \mu}{s}}

If we look again at the roots to the quadratic equation above, they can only exist as real roots if

\frac {8 \mu}{s \gamma^2} < 1

because otherwise the expression inside the square root will be negative. This inequality can be rearranged into:

\gamma^2 > \frac{8 \mu}{s}

This means that if the effect of a locus is smaller than the threshold value, there is only one equilibrium point, and that is at 0.5. It also affects the way the allele frequency changes. Let us look at two two-locus cases, one where the effects are below this threshold and one where they are above it.

threshold <- function(mu, s) sqrt(8 * mu / s)

threshold(1e-4, 1)
[1] 0.02828427

With mutation rate of 10-4 and strength of selection of 1, the cutoff is about 0.028. Let our ”big” loci have effect sizes of 0.05 and our small loci have effect sizes of 0.01, then. Now, we are ready to shift the optimum.

The small loci will start at an equilibrium frequency of 0.5. We start the large loci at two different equilibrium points, where one positive allele is frequent and the other positive allele is rare:

get_equilibrium_frequencies <- function(mu, s, gamma) {
  c(0.5,
    0.5 * (1 + sqrt(1 - 8 * mu / (s * gamma^2))),
    0.5 * (1 - sqrt(1 - 8 * mu / (s * gamma^2))))
}

(eq0.05 <- get_equilibrium_frequencies(1e-4, 1, 0.05))
[1] 0.50000000 0.91231056 0.08768944
get_equlibrium_frequencies(1e-4, 1, 0.01)
[1] 0.5 NaN NaN

Look at them go!

These animations show the same qualitative behaviour as Jain & Stephan illustrate in their Figure 2. With small effects, there is gradual allele frequency change at both loci:

However, with large effects, one of the loci (the one on the vertical axis) dramatically changes in allele frequency, that is it’s experiencing a selective sweep, while the other one barely changes at all. And the model will show similar behaviour when the trait is properly polygenic, with many loci, as long as effects are large compared to the (scaled) mutation rate.

Here, I ran 10,000 time steps; if we look at the phenotypic means, we can see that they still haven’t arrived at the optimum at the end of that time. The mean with large effects is at 0.089 (new optimum of 0.1), and the mean with small effects is 0.0063 (new optimum: 0.02).

Let’s end here for today. Maybe another time, we can return how this model applies to actually polygenic architectures, that is, with more than two loci. The code for all the figures is on Github.

Literature

Jain, K., & Stephan, W. (2017). Modes of rapid polygenic adaptation. Molecular biology and evolution, 34(12), 3169-3175.

Twin lambs with different fathers

I just learned that in sheep, lambs from the same litter pretty often have different fathers, if the ewe has mated with different males. Berry et al. (2020) looked at sheep flocks on Irland that used more than one ram, and:

Of the 539 pairs of twins included in the analysis, 160 (i.e. 30%) were sired by two different rams. Of the 137 sets of triplets included in the analysis, 73 (i.e. 53%) were sired by more than one ram. Of the nine sets of quadruplets, eight were sired by two rams with the remaining litter being mono‐paternal. The overall incidence of heteropaternal superfecundation among litters was therefore 35%. Given that the incidence of multiple births in these flocks was 65%, heteropaternal superfecundation is expected to be relatively common in sheep; this is especially true as all but two of the litter‐mates were polyzygotic.

They figured this out by looking at individuals genotyped on SNP chips with tens of thousands of SNPs, with both lambs and the potential parents genotyped, so there can’t be much uncertainty in the assignment. You don’t need that many genotyped markers to get a confident assignment, and they don’t have that many rams to choose from.

Time for some Mendelian inheritance

Let’s simulate a situation like this: We set up a population and a marker panel for genotyping, split them into ewes and rams, and make some lambs.

library(AlphaSimR)

founderpop <- runMacs(nInd = 105,
                      nChr = 10,
                      segSites = 100)

simparam <- SimParam$new(founderpop)
simparam$setGender("no")

simparam$addSnpChip(nSnpPerChr = 100)

parents <- newPop(founderpop,
                  simParam = simparam)

ewes <- parents[1:100]
rams <- parents[101:105]

lambs <- randCross2(females = ewes,
                    males = rams,
                    nCrosses = 100,
                    nProgeny = 2,
                    simParam = simparam)

Now, if we have the genotypes of a lamb and its mother, how do we know the father? In this paper, they use exclusion methods: They compared the genotypes from the offspring with the parents and used inheritance rules to exclude rams that can't be the father because if they were, the offspring couldn't have the genotypes it had. Such breaking of regular inheritance patterns would be a "Mendelian inconsistency". This is the simplest kind of parentage assignment; fancier methods will calculate the probabilities of different genotypes, and allow you to reconstruct unknown relationships.

We can do this in two ways:

  • ignore the ewe’s genotypes and look for opposite homozygotes between lamb and ram, which are impossible regardless of the mother’s genotype
  • use both the ewe’s and ram’s genotypes to look what lamb genotypes are possible from a cross between them; this adds a few more cases where we can exclude a ram even if the lamb is heterozygous

To do the first, we count the number of opposite homozygous markers. In this genotype coding, 0 and 2 are homozygotes, and 1 is a heterozygous marker.

opposite_homozygotes <- function(ram,
                                 lamb) {
    sum(lamb == 0 & ram == 2) +
        sum(lamb == 2 & ram == 0)

}

When we include the ewe's genotype, there are a few more possible cases. We could enumerate all of them, but here is some R code to generate them. We first get all possible gametes from each parent, we combine the gametes in all possible combinations, and that gives us the possible lamb genotypes at that marker. If the lamb does, in fact, not have any of those genotypes, we declare the marker inconsistent. Repeat for all markers.

## Generate the possible gametes from a genotype

possible_gametes <- function(genotype) {

    if (genotype == 0) {
        gametes <- 0
    } else if (genotype == 1) {
        gametes <- c(0, 1)
    } else if (genotype == 2) {
        gametes <- 1
    }

    gametes
}

## Generate the possible genotypes for an offspring from
## parent possible gametes

possible_genotypes <- function(father_gametes,
                               mother_gametes) {

    possible_combinations <- expand.grid(father_gametes, mother_gametes)
    resulting_genotypes <- rowSums(possible_combinations)
    unique(resulting_genotypes)
}

## Check offspring genotypes for consistency with parent genotypes

mendelian_inconsistency <- function(ewe,
                                    ram,
                                    lamb) {

    n_markers <- length(ewe)
    inconsistent <- logical(n_markers)

    for (marker_ix in 1:n_markers) {

        possible_lamb_genotypes <-
          possible_genotypes(possible_gametes(ewe[marker_ix]),
                             possible_gametes(ram[marker_ix]))

        inconsistent[marker_ix] <-
          !lamb[marker_ix] %in% possible_lamb_genotypes
    }

    sum(inconsistent)
}

(These functions assume that we have genotypes in vectors. The full code that extracts this information from the simulated data and repeats for all markers is on Gitbhub.)

Here is the outcome for a set of random lambs. The red dots point out the true fathers: because we have perfect genotype data simulated without errors, the true father always has 100% consistent markers.

If we compare how many markers are found inconsistent with the two methods, we get a pattern like this graph. Including the ewe’s genotypes lets us discover a lot more inconsistent markers, but in this case, with plentiful and error-free markers, it doesn’t make a difference.

Thresholds and errors

If I have any complaint with the paper, it’s that the parentage analysis isn’t really described in the methods. This is what it says:

Parentage testing using simple exclusion‐based approaches is determined by the proportion of opposing homozygotes in putative sire–offspring pairs.

/…/

Maternal verification was undertaken using the exclusion method (Double et al . 1997) comparing the genotype of the dam with that of her putative progeny and only validated dam–offspring pairs were retained. Genotypes of the mature rams in the flock were compared with all lambs born in that flock using the exclusion method.

(The reference is related to exclusion methods, but it’s describing how to calculate exclusion probabilities in a certain circumstance. That is, it’s part of a methodological conversation about exclusion methods, but doesn’t actually describe what they did.)

I don’t doubt that they did it well. Still, it would be interesting to know the details, because in the absence of perfect genotype data, they must have had some thresholds for error and some criterion for deciding which ram was right, even if it seemed obvious.

Literature

Berry, D. P., et al. ”Heteropaternal superfecundation frequently occurs in multiple‐bearing mob‐mated sheep.” Animal Genetics (2020).

Using R: setting a colour scheme in ggplot2

Note to self: How to quickly set a colour scheme in ggplot2.

Imagine we have a series of plots that all need a uniform colour scale. The same category needs to have the same colour in all graphics, made possibly with different packages and by different people. Instead of hard-coding the colours and the order of categories, we can put them in a file, like so:

library(readr)
colours <- read_csv("scale_colours.csv")
# A tibble: 5 x 2
  name   colour 
      
1 blue   #d4b9da
2 red    #c994c7
3 purple #df65b0
4 green  #dd1c77
5 orange #980043

Now a plot with default colours, using some made-up data:

x <- 1:100

beta <- rnorm(5, 1, 0.5)

stroop <- data.frame(x,
                     sapply(beta, function(b) x * b + rnorm(100, 1, 10)))
colnames(stroop)[2:6] <- c("orange", "blue", "red", "purple", "green") 

data_long <- pivot_longer(stroop, -x)

plot_y <- qplot(x = x,
                y = value,
                colour = name,
                data = data_long) +
  theme_minimal() +
  theme(panel.grid = element_blank())

Now we can add the custom scale like this:

plot_y_colours <- plot_y + 
  scale_colour_manual(limits = colours$name,
                      values = colours$colour)


Using R: simple Gantt chart with ggplot2

Jeremy Yoder’s code for a simple Gantt chart on the Molecular Ecologist blog uses geom_line and gather to prepare the data structure. I like using geom_linerange and a coord_flip, which lets you use start and end columns directly without pivoting.

Here is a very serious data frame of activities:

# A tibble: 6 x 4
  activity       category        start               end                
                                                  
1 Clean house    preparations    2020-07-01 00:00:00 2020-07-03 00:00:00
2 Pack bags      preparations    2020-07-05 10:00:00 2020-07-05 17:00:00
3 Run to train   travel          2020-07-05 17:00:00 2020-07-05 17:15:00
4 Sleep on train travel          2020-07-05 17:15:00 2020-07-06 08:00:00
5 Procrastinate  procrastination 2020-07-01 00:00:00 2020-07-05 00:00:00
6 Sleep          vacation        2020-07-06 08:00:00 2020-07-09 00:00:00

And here is the code:


library(ggplot2)
library(readr)

activities <- read_csv("activities.csv")

## Set factor level to order the activities on the plot
activities$activity <- factor(activities$activity,
                              levels = activities$activity[nrow(activities):1])
    
plot_gantt <- qplot(ymin = start,
                    ymax = end,
                    x = activity,
                    colour = category,
                    geom = "linerange",
                    data = activities,
                    size = I(5)) +
    scale_colour_manual(values = c("black", "grey", "purple", "yellow")) +
    coord_flip() +
    theme_bw() +
    theme(panel.grid = element_blank()) +
    xlab("") +
    ylab("") +
    ggtitle("Vacation planning")

Using R: 10 years with R

Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day:

library(qtl)

cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")

R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again.

The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!

Using R: from plyr to purrr, part 0 out of however many

This post is me thinking out loud about applying functions to vectors or lists and getting data frames back.

Using R is an ongoing process of finding nice ways to throw data frames, lists and model objects around. While tidyr has arrived at a comfortable way to reshape dataframes with pivot_longer and pivot_wider, I don’t always find the replacements for the good old plyr package as satisfying.

Here is an example of something I used to like to do with plyr. Don’t laugh!

Assume we have a number of text files, all in the same format, that we need to read and combine. This arises naturally if you run some kind of analysis where the dataset gets split into chunks, like in genetics, where chunks might be chromosomes.

## Generate vector of file names
files <- paste("data/chromosome", 1:20, ".txt", sep = "")

library(plyr)
library(readr)
genome <- ldply(files, read_tsv)

This gives us one big data frame, containing the rows from all those files.

If we want to move on from plyr, what are our options?

We can go old school with base R functions lapply and Reduce.

library(readr)

chromosomes <- lapply(files, read_tsv)
genome <- Reduce(rbind, chromosomes)

Here, we first let lapply read each file and store it in a list. Then we let Reduce fold the list with rbind, which binds the data frames in the list together, one below the other.

If that didn’t make sense, here it is again: lapply maps a function to each element of a vector or list, collecting the results in a list. Reduce folds the elements in a list together, using a function that takes in two arguments. The first argument will be the results it’s accumulated so far, and the second argument will be the next element of the list.

In the end, this leaves us, as with ldply, with one big data frame.

We can also use purrr‘s map_dfr. This seems to be the contemporary most elegant solution:

library(purrr)
library(readr)

genome <- map_dfr(files, read_tsv)

map_dfr, like good old ldply will map over a vector or list, and collect resulting data frames. The ”r” in the name means adding the next data frame as rows. There is also a ”c” version (map_dfc) for adding as columns.

Using R: Animal model with hglm and Stan (with Cholesky trick)

A few weeks ago I posted about fitting the quantitative genetic animal model with MCMCglmm and R-INLA. Since then, I listened to a talk by Lars Rönnegård, one of the creators of the hglm package, and this paper was published in GSE about animal models in Stan.

hglm

The hglm package fits hierarchical generalised linear models. That includes the animal model with pedigree or genomic relatedness. Hierarchical generalised linear models also allow you to model the dispersion of random effects, which lets you do tricks like variance QTL mapping (Rönnegård & Valdar 2011), breeding values for variances (Rönnegård et al. 2010) or genomic prediction models with predictors of marker variance (Mouresan, Selle & Rönnegård 2019). But let’s not get ahead of ourselves. How do we fit an animal model?

Here is the matrix formulation of the animal model that we skim through in every paper. It’s in this post because we will use the design matrix interface to hglm, which needs us to give it these matrices (this is not a paper, so we’re not legally obliged to include it):

\mathbf{y} = \mu + \mathbf{X} \mathbf{b} + \mathbf{Z} \mathbf{a} + \mathbf{e}

The terms are the the trait value, intercept, fixed coefficients and their design matrix, genetic coefficients and their design matrix, and the residual. The design matrix Z will contain one row and column for each individual, with a 1 to indicate its position in the phenotype table and pedigree and the rest zeros. If we sort our files, it’s an identity matrix.

The trick with the genetic coefficients is that they’re correlated, with a specific known correlation structure that we know from the pedigree (or in genomic models, from markers). It turns out (Lee, Nelder & Pawitan 2017, chapter 8) that you can change the Z matrix around so that it lets you fit the model with an identity covariance matrix, while still accounting for the correlations between relatives. You replace the random effects for relatedness with some transformed random effects that capture the same structure. One way to do this is with Cholesky decomposition.

\mathbf{Z_{fudged}} = \mathbf{Z_0} \mathbf{L}

As an example of what the Cholesky decomposition does, here is slice of the additive relationship matrix of 100 simulated individuals (the last generation of one replicate of these simulations) and the resulting matrix from Cholesky decomposition.

So instead of

\mathbf{a} \sim N(0, \mathbf{A} \sigma)

We can fit

\mathbf{a_{fudged}} \sim N(0, \mathbf{I} \sigma)

This lets us fit the animal model with hglm, by putting in a modified Z matrix.

Assuming we have data frames with a pedigree and a phenotype (like, again, from these simulations):

library(AGHmatrix)
library(hglm)

A  <- Amatrix(ped)

Z0  <- diag(1000)
L <- t(chol(A))
Z  <- Z0 %*% L
X <- model.matrix(~1, pheno)

model <- hglm(y = pheno$pheno,
              X = X,
              Z = Z,
              conv = 1e-8)

est_h2  <- model$varRanef / (model$varRanef + model$varFix)

(I found the recommendation to decrease the convergence criterion from the default for animal models in a YouTube video by Xia Chen.)

Stan

When we turn to Stan, we will meet the Cholesky trick again. Stan is a software for Markov Chain Monte Carlo, built to fit hierarchical linear models, and related high-dimensional models, more effectively than other sampling strategies (like Gibbs). rstan is a helpful package for running Stan from within R.

Nishio & Arakawa (2019) recently published a Stan script to fit an animal model, comparing Stan to a Gibbs sampler (and a related MCMC sampler that they also didn’t publish the code for). If we look into their Stan model code, they also do a Cholesky decomposition to be able to use an identity matrix for the variance.

First, they decompose the additive relationship matrix that the program takes in:

transformed data{
  matrix[K,K] LA;
  LA = cholesky_decompose(A);
}

And then, they express the model like this:

vector[N] mu;
vector[K] a;
a_decompose ~ normal(0, 1);
a = sigma_G * (LA * a_decompose);
mu = X * b + Z * a;
Y ~ normal(mu, sigma_R);

We can add this line to the generated quantities block of the Stan program to get heritability estimates directly:

real h2;
h2 = sigma_U / (sigma_U + sigma_E)

Here, we’ve saved their model to a stan file, and now we can run it from R:

pheno$scaled_pheno <- as.vector(scale(pheno$pheno))

model_stan <- stan(file = "nishio_arakawa.stan",
                   data = list(Y = pheno$scaled_pheno,
                               X = X,
                               A = A,
                               Z = Z0,
                               J = 1,
                               K = 1000,
                               N = 1000))

est_h2_stan <- summary(model_stan, pars = "h2")$summary

Important note that I always forget: It's important to scale your traits before you run this model. If not, the priors might be all wrong.

The last line pulls out the summary for the heritability parameter (that we added above). This gives us an estimate and an interval.

The paper also contains this entertaining passage about performance, which reads as if it was a response to a comment, actual or anticipated:

R language is highly extensible and provides a myriad of statistical and graphical techniques. However, R language has poor computation time compared to Fortran, which is especially well suited to numeric computation and scientific computing. In the present study, we developed the programs for GS and HMC in R but did not examine computation time; instead, we focused on examining the performance of estimating genetic parameters and breeding values.

Yes, two of their samplers (Gibbs and HMC) were written in R, but the one they end up advocating (and the one used above), is in Stan. Stan code gets translated into C++ and then compiled to machine code.

Stan with brms

If rstan lets us run Stan code from R and examine the output, brms lets us write down models in relatively straightforward R syntax. It’s like the MCMCglmm of the Stan world. We can fit an animal model with brms too, by directly plugging in the relationship matrix:

model_brms <- brm(scaled_pheno ~ 1 + (1|animal),
                  data = pheno,
                  family = gaussian(),
                  cov_ranef = list(animal = A),
                  chains = 4,
                  cores = 1,
                  iter = 2000)

Then, we can pull out the posterior samples for the variability, here expressed as standard errors, compute the heritability and then get the estimates (and interval, if we want):

posterior_brms <- posterior_samples(model_brms,
                                    pars = c("sd_animal", "sigma"))

h2_brms  <- posterior_brms[,1]^2 /
    (posterior_brms[,1]^2 + posterior_brms[,2]^2)

est_h2_brms <- mean(h2_brms)

(Code is on GitHub: both for the graphs above, and the models.)

Using R: Installing GenABEL and RepeatABEL

GenABEL is an R package for performing genome-wide association with linear mixed models and a genomic relationship matrix. RepeatABEL is a package for such genome-wide association studies that also need repeated measures.

Unfortunately, since 2018, GenABEL is not available on CRAN anymore, because of failed checks that were not fixed. (Checks are archived on CRAN, but this means very little to me.) As a consequence, RepeatABEL is also missing.

Fair enough, the GenABEL creators probably aren’t paid to maintain old software. It is a bit tragic, however, to think that in 2016, GenABEL was supposed to be the core of a community project to develop a suite of genomic analysis packages, two years before it was taken of CRAN:

The original publication of the GenABEL package for statistical analysis of genotype data has led to the evolution of a community which we now call the GenABEL project, which brings together scientists, software developers and end users with the central goal of making statistical genomics work by openly developing and subsequently implementing statistical models into user-friendly software.

The project has benefited from an open development model, facilitating communication and code sharing between the parties involved. The use of a free software licence for the tools in the GenABEL suite promotes quick uptake and widespread dissemination of new methodologies and tools. Moreover, public access to the source code is an important ingredient for active participation by people from outside the core development team and is paramount for reproducible research. Feedback from end users is actively encouraged through a web forum, which steadily grows into a knowledge base with a multitude of answered questions. Furthermore, our open development process has resulted in transparent development of methods and software, including public code review, a large fraction of bugs being submitted by members of the community, and quick incorporation of bug fixes.

I have no special insight about the circumstances here, but obviously the situation is far from ideal. You can still use the packages, though, with a little more effort to install. Who knows how long that will be the case, though. In a complex web of dependencies like the R package ecosystem, an unmaintained package probably won’t last.

GenABEL can probably be replaced by something like GEMMA. It does mixed models for GWAS, and while it isn’t an R package, it’s probably about as convenient. However, I don’t know of a good alternative to RepeatABEL.

These are the steps to install GenABEL and RepeatABEL from archives:

  1. We go to the CRAN archive and get the tarballs for GenABEL, GenABEL.data which it needs, and RepeatABEL.
    curl -O https://cran.r-project.org/src/contrib/Archive/GenABEL/GenABEL_1.8-0.tar.gz
    curl -O https://cran.r-project.org/src/contrib/Archive/GenABEL.data/GenABEL.data_1.0.0.tar.gz
    curl -O https://cran.r-project.org/src/contrib/Archive/RepeatABEL/RepeatABEL_1.1.tar.gz
    

    We don’t need to unpack them.

  2. Install GenABEL.data and GenABEL from a local source. Inside R, we can use install.packages, using the files we’ve just downloaded instead of the online repository.
    install.packages(c("GenABEL.data_1.0.0.tar.gz", "GenABEL_1.8-0.tar.gz"), repos = NULL)
    
  3. To install RepeatABEL, we first need hglm, which we can get from CRAN. After that has finished, we install RepeatABEL, again from local source:
    install.packages("hglm")
    install.packages("RepeatABEL_1.1.tar.gz", repos = NULL)
    

This worked on R version 3.6.1 running on Ubuntu 16.04, and also on Mac OS X.

Literature

Karssen, Lennart C., Cornelia M. van Duijn, and Yurii S. Aulchenko. ”The GenABEL Project for statistical genomics.” F1000Research 5 (2016).