# Using R: plyr to purrr, part 1

This is the second post about my journey towards writing more modern Tidyverse-style R code; here is the previous one. We will look at the common case of taking subset of data out of a data frame, making some complex R object from them, and then extracting summaries from those objects.

I miss the plyr package. Especially ddply, ldply, and dlply, my favourite trio of R functions of all times. Yes, the contemporary Tidyverse package dplyr is fast and neat. And those plyr functions operating on arrays were maybe overkill; I seldom used them. But plyr was so smooth, so beautiful, and — after you’ve bashed your head against it for some time and it changed your mind — so intuitive. The package still exists, but it’s retired, and we shouldn’t keep writing R code like it’s 2009, should we?

I used to like to do something like this: take a data frame, push it through a function that returns some complex object, store those objects in a list, and then map various functions over that list to extract the parts I care about. What is the equivalent in the new idiom? To do the same thing but with the purrr package, of course! purrr replaces the list-centric parts of plyr, while dplyr covers data frame-centric summarisation, mutation and so on.

For this example we will be using the lm function on subsets of data and store the model object. It’s the simple case that everyone reaches for to demonstrate these features, but it’s a bit dubious. If you find yourself fitting a lot of linear models to subsets, maybe there are other models you should think about Especially here, when the fake data just happens to come from a multilevel model with varying intercepts … But in this case, let’s simulate a simple linear regression and look at the coefficients we get out.

set.seed(20210807)

n_groups <- 10
group_sizes <- rpois(n_groups, 10)
n <- sum(group_sizes)

fake_data <- tibble(id = 1:n,
group = rep(1:n_groups,
times = group_sizes),
predictor = runif(n, 0, 1))
group_intercept <- rnorm(n_groups, 0, 1)

fake_data$response <- fake_data$predictor * 10 +
group_intercept[fake_data$group] + rnorm(n)  And here is the plyr code: First, dlply takes us from a data frame, splitting it by group, to a list of linear models. Then, ldply takes us from the list of models to a data frame of coefficients. tidy is a function from the wonderful broom package which extracts the same information as you would get in the rather unwieldy object from summary(lm), but as a data frame. library(plyr) library(broom) fit_model <- function(data) { lm(response ~ predictor, data) } models <- dlply(fake_data, "group", fit_model) result <- ldply(models, tidy)  This is what the results might looks like. Notice how ldply adds the split labels nicely into the group column, so we know which rows came from which subset.  group term estimate std.error statistic p.value 1 1 (Intercept) -0.2519167 0.5757214 -0.4375670 6.732729e-01 2 1 predictor 10.6136902 1.0051490 10.5593207 5.645878e-06 3 2 (Intercept) 3.1528489 0.6365294 4.9531864 7.878498e-04 4 2 predictor 8.2075766 1.1458702 7.1627452 5.292586e-05 5 3 (Intercept) -0.8103777 0.6901212 -1.1742542 2.786901e-01 ...  # split/map: The modern synthesis If we pull out purrr, we can get the exact same table like so. The one difference is that we get a tibble (that is, a contemporary, more well-behaved data frame) out of it instead of a base R data.frame. library(purrr) models <- map(split(fake_data, fake_data$group),
fit_model)
result <- map_df(models,
tidy,
.id = "group")

# A tibble: 80 x 6
group term        estimate std.error statistic  p.value

1 1     (Intercept)     1.67     0.773      2.16 6.32e- 2
2 1     predictor       8.67     1.36       6.39 2.12e- 4
3 2     (Intercept)     4.11     0.566      7.26 4.75e- 5
4 2     predictor       8.19     1.11       7.36 4.30e- 5
5 3     (Intercept)    -7.50     0.952     -7.89 9.99e- 5
6 3     predictor      11.5      1.75       6.60 3.03e- 4
7 4     (Intercept)   -19.8      0.540    -36.7  7.32e-13
8 4     predictor      11.5      0.896     12.8  5.90e- 8
9 5     (Intercept)   -12.4      1.03     -12.0  7.51e- 7
10 5     predictor       9.69     1.82       5.34 4.71e- 4
# … with 70 more rows


First, the base function split lets us break the data into subsets based on the values of a variable, which in this case is our group variable. The output of this function is a list of data frames, one for each group.

Second, we use map to apply a function to each element of that list. The function is the same modelling function that we used above, which shoves the data into lm. We now have our list of linear models.

Third, we apply the tidy function to each element of that list of models. Because we want the result to be one single data frame consolidating the output from each element, we use map_df, which will combine the results for us. (If we’d just use map again, we would get a list of data frames.) The .id argument tells map to add the group column that indicates what element of the list of models each row comes from. We want this to be able to identify the coefficients.

If we want to be fancy, we can express with the Tidyverse-related pipe and dot notation:

library(magrittr)

result <- fake_data %>%
display_name = json$display_name, stringsAsFactors = FALSE) }) }  This function asks for the gene information for each gene ID we’ve given it with the GET lookup/id/:id endpoint, and extracts the rough position (mean of start and end coordinate), chromosome name, and the ”display name”, which in the human case will be a gene symbol. (For genes that don’t have a gene symbol, we would need to set up this column ourselves.) At this point, we have the data we need in two data frames. That means it’s time to make the plot. # Plotting code We will build a plot with two layers: first the chromosomes (as a geom_linerange) and then the gene locations (as a geom_text_repel from the ggrepel package). The text layer will move the labels around so that they don’t overlap even when the genes are close to each other, and by setting the nudge_x argument we can move them to the side of the chromosomes. Apart from that, we change the scale to set he order of chromosomes and reverse the scale of the y-axis so that chromosomes start at the top of the plot. The function returns a ggplot2 plot object, so one can do some further customisation after the fact — but for some features one would have to re-write things inside the function. plot_genes <- function(coordinates, chromosome_sizes) { ## Restrict to chromosomes that are in data chrs_in_data <- chromosome_sizes[chromosome_sizes$name %in% coordinates$chr,] chr_order <- order(as.numeric(chrs_in_data$name))

ggplot() +
geom_linerange(aes(x = name,
ymin = 1,
ymax = length/1e6),
size = 2,
colour = "grey",
data = chrs_in_data) +
geom_text_repel(aes(x = chr,
y = position/1e6,
label = display_name),
nudge_x = 0.33,
data = coordinates) +
scale_y_reverse() +
## Fix ordering of chromosomes on x-axis
scale_x_discrete(limits = chrs_in_data$name[chr_order], labels = chrs_in_data$name[chr_order]) +
theme_bw() +
theme(panel.grid = element_blank()) +
xlab("Chromosome") +
ylab("Position (Mbp)")

}


# Possible extensions

One feature from the Arabidopsis inspiration that is missing here is the position of centromeres. We should be able to use the option ?bands=1 in the GET info/assembly/:species to get cytogenetic band information and separate p and q arms of chromosomes. This will not be universal though, i.e. not available for most species I care about.

Except to make cartoons of gene positions, I think this might be a nice way to make plots of genome regions with very course resolution, i.e. linkage mapping results, where one could add lines to show genomic confidence intervals, for example.

# Convincing myself about the Monty Hall problem

Like many others, I’ve never felt that the solution to the Monty Hall problem was intuitive, despite the fact that explanations of the correct solution are everywhere. I am not alone. Famously, columnist Marilyn vos Savant got droves of mail from people trying to school her after she had published the correct solution.

The problem goes like this: You are a contestant on a game show (based on a real game show hosted by Monty Hall, hence the name). The host presents you with three doors, one of which contains a prize — say, a goat — and the others are empty. After you’ve made your choice, the host opens one of the doors, showing that it is empty. You are now asked whether you would like to stick to your initial choice, or switch to the other door. The right thing to do is to switch, which gives you 2/3 probability of winning the goat. This can be demonstrated in a few different ways.

A goat is a great prize. Image: Casey Goat by Pete Markham (CC BY-SA 2.0)

So I sat down to do 20 physical Monty Hall simulations on paper. I shuffled three cards with the options, picked one, and then, playing the role of the host, took away one losing option, and noted down if switching or holding on to the first choice would have been the right thing to do. The results came out 13 out of 20 (65%) wins for the switching strategy, and 7 out of 20 (35%) for the holding strategy. Of course, the Monty Hall Truthers out there must question whether this demonstration in fact happened — it’s too perfect, isn’t it?

The outcome of the simulation is less important than the feeling that came over me as I was running it, though. As I was taking on the role of the host and preparing to take away one of the losing options, it started feeling self-evident that the important thing is whether the first choice is right. If the first choice is right, holding is the right strategy. If the first choice is wrong, switching is the right option. And the first choice, clearly, is only right 1/3 of the time.

In this case, it was helpful to take the game show host’s perspective. Selvin (1975) discussed the solution to the problem in The American Statistician, and included a quote from Monty Hall himself:

Monty Hall wrote and expressed that he was not ”a student of statistics problems” but ”the big hole in your argument is that once the first box is seen to be empty, the contestant cannot exchange his box.” He continues to say, ”Oh, and incidentally, after one [box] is seen to be empty, his chances are no longer 50/50 but remain what they were in the first place, one out of three. It just seems to the contestant that one box having been eliminated, he stands a better chance. Not so.” I could not have said it better myself.

# A generalised problem

Now, imagine the same problem with a number d number of doors, w number of prizes and o number of losing doors that are opened after the first choice is made. We assume that the losing doors are opened at random, and that switching entails picking one of the remaining doors at random. What is the probability of winning with the switching strategy?

The probability of picking the a door with or without a prize is:

$\Pr(\text{pick right first}) = \frac{w}{d}$

$\Pr(\text{pick wrong first}) = 1 - \frac{w}{d}$

If we picked a right door first, we have w – 1 winning options left out of d – o – 1 doors after the host opens o doors:

$\Pr(\text{win\textbar right first}) = \frac{w - 1}{d - o - 1}$

If we picked the wrong door first, we have all the winning options left:

$\Pr(\text{win\textbar wrong first}) = \frac{w}{d - o - 1}$

Putting it all together:

$\Pr(\text{win\textbar switch}) = \Pr(\text{pick right first}) \cdot \Pr(\text{win\textbar right first}) + \\ + \Pr(\text{pick wrong first}) \cdot \Pr(\text{win\textbar wrong first}) = \\ = \frac{w}{d} \frac{w - 1}{d - o - 1} + (1 - \frac{w}{d}) \frac{w}{d - o - 1}$

As before, for the hold strategy, the probability of winning is the probability of getting it right the first time:

$\Pr(\text{win\textbar hold}) = \frac{w}{d}$

With the original Monty Hall problem, w = 1, d = 3 and o = 1, and therefore

$\Pr(\text{win\textbar switch}) = \frac{1}{3} \cdot 0 + \frac{2}{3} \cdot 1$

Selvin (1975) also present a generalisation due to Ferguson, where there are n options and p doors that are opened after the choice. That is, w = 1, d = 3 and o = 1. Therefore,

$\Pr(\text{win\textbar switch}) = \frac{1}{n} \cdot 0 + (1 - \frac{1}{n}) \frac{1}{n - p - 1} = \frac{n - 1}{n(n - p - 1)}$

which is Ferguson’s formula.

Finally, in Marilyn vos Savant’s column, she used this thought experiment to illustrate why switching is the right thing to do:

Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?

That is, w = 1 still, d = 106 and o = 106 – 2.

$\Pr(\text{win\textbar switch}) = 1 - \frac{1}{10^6}$

It turns out that the solution to the generalised problem is that it is always better to switch, as long as there is a prize, and as long as the host opens any doors. One can also generalise it to choosing sets of more than one door. This makes some intuitive sense: as long as the host takes opens some doors, taking away losing options, switching should enrich for prizes.

# Some code

To be frank, I’m not sure I have convinced myself of the solution to the generalised problem yet. However, using the code below, I did try the calculation for all combinations of total number of doors, prizes and doors opened up to 100, and in all cases, switching wins. That inspires some confidence, should I end up on a small ruminant game show.

The code below first defines a wrapper around R’s sampling function, which has a very annoying alternative behaviour when fed a vector of length one, to be able to build a computational version of my physical simulation. Finally, we have a function for the above formulae. (See whole thing on GitHub if you are interested.)

## Wrap sample into a function that avoids the "convenience"
## behaviour that happens when the length of x is one

sample_safer <- function(to_sample, n) {
assert_that(n <= length(to_sample))
if (length(to_sample) == 1)
return(to_sample)
else {
return(sample(to_sample, n))
}
}

## Simulate a generalised Monty Hall situation with
## w prizes, d doors and o doors that are opened.

sim_choice <- function(w, d, o) {
## There has to be less prizes than unopened doors
assert_that(w < d - o)
wins <- rep(1, w)
losses <- rep(0, d - w)
doors <- c(wins, losses)

## Pick a door
choice <- sample_safer(1:d, 1)

## Doors that can be opened
to_open_from <- which(doors == 0)

## Chosen door can't be opened
to_open_from <- to_open_from[to_open_from != choice]

## Doors to open
to_open <- sample_safer(to_open_from, o)

## Switch to one of the remaining doors
possible_switches <- setdiff(1:d, c(to_open, choice))
choice_after_switch <- sample_safer(possible_switches , 1)

result_hold <- doors[choice]
result_switch <- doors[choice_after_switch]
c(result_hold,
result_switch)
}

## Formulas for probabilities

mh_formula <- function(w, d, o) {
## There has to be less prizes than unopened doors
assert_that(w < d - o)

p_win_switch <- w/d * (w - 1)/(d - o - 1) +
(1 - w/d) * w / (d - o - 1)
p_win_hold <- w/d
c(p_win_hold,
p_win_switch)
}

## Standard Monty Hall

mh <- replicate(1000, sim_choice(1, 3, 1))

> mh_formula(1, 3, 1)
[1] 0.3333333 0.6666667
> rowSums(mh)/ncol(mh)
[1] 0.347 0.653

# The Monty Hall problem problem

Guest & Martin (2020) use this simple problem as their illustration for computational model building: two 12 inch pizzas for the same price as one 18 inch pizza is not a good deal, because the 18 inch pizza contains more food. Apparently this is counter-intuitive to many people who have intuitions about inches and pizzas.

They call the risk of having inconsistencies in our scientific understanding because we cannot intuitively grasp the implications of our models ”The pizza problem”, arguing that it can be ameliorated by computational modelling, which forces you to spell out implicit assumptions and also makes you actually run the numbers. Having a formal model of areas of circles doesn’t help much, unless you plug in the numbers.

The Monty Hall problem problem is the pizza problem with a vengeance; not only is it hard to intuitively grasp what is going on in the problem, but even when presented with compelling evidence, the mental resistance might still remain and lead people to write angry letters and tweets.

Literature

Guest, O, & Martin, AE (2020). How computational modeling can force theory building in psychological science. Preprint.

Selvin, S (1975) On the Monty Hall problem. The American Statistician 29:3 p.134.

# Showing a difference in mean between two groups, take 2

A couple of years ago, I wrote about the paradoxical difficulty of visualising a difference in means between two groups, while showing both the data and some uncertainty interval. I still feel like many ills in science come from our inability to interpret a simple comparison of means. Anything with more than two groups or a predictor that isn’t categorical makes things worse, of course. It doesn’t take much to overwhelm the intuition.

My suggestion at the time was something like this — either a panel that shows the data an another panel with coefficients and uncertainty intervals; or a plot that shows the with lines that represent the magnitude of the difference at the upper and lower limit of the uncertainty interval.

Alternative 1 (left), with separate panels for data and coefficient estimates, and alternative 2 (right), with confidence limits for the difference shown as vertical lines. For details, see the old post about these graphs.

Here is the fake data and linear model we will plot. If you want to follow along, the whole code is on GitHub. Group 0 should have a mean of 4, and the difference between groups should be two, and as the graphs above show, our linear model is not too far off from these numbers.

library(broom)

data <- data.frame(group = rep(0:1, 20))
data$response <- 4 + data$group * 2 + rnorm(20)

model <- lm(response ~ factor(group), data = data)
result <- tidy(model)


Since the last post, a colleague has told me about the Gardner-Altman plot. In a paper arguing that confidence intervals should be used to show the main result of studies, rather than p-values, Gardner & Altman (1986) introduced plots for simultaneously showing confidence intervals and data.

Their Figure 1 shows (hypothetical) systolic blood pressure data for a group of diabetic and non-diabetic people. The left panel is a dot plot for each group. The right panel is a one-dimensional plot (with a different scale than the right panel; zero is centred on the mean of one of the groups), showing the difference between the groups and a confidence interval as a point with error bars.

There are functions for making this kind of plot (and several more complex alternatives for paired comparisons and analyses of variance) in the R package dabestr from Ho et al. (2019). An example with our fake data looks like this:

Alternative 3: Gardner-Altman plot with bootstrap confidence interval.

library(dabestr)

bootstrap <- dabest(data,
group,
response,
idx = c("0", "1"),
paired = FALSE)

bootstrap_diff <- mean_diff(bootstrap)

plot(bootstrap_diff)


While this plot is neat, I think it is a little too busy — I’m not sure that the double horizontal lines between the panels or the shaded density for the bootstrap confidence interval add much. I’d also like to use other inference methods than bootstrapping. I like how the scale of the right panel has the same unit as the left panel, but is offset so the zero is at the mean of one of the groups.

Here is my attempt at making a minimalistic version:

Alternative 4: Simplified Garner-Altman plot.

This piece of code first makes the left panel of data using ggbeeswarm (which I think looks nicer than the jittering I used in the first two alternatives), then the right panel with the estimate and approximate confidence intervals of plus/minus two standard errors of the mean), adjusts the scale, and combines the panels with patchwork.

library(ggbeeswarm)
library(ggplot2
library(patchwork)

ymin <- min(data$response) ymax <- max(data$response)

plot_points_ga <- ggplot() +
geom_quasirandom(aes(x = factor(group), y = response),
data = data) +
xlab("Group") +
ylab("Response") +
theme_bw() +
scale_y_continuous(limits = c(ymin, ymax))

height_of_plot <- ymax-ymin

group0_fraction <- (coef(model)[1] - ymin)/height_of_plot

diff_min <- - height_of_plot * group0_fraction

diff_max <- (1 - group0_fraction) * height_of_plot

plot_difference_ga <- ggplot() +
geom_pointrange(aes(x = term, y = estimate,
ymin = estimate - 2 * std.error,
ymax = estimate + 2 * std.error),
data = result[2,]) +
scale_y_continuous(limits = c(diff_min, diff_max)) +
ylab("Difference") +
xlab("Comparison") +
theme_bw()

(plot_points_ga | plot_difference_ga) +
plot_layout(widths = c(0.75, 0.25))


Literature

Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal

Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2019). Moving beyond P values: data analysis with estimation graphics. Nature methods.

# A genetic mapping animation in R

Cullen Roth posted a beautiful animation of quantitative trait locus mapping on Twitter. It is pretty amazing. I wanted to try to make something similar in R with gganimate. It’s not going to be as beautiful as Roth’s animation, but it will use the same main idea of showing both a test statistic along the genome, and the underlying genotypes and trait values. For example, Roth’s animation has an inset scatterplot that appears above the peak after it’s been reached; to do that I think we would have to go a bit lower-level than gganimate and place our plots ourselves.

First, we’ll look at a locus associated with body weight in chickens (with data from Henriksen et al. 2016), and then a simulated example. We will use ggplot2 with gganimate and a magick trick for combining the two animations. Here are some pertinent snippets of the code; as usual, find the whole thing on Github.

# LOD curve

We will use R/qtl for the linkage mapping. We start by loading the data file (Supplementary Dataset from Henriksen et al. 2016). A couple of individuals have missing covariates, so we won’t be able to use them. This piece of code first reads the cross file, and then removes the two offending rows.

library(qtl)

file = "41598_2016_BFsrep34031_MOESM83_ESM.csv")

cross <- subset(cross, ind = c("-34336", "-34233"))


For nice plotting, let’s restrict ourselves to fully informative markers (that is, the ones that tell the two founder lines of the cross apart). There are some partially informative ones in the dataset too, and R/qtl can get some information out of them thanks to genotype probability calculations with its Hidden Markov Model. They don’t make for nice scatterplots though. This piece of code extracts the genotypes and identifies informative markers as the ones that only have genotypes codes 1, 2 or 3 (homozygote, heterozygote and other homozygote), but not 5 and 6, which are used for partially informative markers.

## Get informative markers and combine with phenotypes for plotting

geno <- as.data.frame(pull.geno(cross,
chr = 1))

geno_values <- lapply(geno, unique)
informative <- unlist(lapply(geno_values,
function(g) all(g %in% c(1:3, NA))))

geno_informative <- geno[informative]


Now for the actual scan. We run a single QTL scan with covariates (sex, batch that the chickens were reared in, and principal components of genotypes), and pull out the logarithm of the odds (LOD) across chromosome 1. This piece of code first prepares a design matrix of the covariates, and then runs a scan of chromosome 1.

## Prepare covariates
pheno <- pull.pheno(cross)

covar <- model.matrix(~ sex_number + batch + PC1 + PC2 + PC3 + PC4 +
PC5 + PC6 + PC7 + PC8 + PC9 + PC10,
pheno,
na.action = na.exclude)[,-1]

scan <- scanone(cross = cross,
pheno.col = "weight_212_days",
method = "hk",
chr = 1,


Here is the LOD curve along chromosome 1 that want to animate. The peak is the biggest-effect growth locus in this intercross, known as ”growth1”.

With gganimate, animating the points is as easy as adding a transition layer. This piece of code first makes a list of some formatting for our graphics, then extracts the LOD scores from the scan object, and makes the plot. By setting cumulative in transition_manual the animation will add one data point at the time, while keeping the old ones.

library(ggplot2)
library(gganimate)

formatting <- list(theme_bw(base_size = 16),
theme(panel.grid = element_blank(),
strip.background = element_blank(),
legend.position = "none"),
scale_colour_manual(values =
c("red", "purple", "blue")))

lod <- as.data.frame(scan)
lod <- lod[informative,]
lod$marker_number <- 1:nrow(lod) plot_lod <- qplot(x = pos, y = lod, data = lod, geom = c("point", "line")) + ylab("Logarithm of odds") + xlab("Position") + formatting + transition_manual(marker_number, cumulative = TRUE)  # Plot of the underlying data We also want a scatterplot of the data. Here what a jittered scatterplot will look like at the peak. The horizontal axes are genotypes (one homozygote, heterozygote in the middle, the other homozygote) and the vertical axis is the body mass in grams. We’ve separated the sexes into small multiples. Whether to give both sexes the same vertical axis or not is a judgement call. The hens weigh a lot less than the roosters, which means that it’s harder to see patterns among them when put on the same axis as the roosters. On the other hand, if we give the sexes different axes, it will hide that difference. This piece of code builds a combined data frame with informative genotypes and body mass. Then, it makes the above plot for each marker into an animation. library(tidyr) ## Combined genotypes and weight geno_informative$id <- pheno$id geno_informative$w212 <- pheno$weight_212_days geno_informative$sex <- pheno$sex_number melted <- pivot_longer(geno_informative, -c("id", "w212", "sex")) melted <- na.exclude(melted) ## Add marker numbers marker_numbers <- data.frame(name = rownames(scan), marker_number = 1:nrow(scan), stringsAsFactors = FALSE) melted <- inner_join(melted, marker_numbers) ## Recode sex to words melted$sex_char <- ifelse(melted$sex == 1, "male", "female") plot_scatter <- qplot(x = value, geom = "jitter", y = w212, colour = factor(value), data = melted) + facet_wrap(~ factor(sex_char), ncol = 1) + xlab("Genotype") + ylab("Body mass") + formatting + transition_manual(marker_number)  # Combining the animations And here is the final animation: To put the pieces together, we use this magick trick (posted by Matt Crump). That is, animate the plots, one frame for each marker, and then use the R interface for ImageMagick to put them together and write them out. gif_lod <- animate(plot_lod, fps = 2, width = 320, height = 320, nframes = sum(informative)) gif_scatter <- animate(plot_scatter, fps = 2, width = 320, height = 320, nframes = sum(informative)) ## Magick trick from Matt Crump mgif_lod <- image_read(gif_lod) mgif_scatter <- image_read(gif_scatter) new_gif <- image_append(c(mgif_lod[1], mgif_scatter[1])) for(i in 2:sum(informative)){ combined <- image_append(c(mgif_lod[i], mgif_scatter[i])) new_gif <- c(new_gif, combined) } image_write(new_gif, path = "out.gif", format = "gif")  Literature Henriksen, Rie, et al. ”The domesticated brain: genetics of brain mass and brain structure in an avian species.” Scientific reports 6.1 (2016): 1-9. # A model of polygenic adaptation in an infinite population How do allele frequencies change in response to selection? Answers to that question include ”it depends”, ”we don’t know”, ”sometimes a lot, sometimes a little”, and ”according to a nonlinear differential equation that actually doesn’t look too horrendous if you squint a little”. Let’s look at a model of the polygenic adaptation of an infinitely large population under stabilising selection after a shift in optimum. This model has been developed by different researchers over the years (reviewed in Jain & Stephan 2017). Here is the big equation for allele frequency change at one locus: $\dot{p}_i = -s \gamma_i p_i q_i (c_1 - z') - \frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i) + \mu (q_i - p_i )$ That wasn’t so bad, was it? These are the symbols: • the subscript i indexes the loci, • $\dot{p}$ is the change in allele frequency per time, • $\gamma_i$ is the effect of the locus on the trait (twice the effect of the positive allele to be precise), • $p_i$ is the frequency of the positive allele, • $q_i$ the frequency of the negative allele, • $s$ is the strength of selection, • $c_1$ is the phenotypic mean of the population; it just depends on the effects and allele frequencies • $\mu$ is the mutation rate. This breaks down into three terms that we will look at in order. # The directional selection term $-s \gamma_i p_i q_i (c_1 - z')$ is the term that describes change due to directional selection. Apart from the allele frequencies, it depends on the strength of directional selection $s$, the effect of the locus on the trait $\gamma_i$ and how far away the population is from the new optimum $(c_1 - z')$. Stronger selection, larger effect or greater distance to the optimum means more allele frequency change. It is negative because it describes the change in the allele with a positive effect on the trait, so if the mean phenotype is above the optimum, we would expect the allele frequency to decrease, and indeed: when $(c_1 - z') < 0$ this term becomes negative. If you neglect the other two terms and keep this one, you get Jain & Stephan's "directional selection model", which describes behaviour of allele frequencies in the early phase before the population has gotten close to the new optimum. This approximation does much of the heavy lifting in their analysis. # The stabilising selection term $-\frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i)$ is the term that describes change due to stabilising selection. Apart from allele frequencies, it depends on the square of the effect of the locus on the trait. That means that, regardless of the sign of the effect, it penalises large changes. This appears to make sense, because stabilising selection strives to preserve traits at the optimum. The cubic influence of allele frequency is, frankly, not intuitive to me. # The mutation term Finally, $\mu (q_i - p_i )$ is the term that describes change due to new mutations. It depends on the allele frequencies, i.e. how of the alleles there are around that can mutate into the other alleles, and the mutation rate. To me, this is the one term one could sit down and write down, without much head-scratching. # Walking in allele frequency space Jain & Stephan (2017) show a couple of examples of allele frequency change after the optimum shift. Let us try to draw similar figures. (Jain & Stephan don’t give the exact parameters for their figures, they just show one case with effects below their threshold value and one with effects above.) First, here is the above equation in R code: pheno_mean <- function(p, gamma) { sum(gamma * (2 * p - 1)) } allele_frequency_change <- function(s, gamma, p, z_prime, mu) { -s * gamma * p * (1 - p) * (pheno_mean(p, gamma) - z_prime) + - s * gamma^2 * 0.5 * p * (1 - p) * (1 - p - p) + mu * (1 - p - p) }  With this (and some extra packaging; code on Github), we can now plot allele frequency trajectories such as this one, which starts at some arbitrary point and approaches an optimum: Animation of alleles at two loci approaching an equilibrium. Here, we have two loci with starting frequencies 0.2 and 0.1 and effect size 1 and 0.01, and the optimum is at 0. The mutation rate is 10-4 and the strength of selection is 1. Animation made with gganimate. # Resting in allele frequency space The model describes a shift from one optimum to another, so we want want to start at equilibrium. Therefore, we need to know what the allele frequencies are at equilibrium, so we solve for 0 allele frequency change in the above equation. The first term will be zero, because $(c_1 - z') = 0$ when the mean phenotype is at the optimum. So, we can throw away that term, and factor the rest equation into: $(1 - 2p) (-\frac{s \gamma ^2}{2} p(1-p) + \mu) = 0$ Therefore, one root is $p = 1/2$. Depending on your constitution, this may or may not be intuitive to you. Imagine that you have all the loci, each with a positive and negative allele with the same effect, balanced so that half the population has one and the other half has the other. Then, there is this quadratic equation that gives two other equilibria: $\mu - \frac{s\gamma^2}{2}p(1-p) = 0$ $\implies p = \frac{1}{2} (1 \pm \sqrt{1 - 8 \frac{\mu}{s \gamma ^2}})$ These points correspond to mutation–selection balance with one or the other allele closer to being lost. Jain & Stephan (2017) show a figure of the three equilibria that looks like a semicircle (from the quadratic equation, presumably) attached to a horizontal line at 0.5 (their Figure 1). Given this information, we can start our loci out at equilibrium frequencies. Before we set them off, we need to attend to the effect size. # How big is a big effect? Hur långt är ett snöre? In this model, there are big and small effects with qualitatively different behaviours. The cutoff is at: $\hat{\gamma} = \sqrt{ \frac{8 \mu}{s}}$ If we look again at the roots to the quadratic equation above, they can only exist as real roots if $\frac {8 \mu}{s \gamma^2} < 1$ because otherwise the expression inside the square root will be negative. This inequality can be rearranged into: $\gamma^2 > \frac{8 \mu}{s}$ This means that if the effect of a locus is smaller than the threshold value, there is only one equilibrium point, and that is at 0.5. It also affects the way the allele frequency changes. Let us look at two two-locus cases, one where the effects are below this threshold and one where they are above it. threshold <- function(mu, s) sqrt(8 * mu / s) threshold(1e-4, 1)  [1] 0.02828427 With mutation rate of 10-4 and strength of selection of 1, the cutoff is about 0.028. Let our ”big” loci have effect sizes of 0.05 and our small loci have effect sizes of 0.01, then. Now, we are ready to shift the optimum. The small loci will start at an equilibrium frequency of 0.5. We start the large loci at two different equilibrium points, where one positive allele is frequent and the other positive allele is rare: get_equilibrium_frequencies <- function(mu, s, gamma) { c(0.5, 0.5 * (1 + sqrt(1 - 8 * mu / (s * gamma^2))), 0.5 * (1 - sqrt(1 - 8 * mu / (s * gamma^2)))) } (eq0.05 <- get_equilibrium_frequencies(1e-4, 1, 0.05))  [1] 0.50000000 0.91231056 0.08768944 get_equlibrium_frequencies(1e-4, 1, 0.01)  [1] 0.5 NaN NaN # Look at them go! These animations show the same qualitative behaviour as Jain & Stephan illustrate in their Figure 2. With small effects, there is gradual allele frequency change at both loci: However, with large effects, one of the loci (the one on the vertical axis) dramatically changes in allele frequency, that is it’s experiencing a selective sweep, while the other one barely changes at all. And the model will show similar behaviour when the trait is properly polygenic, with many loci, as long as effects are large compared to the (scaled) mutation rate. Here, I ran 10,000 time steps; if we look at the phenotypic means, we can see that they still haven’t arrived at the optimum at the end of that time. The mean with large effects is at 0.089 (new optimum of 0.1), and the mean with small effects is 0.0063 (new optimum: 0.02). Let’s end here for today. Maybe another time, we can return how this model applies to actually polygenic architectures, that is, with more than two loci. The code for all the figures is on Github. Literature Jain, K., & Stephan, W. (2017). Modes of rapid polygenic adaptation. Molecular biology and evolution, 34(12), 3169-3175. # Twin lambs with different fathers I just learned that in sheep, lambs from the same litter pretty often have different fathers, if the ewe has mated with different males. Berry et al. (2020) looked at sheep flocks on Irland that used more than one ram, and: Of the 539 pairs of twins included in the analysis, 160 (i.e. 30%) were sired by two different rams. Of the 137 sets of triplets included in the analysis, 73 (i.e. 53%) were sired by more than one ram. Of the nine sets of quadruplets, eight were sired by two rams with the remaining litter being mono‐paternal. The overall incidence of heteropaternal superfecundation among litters was therefore 35%. Given that the incidence of multiple births in these flocks was 65%, heteropaternal superfecundation is expected to be relatively common in sheep; this is especially true as all but two of the litter‐mates were polyzygotic. They figured this out by looking at individuals genotyped on SNP chips with tens of thousands of SNPs, with both lambs and the potential parents genotyped, so there can’t be much uncertainty in the assignment. You don’t need that many genotyped markers to get a confident assignment, and they don’t have that many rams to choose from. # Time for some Mendelian inheritance Let’s simulate a situation like this: We set up a population and a marker panel for genotyping, split them into ewes and rams, and make some lambs. library(AlphaSimR) founderpop <- runMacs(nInd = 105, nChr = 10, segSites = 100) simparam <- SimParam$new(founderpop)
simparam$setGender("no") simparam$addSnpChip(nSnpPerChr = 100)

parents <- newPop(founderpop,
simParam = simparam)

ewes <- parents[1:100]
rams <- parents[101:105]

lambs <- randCross2(females = ewes,
males = rams,
nCrosses = 100,
nProgeny = 2,
simParam = simparam)


Now, if we have the genotypes of a lamb and its mother, how do we know the father? In this paper, they use exclusion methods: They compared the genotypes from the offspring with the parents and used inheritance rules to exclude rams that can't be the father because if they were, the offspring couldn't have the genotypes it had. Such breaking of regular inheritance patterns would be a "Mendelian inconsistency". This is the simplest kind of parentage assignment; fancier methods will calculate the probabilities of different genotypes, and allow you to reconstruct unknown relationships.

We can do this in two ways:

• ignore the ewe’s genotypes and look for opposite homozygotes between lamb and ram, which are impossible regardless of the mother’s genotype
• use both the ewe’s and ram’s genotypes to look what lamb genotypes are possible from a cross between them; this adds a few more cases where we can exclude a ram even if the lamb is heterozygous

To do the first, we count the number of opposite homozygous markers. In this genotype coding, 0 and 2 are homozygotes, and 1 is a heterozygous marker.

opposite_homozygotes <- function(ram,
lamb) {
sum(lamb == 0 & ram == 2) +
sum(lamb == 2 & ram == 0)

}


When we include the ewe's genotype, there are a few more possible cases. We could enumerate all of them, but here is some R code to generate them. We first get all possible gametes from each parent, we combine the gametes in all possible combinations, and that gives us the possible lamb genotypes at that marker. If the lamb does, in fact, not have any of those genotypes, we declare the marker inconsistent. Repeat for all markers.

## Generate the possible gametes from a genotype

possible_gametes <- function(genotype) {

if (genotype == 0) {
gametes <- 0
} else if (genotype == 1) {
gametes <- c(0, 1)
} else if (genotype == 2) {
gametes <- 1
}

gametes
}

## Generate the possible genotypes for an offspring from
## parent possible gametes

possible_genotypes <- function(father_gametes,
mother_gametes) {

possible_combinations <- expand.grid(father_gametes, mother_gametes)
resulting_genotypes <- rowSums(possible_combinations)
unique(resulting_genotypes)
}

## Check offspring genotypes for consistency with parent genotypes

mendelian_inconsistency <- function(ewe,
ram,
lamb) {

n_markers <- length(ewe)
inconsistent <- logical(n_markers)

for (marker_ix in 1:n_markers) {

possible_lamb_genotypes <-
possible_genotypes(possible_gametes(ewe[marker_ix]),
possible_gametes(ram[marker_ix]))

inconsistent[marker_ix] <-
!lamb[marker_ix] %in% possible_lamb_genotypes
}

sum(inconsistent)
}


(These functions assume that we have genotypes in vectors. The full code that extracts this information from the simulated data and repeats for all markers is on Gitbhub.)

Here is the outcome for a set of random lambs. The red dots point out the true fathers: because we have perfect genotype data simulated without errors, the true father always has 100% consistent markers.

If we compare how many markers are found inconsistent with the two methods, we get a pattern like this graph. Including the ewe’s genotypes lets us discover a lot more inconsistent markers, but in this case, with plentiful and error-free markers, it doesn’t make a difference.

# Thresholds and errors

If I have any complaint with the paper, it’s that the parentage analysis isn’t really described in the methods. This is what it says:

Parentage testing using simple exclusion‐based approaches is determined by the proportion of opposing homozygotes in putative sire–offspring pairs.

/…/

Maternal verification was undertaken using the exclusion method (Double et al . 1997) comparing the genotype of the dam with that of her putative progeny and only validated dam–offspring pairs were retained. Genotypes of the mature rams in the flock were compared with all lambs born in that flock using the exclusion method.

(The reference is related to exclusion methods, but it’s describing how to calculate exclusion probabilities in a certain circumstance. That is, it’s part of a methodological conversation about exclusion methods, but doesn’t actually describe what they did.)

I don’t doubt that they did it well. Still, it would be interesting to know the details, because in the absence of perfect genotype data, they must have had some thresholds for error and some criterion for deciding which ram was right, even if it seemed obvious.

Literature

Berry, D. P., et al. ”Heteropaternal superfecundation frequently occurs in multiple‐bearing mob‐mated sheep.” Animal Genetics (2020).

# Using R: setting a colour scheme in ggplot2

Note to self: How to quickly set a colour scheme in ggplot2.

Imagine we have a series of plots that all need a uniform colour scale. The same category needs to have the same colour in all graphics, made possibly with different packages and by different people. Instead of hard-coding the colours and the order of categories, we can put them in a file, like so:

library(readr)

# A tibble: 5 x 2
name   colour

1 blue   #d4b9da
2 red    #c994c7
3 purple #df65b0
4 green  #dd1c77
5 orange #980043

Now a plot with default colours, using some made-up data:

x <- 1:100

beta <- rnorm(5, 1, 0.5)

stroop <- data.frame(x,
sapply(beta, function(b) x * b + rnorm(100, 1, 10)))
colnames(stroop)[2:6] <- c("orange", "blue", "red", "purple", "green")

data_long <- pivot_longer(stroop, -x)

plot_y <- qplot(x = x,
y = value,
colour = name,
data = data_long) +
theme_minimal() +
theme(panel.grid = element_blank())


Now we can add the custom scale like this:

plot_y_colours <- plot_y +
scale_colour_manual(limits = colours$name, values = colours$colour)



# Using R: simple Gantt chart with ggplot2

Jeremy Yoder’s code for a simple Gantt chart on the Molecular Ecologist blog uses geom_line and gather to prepare the data structure. I like using geom_linerange and a coord_flip, which lets you use start and end columns directly without pivoting.

Here is a very serious data frame of activities:

# A tibble: 6 x 4
activity       category        start               end

1 Clean house    preparations    2020-07-01 00:00:00 2020-07-03 00:00:00
2 Pack bags      preparations    2020-07-05 10:00:00 2020-07-05 17:00:00
3 Run to train   travel          2020-07-05 17:00:00 2020-07-05 17:15:00
4 Sleep on train travel          2020-07-05 17:15:00 2020-07-06 08:00:00
5 Procrastinate  procrastination 2020-07-01 00:00:00 2020-07-05 00:00:00
6 Sleep          vacation        2020-07-06 08:00:00 2020-07-09 00:00:00


And here is the code:


library(ggplot2)

## Set factor level to order the activities on the plot
activities$activity <- factor(activities$activity,
levels = activities\$activity[nrow(activities):1])

plot_gantt <- qplot(ymin = start,
ymax = end,
x = activity,
colour = category,
geom = "linerange",
data = activities,
size = I(5)) +
scale_colour_manual(values = c("black", "grey", "purple", "yellow")) +
coord_flip() +
theme_bw() +
theme(panel.grid = element_blank()) +
xlab("") +
ylab("") +
ggtitle("Vacation planning")


# Using R: 10 years with R

Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day:

library(qtl)

cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")


R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again.

The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!