# A plot of genes on chromosomes

Marta Cifuentes and Wayne Crismani asked on Twitter if there is a web tool similar to the Arabidopsis Chromosome Map Tool that makes figures of genes on chromosomes for humans. This will not really be an answer to the question — not a web tool, not conveniently packaged — but I thought that would be a nice plot to make in R with ggplot2. We will use the ggrepel package to help with labelling, and get information from the Ensembl REST API with httr and jsonlite.

# The plot and the final code to generate it

Below I will go through the functions that get us there, but here is the end product. The code is on GitHub as usual.

## Some Ensembl genes to test with

ensembl_genes <- c("ENSG00000125845", ## BMP2
"ENSG00000181690", ## PLAG1
"ENSG00000177508", ## IRX3
"ENSG00000140718") ## FTO

chr_sizes <- get_chromosome_sizes_from_ensembl(species = "homo_sapiens")

coords <- get_coordinates_from_ensembl(ensembl_genes)

plot_genes_test <- plot_genes(coords,
chr_sizes)


We will use Ensembl and access the genes via Ensembl Gene IDs. Here, I’ve looked up the Ensembl Gene IDs for four genes I like (in humans).

We need to know how long human chromosomes are in order to plot them, so we have a function for that; we also need to get coordinates for the genes, and we have a function for that. They are both below. These functions call up the Ensembl REST API to get the data from the Ensembl database.

Finally, there is a plotting function that takes the coordinates and the chromosome sizes as input and return a ggplot2 plot.

# Getting the data out of the Ensembl REST API

Now, starting from the top, we will need to define those functions to interact with the Ensembl REST API. This marvellous machine allows us to get data out of the Ensembl database over HTTP, writing our questions in the URL. It is nicely described with examples from multiple languages on the Ensembl REST API website.

An alternative to using the REST API would be to download gene locations from BioMart. This was my first thought. BioMart is more familiar to me than the REST API, and it also has the nice benefit that it is easy to download every gene and store it away for the future. However, there isn’t a nice way to get chromosome lengths from BioMart, so we would have to download them from somewhere else. This is isn’t hard, but I thought using the REST API for both tasks seemed nicer.

## Plot showing the location of a few genomes on chromosomes

library(httr)
library(jsonlite)
library(ggplot2)
library(ggrepel)
library(purrr)

## Get an endpoint from the Ensembl REST api and return parsed JSON

get_from_rest_api <- function(endpoint_string,
server = "https://rest.ensembl.org/") {
rest <- GET(paste(server, endpoint_string, sep = ""),
content_type("application/json"))

stop_for_status(rest)

fromJSON(toJSON(content(rest)))
}


This first function gets content by sending a request, checking whether it worked (and stopping with an error if it didn’t), and then unpacking the content into an R object.

## Get chromosomes sizes from the Ensembl REST API

get_chromosome_sizes_from_ensembl <- function(species) {

json <- get_from_rest_api(paste("info/assembly/", species, sep = ""))

data.frame(name = as.character(json$top_level_region$name),
length = as.numeric(json$top_level_region$length),
stringsAsFactors = FALSE)
}


This second function asks for the genome assembly information for a particular species with the GET info/assembly/:species endpoint, and extracts the chromosome lengths into a data frame. The first part of data gathering is done, now we just need the coordinates fort the genes of interest.

## Get coordinates from Ensembl ids

get_coordinates_from_ensembl <- function(ensembl_ids) {

map_dfr(ensembl_ids,
function(ei) {
json <- get_from_rest_api(paste("lookup/id/", ei, sep = ""))

data.frame(position = (json$start + json$end)/2,
chr = json$seq_region_name, display_name = json$display_name,
stringsAsFactors = FALSE)
})
}


This function asks for the gene information for each gene ID we’ve given it with the GET lookup/id/:id endpoint, and extracts the rough position (mean of start and end coordinate), chromosome name, and the ”display name”, which in the human case will be a gene symbol. (For genes that don’t have a gene symbol, we would need to set up this column ourselves.)

At this point, we have the data we need in two data frames. That means it’s time to make the plot.

# Plotting code

We will build a plot with two layers: first the chromosomes (as a geom_linerange) and then the gene locations (as a geom_text_repel from the ggrepel package). The text layer will move the labels around so that they don’t overlap even when the genes are close to each other, and by setting the nudge_x argument we can move them to the side of the chromosomes.

Apart from that, we change the scale to set he order of chromosomes and reverse the scale of the y-axis so that chromosomes start at the top of the plot.

The function returns a ggplot2 plot object, so one can do some further customisation after the fact — but for some features one would have to re-write things inside the function.

plot_genes <- function(coordinates,
chromosome_sizes) {

## Restrict to chromosomes that are in data
chrs_in_data <-
chromosome_sizes[chromosome_sizes$name %in% coordinates$chr,]
chr_order <- order(as.numeric(chrs_in_data$name)) ggplot() + geom_linerange(aes(x = name, ymin = 1, ymax = length/1e6), size = 2, colour = "grey", data = chrs_in_data) + geom_text_repel(aes(x = chr, y = position/1e6, label = display_name), nudge_x = 0.33, data = coordinates) + scale_y_reverse() + ## Fix ordering of chromosomes on x-axis scale_x_discrete(limits = chrs_in_data$name[chr_order],
ymax <- max(data$response) plot_points_ga <- ggplot() + geom_quasirandom(aes(x = factor(group), y = response), data = data) + xlab("Group") + ylab("Response") + theme_bw() + scale_y_continuous(limits = c(ymin, ymax)) height_of_plot <- ymax-ymin group0_fraction <- (coef(model)[1] - ymin)/height_of_plot diff_min <- - height_of_plot * group0_fraction diff_max <- (1 - group0_fraction) * height_of_plot plot_difference_ga <- ggplot() + geom_pointrange(aes(x = term, y = estimate, ymin = estimate - 2 * std.error, ymax = estimate + 2 * std.error), data = result[2,]) + scale_y_continuous(limits = c(diff_min, diff_max)) + ylab("Difference") + xlab("Comparison") + theme_bw() (plot_points_ga | plot_difference_ga) + plot_layout(widths = c(0.75, 0.25))  Literature Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2019). Moving beyond P values: data analysis with estimation graphics. Nature methods. # A genetic mapping animation in R Cullen Roth posted a beautiful animation of quantitative trait locus mapping on Twitter. It is pretty amazing. I wanted to try to make something similar in R with gganimate. It’s not going to be as beautiful as Roth’s animation, but it will use the same main idea of showing both a test statistic along the genome, and the underlying genotypes and trait values. For example, Roth’s animation has an inset scatterplot that appears above the peak after it’s been reached; to do that I think we would have to go a bit lower-level than gganimate and place our plots ourselves. First, we’ll look at a locus associated with body weight in chickens (with data from Henriksen et al. 2016), and then a simulated example. We will use ggplot2 with gganimate and a magick trick for combining the two animations. Here are some pertinent snippets of the code; as usual, find the whole thing on Github. # LOD curve We will use R/qtl for the linkage mapping. We start by loading the data file (Supplementary Dataset from Henriksen et al. 2016). A couple of individuals have missing covariates, so we won’t be able to use them. This piece of code first reads the cross file, and then removes the two offending rows. library(qtl) ## Read cross file cross <- read.cross(format = "csv", file = "41598_2016_BFsrep34031_MOESM83_ESM.csv") cross <- subset(cross, ind = c("-34336", "-34233"))  For nice plotting, let’s restrict ourselves to fully informative markers (that is, the ones that tell the two founder lines of the cross apart). There are some partially informative ones in the dataset too, and R/qtl can get some information out of them thanks to genotype probability calculations with its Hidden Markov Model. They don’t make for nice scatterplots though. This piece of code extracts the genotypes and identifies informative markers as the ones that only have genotypes codes 1, 2 or 3 (homozygote, heterozygote and other homozygote), but not 5 and 6, which are used for partially informative markers. ## Get informative markers and combine with phenotypes for plotting geno <- as.data.frame(pull.geno(cross, chr = 1)) geno_values <- lapply(geno, unique) informative <- unlist(lapply(geno_values, function(g) all(g %in% c(1:3, NA)))) geno_informative <- geno[informative]  Now for the actual scan. We run a single QTL scan with covariates (sex, batch that the chickens were reared in, and principal components of genotypes), and pull out the logarithm of the odds (LOD) across chromosome 1. This piece of code first prepares a design matrix of the covariates, and then runs a scan of chromosome 1. ## Prepare covariates pheno <- pull.pheno(cross) covar <- model.matrix(~ sex_number + batch + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10, pheno, na.action = na.exclude)[,-1] scan <- scanone(cross = cross, pheno.col = "weight_212_days", method = "hk", chr = 1, addcovar = covar)  Here is the LOD curve along chromosome 1 that want to animate. The peak is the biggest-effect growth locus in this intercross, known as ”growth1”. With gganimate, animating the points is as easy as adding a transition layer. This piece of code first makes a list of some formatting for our graphics, then extracts the LOD scores from the scan object, and makes the plot. By setting cumulative in transition_manual the animation will add one data point at the time, while keeping the old ones. library(ggplot2) library(gganimate) formatting <- list(theme_bw(base_size = 16), theme(panel.grid = element_blank(), strip.background = element_blank(), legend.position = "none"), scale_colour_manual(values = c("red", "purple", "blue"))) lod <- as.data.frame(scan) lod <- lod[informative,] lod$marker_number <- 1:nrow(lod)

plot_lod <- qplot(x = pos,
y = lod,
data = lod,
geom = c("point", "line")) +
ylab("Logarithm of odds") +
xlab("Position") +
formatting +
transition_manual(marker_number,
cumulative = TRUE)


# Plot of the underlying data

We also want a scatterplot of the data. Here what a jittered scatterplot will look like at the peak. The horizontal axes are genotypes (one homozygote, heterozygote in the middle, the other homozygote) and the vertical axis is the body mass in grams. We’ve separated the sexes into small multiples. Whether to give both sexes the same vertical axis or not is a judgement call. The hens weigh a lot less than the roosters, which means that it’s harder to see patterns among them when put on the same axis as the roosters. On the other hand, if we give the sexes different axes, it will hide that difference.

This piece of code builds a combined data frame with informative genotypes and body mass. Then, it makes the above plot for each marker into an animation.

library(tidyr)

## Combined genotypes and weight
geno_informative$id <- pheno$id
geno_informative$w212 <- pheno$weight_212_days
geno_informative$sex <- pheno$sex_number

melted <- pivot_longer(geno_informative,
-c("id", "w212", "sex"))

melted <- na.exclude(melted)

marker_numbers <- data.frame(name = rownames(scan),
marker_number = 1:nrow(scan),
stringsAsFactors = FALSE)

melted <- inner_join(melted, marker_numbers)

## Recode sex to words
melted$sex_char <- ifelse(melted$sex == 1, "male", "female")

plot_scatter <- qplot(x = value,
geom = "jitter",
y = w212,
colour = factor(value),
data = melted) +
facet_wrap(~ factor(sex_char),
ncol = 1) +
xlab("Genotype") +
ylab("Body mass") +
formatting +
transition_manual(marker_number)



# Combining the animations

And here is the final animation:

To put the pieces together, we use this magick trick (posted by Matt Crump). That is, animate the plots, one frame for each marker, and then use the R interface for ImageMagick to put them together and write them out.

gif_lod <- animate(plot_lod,
fps = 2,
width = 320,
height = 320,
nframes = sum(informative))

gif_scatter <- animate(plot_scatter,
fps = 2,
width = 320,
height = 320,
nframes = sum(informative))

## Magick trick from Matt Crump

new_gif <- image_append(c(mgif_lod[1], mgif_scatter[1]))
for(i in 2:sum(informative)){
combined <- image_append(c(mgif_lod[i], mgif_scatter[i]))
new_gif <- c(new_gif, combined)
}

image_write(new_gif, path = "out.gif", format = "gif")



Literature

Henriksen, Rie, et al. ”The domesticated brain: genetics of brain mass and brain structure in an avian species.” Scientific reports 6.1 (2016): 1-9.

# A model of polygenic adaptation in an infinite population

How do allele frequencies change in response to selection? Answers to that question include ”it depends”, ”we don’t know”, ”sometimes a lot, sometimes a little”, and ”according to a nonlinear differential equation that actually doesn’t look too horrendous if you squint a little”. Let’s look at a model of the polygenic adaptation of an infinitely large population under stabilising selection after a shift in optimum. This model has been developed by different researchers over the years (reviewed in Jain & Stephan 2017).

Here is the big equation for allele frequency change at one locus:

$\dot{p}_i = -s \gamma_i p_i q_i (c_1 - z') - \frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i) + \mu (q_i - p_i )$

That wasn’t so bad, was it? These are the symbols:

• the subscript i indexes the loci,
• $\dot{p}$ is the change in allele frequency per time,
• $\gamma_i$ is the effect of the locus on the trait (twice the effect of the positive allele to be precise),
• $p_i$ is the frequency of the positive allele,
• $q_i$ the frequency of the negative allele,
• $s$ is the strength of selection,
• $c_1$ is the phenotypic mean of the population; it just depends on the effects and allele frequencies
• $\mu$ is the mutation rate.

This breaks down into three terms that we will look at in order.

# The directional selection term

$-s \gamma_i p_i q_i (c_1 - z')$

is the term that describes change due to directional selection.

Apart from the allele frequencies, it depends on the strength of directional selection $s$, the effect of the locus on the trait $\gamma_i$ and how far away the population is from the new optimum $(c_1 - z')$. Stronger selection, larger effect or greater distance to the optimum means more allele frequency change.

It is negative because it describes the change in the allele with a positive effect on the trait, so if the mean phenotype is above the optimum, we would expect the allele frequency to decrease, and indeed: when

$(c_1 - z') < 0$

this term becomes negative.

If you neglect the other two terms and keep this one, you get Jain & Stephan's "directional selection model", which describes behaviour of allele frequencies in the early phase before the population has gotten close to the new optimum. This approximation does much of the heavy lifting in their analysis.

# The stabilising selection term

$-\frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i)$

is the term that describes change due to stabilising selection. Apart from allele frequencies, it depends on the square of the effect of the locus on the trait. That means that, regardless of the sign of the effect, it penalises large changes. This appears to make sense, because stabilising selection strives to preserve traits at the optimum. The cubic influence of allele frequency is, frankly, not intuitive to me.

# The mutation term

Finally,

$\mu (q_i - p_i )$

is the term that describes change due to new mutations. It depends on the allele frequencies, i.e. how of the alleles there are around that can mutate into the other alleles, and the mutation rate. To me, this is the one term one could sit down and write down, without much head-scratching.

# Walking in allele frequency space

Jain & Stephan (2017) show a couple of examples of allele frequency change after the optimum shift. Let us try to draw similar figures. (Jain & Stephan don’t give the exact parameters for their figures, they just show one case with effects below their threshold value and one with effects above.)

First, here is the above equation in R code:

pheno_mean <- function(p, gamma) {
sum(gamma * (2 * p - 1))
}

allele_frequency_change <- function(s, gamma, p, z_prime, mu) {
-s * gamma * p * (1 - p) * (pheno_mean(p, gamma) - z_prime) +
- s * gamma^2 * 0.5 * p * (1 - p) * (1 - p - p) +
mu * (1 - p - p)
}


With this (and some extra packaging; code on Github), we can now plot allele frequency trajectories such as this one, which starts at some arbitrary point and approaches an optimum:

Animation of alleles at two loci approaching an equilibrium. Here, we have two loci with starting frequencies 0.2 and 0.1 and effect size 1 and 0.01, and the optimum is at 0. The mutation rate is 10-4 and the strength of selection is 1. Animation made with gganimate.

# Resting in allele frequency space

The model describes a shift from one optimum to another, so we want want to start at equilibrium. Therefore, we need to know what the allele frequencies are at equilibrium, so we solve for 0 allele frequency change in the above equation. The first term will be zero, because

$(c_1 - z') = 0$

when the mean phenotype is at the optimum. So, we can throw away that term, and factor the rest equation into:

$(1 - 2p) (-\frac{s \gamma ^2}{2} p(1-p) + \mu) = 0$

Therefore, one root is $p = 1/2$. Depending on your constitution, this may or may not be intuitive to you. Imagine that you have all the loci, each with a positive and negative allele with the same effect, balanced so that half the population has one and the other half has the other. Then, there is this quadratic equation that gives two other equilibria:

$\mu - \frac{s\gamma^2}{2}p(1-p) = 0$
$\implies p = \frac{1}{2} (1 \pm \sqrt{1 - 8 \frac{\mu}{s \gamma ^2}})$

These points correspond to mutation–selection balance with one or the other allele closer to being lost. Jain & Stephan (2017) show a figure of the three equilibria that looks like a semicircle (from the quadratic equation, presumably) attached to a horizontal line at 0.5 (their Figure 1). Given this information, we can start our loci out at equilibrium frequencies. Before we set them off, we need to attend to the effect size.

# How big is a big effect? Hur långt är ett snöre?

In this model, there are big and small effects with qualitatively different behaviours. The cutoff is at:

$\hat{\gamma} = \sqrt{ \frac{8 \mu}{s}}$

If we look again at the roots to the quadratic equation above, they can only exist as real roots if

$\frac {8 \mu}{s \gamma^2} < 1$

because otherwise the expression inside the square root will be negative. This inequality can be rearranged into:

$\gamma^2 > \frac{8 \mu}{s}$

This means that if the effect of a locus is smaller than the threshold value, there is only one equilibrium point, and that is at 0.5. It also affects the way the allele frequency changes. Let us look at two two-locus cases, one where the effects are below this threshold and one where they are above it.

threshold <- function(mu, s) sqrt(8 * mu / s)

threshold(1e-4, 1)

[1] 0.02828427

With mutation rate of 10-4 and strength of selection of 1, the cutoff is about 0.028. Let our ”big” loci have effect sizes of 0.05 and our small loci have effect sizes of 0.01, then. Now, we are ready to shift the optimum.

The small loci will start at an equilibrium frequency of 0.5. We start the large loci at two different equilibrium points, where one positive allele is frequent and the other positive allele is rare:

get_equilibrium_frequencies <- function(mu, s, gamma) {
c(0.5,
0.5 * (1 + sqrt(1 - 8 * mu / (s * gamma^2))),
0.5 * (1 - sqrt(1 - 8 * mu / (s * gamma^2))))
}

(eq0.05 <- get_equilibrium_frequencies(1e-4, 1, 0.05))

[1] 0.50000000 0.91231056 0.08768944
get_equlibrium_frequencies(1e-4, 1, 0.01)

[1] 0.5 NaN NaN

# Look at them go!

These animations show the same qualitative behaviour as Jain & Stephan illustrate in their Figure 2. With small effects, there is gradual allele frequency change at both loci:

However, with large effects, one of the loci (the one on the vertical axis) dramatically changes in allele frequency, that is it’s experiencing a selective sweep, while the other one barely changes at all. And the model will show similar behaviour when the trait is properly polygenic, with many loci, as long as effects are large compared to the (scaled) mutation rate.

Here, I ran 10,000 time steps; if we look at the phenotypic means, we can see that they still haven’t arrived at the optimum at the end of that time. The mean with large effects is at 0.089 (new optimum of 0.1), and the mean with small effects is 0.0063 (new optimum: 0.02).

Let’s end here for today. Maybe another time, we can return how this model applies to actually polygenic architectures, that is, with more than two loci. The code for all the figures is on Github.

Literature

Jain, K., & Stephan, W. (2017). Modes of rapid polygenic adaptation. Molecular biology and evolution, 34(12), 3169-3175.

# Twin lambs with different fathers

I just learned that in sheep, lambs from the same litter pretty often have different fathers, if the ewe has mated with different males. Berry et al. (2020) looked at sheep flocks on Irland that used more than one ram, and:

Of the 539 pairs of twins included in the analysis, 160 (i.e. 30%) were sired by two different rams. Of the 137 sets of triplets included in the analysis, 73 (i.e. 53%) were sired by more than one ram. Of the nine sets of quadruplets, eight were sired by two rams with the remaining litter being mono‐paternal. The overall incidence of heteropaternal superfecundation among litters was therefore 35%. Given that the incidence of multiple births in these flocks was 65%, heteropaternal superfecundation is expected to be relatively common in sheep; this is especially true as all but two of the litter‐mates were polyzygotic.

They figured this out by looking at individuals genotyped on SNP chips with tens of thousands of SNPs, with both lambs and the potential parents genotyped, so there can’t be much uncertainty in the assignment. You don’t need that many genotyped markers to get a confident assignment, and they don’t have that many rams to choose from.

# Time for some Mendelian inheritance

Let’s simulate a situation like this: We set up a population and a marker panel for genotyping, split them into ewes and rams, and make some lambs.

library(AlphaSimR)

founderpop <- runMacs(nInd = 105,
nChr = 10,
segSites = 100)

simparam <- SimParam$new(founderpop) simparam$setGender("no")

values = colours$colour)  # Using R: simple Gantt chart with ggplot2 Jeremy Yoder’s code for a simple Gantt chart on the Molecular Ecologist blog uses geom_line and gather to prepare the data structure. I like using geom_linerange and a coord_flip, which lets you use start and end columns directly without pivoting. Here is a very serious data frame of activities: # A tibble: 6 x 4 activity category start end 1 Clean house preparations 2020-07-01 00:00:00 2020-07-03 00:00:00 2 Pack bags preparations 2020-07-05 10:00:00 2020-07-05 17:00:00 3 Run to train travel 2020-07-05 17:00:00 2020-07-05 17:15:00 4 Sleep on train travel 2020-07-05 17:15:00 2020-07-06 08:00:00 5 Procrastinate procrastination 2020-07-01 00:00:00 2020-07-05 00:00:00 6 Sleep vacation 2020-07-06 08:00:00 2020-07-09 00:00:00  And here is the code:  library(ggplot2) library(readr) activities <- read_csv("activities.csv") ## Set factor level to order the activities on the plot activities$activity <- factor(activities$activity, levels = activities$activity[nrow(activities):1])

plot_gantt <- qplot(ymin = start,
ymax = end,
x = activity,
colour = category,
geom = "linerange",
data = activities,
size = I(5)) +
scale_colour_manual(values = c("black", "grey", "purple", "yellow")) +
coord_flip() +
theme_bw() +
theme(panel.grid = element_blank()) +
xlab("") +
ylab("") +
ggtitle("Vacation planning")


# Using R: 10 years with R

Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day:

library(qtl)

cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")


R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again.

The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!

# Using R: from plyr to purrr, part 0 out of however many

This post is me thinking out loud about applying functions to vectors or lists and getting data frames back.

Using R is an ongoing process of finding nice ways to throw data frames, lists and model objects around. While tidyr has arrived at a comfortable way to reshape dataframes with pivot_longer and pivot_wider, I don’t always find the replacements for the good old plyr package as satisfying.

Here is an example of something I used to like to do with plyr. Don’t laugh!

Assume we have a number of text files, all in the same format, that we need to read and combine. This arises naturally if you run some kind of analysis where the dataset gets split into chunks, like in genetics, where chunks might be chromosomes.

## Generate vector of file names
files <- paste("data/chromosome", 1:20, ".txt", sep = "")

library(plyr)


This gives us one big data frame, containing the rows from all those files.

If we want to move on from plyr, what are our options?

We can go old school with base R functions lapply and Reduce.

library(readr)

genome <- Reduce(rbind, chromosomes)


Here, we first let lapply read each file and store it in a list. Then we let Reduce fold the list with rbind, which binds the data frames in the list together, one below the other.

If that didn’t make sense, here it is again: lapply maps a function to each element of a vector or list, collecting the results in a list. Reduce folds the elements in a list together, using a function that takes in two arguments. The first argument will be the results it’s accumulated so far, and the second argument will be the next element of the list.

In the end, this leaves us, as with ldply, with one big data frame.

We can also use purrr‘s map_dfr. This seems to be the contemporary most elegant solution:

library(purrr)