# Using R: 10 years with R

Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day:

library(qtl)

cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")

R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again.

The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!

# Using R: from plyr to purrr, part 0 out of however many

This post is me thinking out loud about applying functions to vectors or lists and getting data frames back.

Using R is an ongoing process of finding nice ways to throw data frames, lists and model objects around. While tidyr has arrived at a comfortable way to reshape dataframes with pivot_longer and pivot_wider, I don’t always find the replacements for the good old plyr package as satisfying.

Here is an example of something I used to like to do with plyr. Don’t laugh!

Assume we have a number of text files, all in the same format, that we need to read and combine. This arises naturally if you run some kind of analysis where the dataset gets split into chunks, like in genetics, where chunks might be chromosomes.

## Generate vector of file names
files <- paste("data/chromosome", 1:20, ".txt", sep = "")

library(plyr)

This gives us one big data frame, containing the rows from all those files.

If we want to move on from plyr, what are our options?

We can go old school with base R functions lapply and Reduce.

genome <- Reduce(rbind, chromosomes)

Here, we first let lapply read each file and store it in a list. Then we let Reduce fold the list with rbind, which binds the data frames in the list together, one below the other.

If that didn’t make sense, here it is again: lapply maps a function to each element of a vector or list, collecting the results in a list. Reduce folds the elements in a list together, using a function that takes in two arguments. The first argument will be the results it’s accumulated so far, and the second argument will be the next element of the list.

In the end, this leaves us, as with ldply, with one big data frame.

We can also use purrr‘s map_dfr. This seems to be the contemporary most elegant solution:

library(purrr)

map_dfr, like good old ldply will map over a vector or list, and collect resulting data frames. The ”r” in the name means adding the next data frame as rows. There is also a ”c” version (map_dfc) for adding as columns.

# Using R: Animal model with hglm and Stan (with Cholesky trick)

A few weeks ago I posted about fitting the quantitative genetic animal model with MCMCglmm and R-INLA. Since then, I listened to a talk by Lars Rönnegård, one of the creators of the hglm package, and this paper was published in GSE about animal models in Stan.

# hglm

The hglm package fits hierarchical generalised linear models. That includes the animal model with pedigree or genomic relatedness. Hierarchical generalised linear models also allow you to model the dispersion of random effects, which lets you do tricks like variance QTL mapping (Rönnegård & Valdar 2011), breeding values for variances (Rönnegård et al. 2010) or genomic prediction models with predictors of marker variance (Mouresan, Selle & Rönnegård 2019). But let’s not get ahead of ourselves. How do we fit an animal model?

Here is the matrix formulation of the animal model that we skim through in every paper. It’s in this post because we will use the design matrix interface to hglm, which needs us to give it these matrices (this is not a paper, so we’re not legally obliged to include it):

$\mathbf{y} = \mu + \mathbf{X} \mathbf{b} + \mathbf{Z} \mathbf{a} + \mathbf{e}$

The terms are the the trait value, intercept, fixed coefficients and their design matrix, genetic coefficients and their design matrix, and the residual. The design matrix Z will contain one row and column for each individual, with a 1 to indicate its position in the phenotype table and pedigree and the rest zeros. If we sort our files, it’s an identity matrix.

The trick with the genetic coefficients is that they’re correlated, with a specific known correlation structure that we know from the pedigree (or in genomic models, from markers). It turns out (Lee, Nelder & Pawitan 2017, chapter 8) that you can change the Z matrix around so that it lets you fit the model with an identity covariance matrix, while still accounting for the correlations between relatives. You replace the random effects for relatedness with some transformed random effects that capture the same structure. One way to do this is with Cholesky decomposition.

$\mathbf{Z_{fudged}} = \mathbf{Z_0} \mathbf{L}$

As an example of what the Cholesky decomposition does, here is slice of the additive relationship matrix of 100 simulated individuals (the last generation of one replicate of these simulations) and the resulting matrix from Cholesky decomposition.

$\mathbf{a} \sim N(0, \mathbf{A} \sigma)$

We can fit

$\mathbf{a_{fudged}} \sim N(0, \mathbf{I} \sigma)$

This lets us fit the animal model with hglm, by putting in a modified Z matrix.

Assuming we have data frames with a pedigree and a phenotype (like, again, from these simulations):

library(AGHmatrix)
library(hglm)

A  <- Amatrix(ped)

Z0  <- diag(1000)
L <- t(chol(A))
Z  <- Z0 %*% L
X <- model.matrix(~1, pheno)

model <- hglm(y = pheno$pheno, X = X, Z = Z, conv = 1e-8) est_h2 <- model$varRanef / (model$varRanef + model$varFix)

(I found the recommendation to decrease the convergence criterion from the default for animal models in a YouTube video by Xia Chen.)

# Stan

When we turn to Stan, we will meet the Cholesky trick again. Stan is a software for Markov Chain Monte Carlo, built to fit hierarchical linear models, and related high-dimensional models, more effectively than other sampling strategies (like Gibbs). rstan is a helpful package for running Stan from within R.

Nishio & Arakawa (2019) recently published a Stan script to fit an animal model, comparing Stan to a Gibbs sampler (and a related MCMC sampler that they also didn’t publish the code for). If we look into their Stan model code, they also do a Cholesky decomposition to be able to use an identity matrix for the variance.

First, they decompose the additive relationship matrix that the program takes in:

transformed data{
matrix[K,K] LA;
LA = cholesky_decompose(A);
}

And then, they express the model like this:

vector[N] mu;
vector[K] a;
a_decompose ~ normal(0, 1);
a = sigma_G * (LA * a_decompose);
mu = X * b + Z * a;
Y ~ normal(mu, sigma_R);

We can add this line to the generated quantities block of the Stan program to get heritability estimates directly:

real h2;
h2 = sigma_U / (sigma_U + sigma_E)

Here, we’ve saved their model to a stan file, and now we can run it from R:

pheno$scaled_pheno <- as.vector(scale(pheno$pheno))

model_stan <- stan(file = "nishio_arakawa.stan",
data = list(Y = pheno$scaled_pheno, X = X, A = A, Z = Z0, J = 1, K = 1000, N = 1000)) est_h2_stan <- summary(model_stan, pars = "h2")$summary

Important note that I always forget: It's important to scale your traits before you run this model. If not, the priors might be all wrong.

The last line pulls out the summary for the heritability parameter (that we added above). This gives us an estimate and an interval.

The paper also contains this entertaining passage about performance, which reads as if it was a response to a comment, actual or anticipated:

R language is highly extensible and provides a myriad of statistical and graphical techniques. However, R language has poor computation time compared to Fortran, which is especially well suited to numeric computation and scientific computing. In the present study, we developed the programs for GS and HMC in R but did not examine computation time; instead, we focused on examining the performance of estimating genetic parameters and breeding values.

Yes, two of their samplers (Gibbs and HMC) were written in R, but the one they end up advocating (and the one used above), is in Stan. Stan code gets translated into C++ and then compiled to machine code.

# Stan with brms

If rstan lets us run Stan code from R and examine the output, brms lets us write down models in relatively straightforward R syntax. It’s like the MCMCglmm of the Stan world. We can fit an animal model with brms too, by directly plugging in the relationship matrix:

model_brms <- brm(scaled_pheno ~ 1 + (1|animal),
data = pheno,
family = gaussian(),
cov_ranef = list(animal = A),
chains = 4,
cores = 1,
iter = 2000)

Then, we can pull out the posterior samples for the variability, here expressed as standard errors, compute the heritability and then get the estimates (and interval, if we want):

posterior_brms <- posterior_samples(model_brms,
pars = c("sd_animal", "sigma"))

h2_brms  <- posterior_brms[,1]^2 /
(posterior_brms[,1]^2 + posterior_brms[,2]^2)

est_h2_brms <- mean(h2_brms)

(Code is on GitHub: both for the graphs above, and the models.)

# Using R: Installing GenABEL and RepeatABEL

GenABEL is an R package for performing genome-wide association with linear mixed models and a genomic relationship matrix. RepeatABEL is a package for such genome-wide association studies that also need repeated measures.

Unfortunately, since 2018, GenABEL is not available on CRAN anymore, because of failed checks that were not fixed. (Checks are archived on CRAN, but this means very little to me.) As a consequence, RepeatABEL is also missing.

Fair enough, the GenABEL creators probably aren’t paid to maintain old software. It is a bit tragic, however, to think that in 2016, GenABEL was supposed to be the core of a community project to develop a suite of genomic analysis packages, two years before it was taken of CRAN:

The original publication of the GenABEL package for statistical analysis of genotype data has led to the evolution of a community which we now call the GenABEL project, which brings together scientists, software developers and end users with the central goal of making statistical genomics work by openly developing and subsequently implementing statistical models into user-friendly software.

The project has benefited from an open development model, facilitating communication and code sharing between the parties involved. The use of a free software licence for the tools in the GenABEL suite promotes quick uptake and widespread dissemination of new methodologies and tools. Moreover, public access to the source code is an important ingredient for active participation by people from outside the core development team and is paramount for reproducible research. Feedback from end users is actively encouraged through a web forum, which steadily grows into a knowledge base with a multitude of answered questions. Furthermore, our open development process has resulted in transparent development of methods and software, including public code review, a large fraction of bugs being submitted by members of the community, and quick incorporation of bug fixes.

I have no special insight about the circumstances here, but obviously the situation is far from ideal. You can still use the packages, though, with a little more effort to install. Who knows how long that will be the case, though. In a complex web of dependencies like the R package ecosystem, an unmaintained package probably won’t last.

GenABEL can probably be replaced by something like GEMMA. It does mixed models for GWAS, and while it isn’t an R package, it’s probably about as convenient. However, I don’t know of a good alternative to RepeatABEL.

These are the steps to install GenABEL and RepeatABEL from archives:

1. We go to the CRAN archive and get the tarballs for GenABEL, GenABEL.data which it needs, and RepeatABEL.
curl -O https://cran.r-project.org/src/contrib/Archive/GenABEL/GenABEL_1.8-0.tar.gz
curl -O https://cran.r-project.org/src/contrib/Archive/GenABEL.data/GenABEL.data_1.0.0.tar.gz
curl -O https://cran.r-project.org/src/contrib/Archive/RepeatABEL/RepeatABEL_1.1.tar.gz

We don’t need to unpack them.

2. Install GenABEL.data and GenABEL from a local source. Inside R, we can use install.packages, using the files we’ve just downloaded instead of the online repository.
install.packages(c("GenABEL.data_1.0.0.tar.gz", "GenABEL_1.8-0.tar.gz"), repos = NULL)

3. To install RepeatABEL, we first need hglm, which we can get from CRAN. After that has finished, we install RepeatABEL, again from local source:
install.packages("hglm")
install.packages("RepeatABEL_1.1.tar.gz", repos = NULL)

This worked on R version 3.6.1 running on Ubuntu 16.04, and also on Mac OS X.

Literature

Karssen, Lennart C., Cornelia M. van Duijn, and Yurii S. Aulchenko. ”The GenABEL Project for statistical genomics.” F1000Research 5 (2016).

# Using R: From gather to pivot

Since version 1.0.0, released in September, the tidyr package has a new replacement for the gather/spread pair of functions, called pivot_longer/pivot_wider. (See the blog post about the release. It can do a lot of cool things.) Just what we needed, another pair of names for melt/cast, right?

Yes, I feel like this might just be what we need!

My journey started with reshape2, and after a bit of confusion, I internalised the logic of melt/cast. Look at this beauty:

library(reshape2)
fake_data <- data.frame(id = 1:20,
variable1 = runif(20, 0, 1),
variable2 = rnorm(20))
melted <- melt(fake_data, id.vars = "id")

This turns a data frame that looks like this …

id  variable1   variable2
1  1 0.10287737 -0.21740708
2  2 0.04219212  1.36050438
3  3 0.78119150  0.09808656
4  4 0.44304613  0.48306900
5  5 0.30720140 -0.45028374
6  6 0.42387957  1.16875579

… into a data frame that looks like this:

id  variable      value
1  1 variable1 0.10287737
2  2 variable1 0.04219212
3  3 variable1 0.78119150
4  4 variable1 0.44304613
5  5 variable1 0.30720140
6  6 variable1 0.42387957

This is extremely useful. Among other things it comes up all the time when using ggplot2.

Then, as I detailed in a post two years ago, I switched to tidyr as that became the replacement package. ”Gather” and ”spread” made no sense to me as descriptions of operations on a data frame. To be fair, ”melt” and ”cast” felt equally arbitrary, but by that time I was used to them. Getting the logic of the arguments, the order, what needed quotation marks and not, took some staring at examples and a fair bit of trial and error.

Here are some examples. If you’re not used to these functions, just skip ahead, because you will want to learn the pivot functions instead!

library(tidyr)
melted <- gather(fake_data, variable, value, 2:3)

## Column names instead of indices
melted <- gather(fake_data, variable, value, variable1, variable2)

melted <- gather(fake_data, variable, value, -1)

## Excluding using column name
melted <- gather(fake_data, variable, value, -id)

Enter the pivot functions. Now, I have never used pivot tables in any spreadsheet software, and in fact, the best way to explain them to me was to tell me that they were like melt/cast (and summarise) … But pivot_longer/pivot_wider are friendlier on first use than gather/spread. The naming of both the functions themselves and their arguments feel like a definite improvement.

long <- pivot_longer(fake_data, 2:3,
names_to = "variable",
values_to = "value")
# A tibble: 40 x 3
id variable    value

1     1 variable1  0.103
2     1 variable2 -0.217
3     2 variable1  0.0422
4     2 variable2  1.36
5     3 variable1  0.781
6     3 variable2  0.0981
7     4 variable1  0.443
8     4 variable2  0.483
9     5 variable1  0.307
10     5 variable2 -0.450
# … with 30 more rows

We tell it into what column we want the names to go, and into what column we want the values to go. The function is named after a verb that is associated with moving things about in tables all the way to matrix algebra, followed by an adjective (in my opinion the most descriptive, out of the alternatives) that describes the layout of the data that we want.

Or, to switch us back again:

wide <- pivot_wider(long,
names_from = "variable",
values_from = "value")
# A tibble: 20 x 3
id variable1 variable2

1     1    0.103    -0.217
2     2    0.0422    1.36
3     3    0.781     0.0981
4     4    0.443     0.483
5     5    0.307    -0.450
6     6    0.424     1.17

Here, instead, we tell it where we want the new column names taken from and where we want the new values taken from. None of this is self-explanatory, by any means, but they are thoughtful choices that make a lot of sense.

We’ll see what I think after trying to explain them to beginners a few times, and after I’ve fought warning messages involving list columns for some time, but so far: well done, tidyr developers!

# Using R: Animal model with simulated data

Last week’s post just happened to use MCMCglmm as an example of an R package that can get confused by tibble-style data frames. To make that example, I simulated some pedigree and trait data. Just for fun, let’s look at the simulation code, and use MCMCglmm and AnimalINLA to get heritability estimates.

First, here is some AlphaSimR code that creates a small random mating population, and collects trait and pedigree:

library(AlphaSimR)

## Founder population
FOUNDERPOP <- runMacs(nInd = 100,
nChr = 20,
inbred = FALSE,
species = "GENERIC")

## Simulation parameters
SIMPARAM <- SimParam$new(FOUNDERPOP) SIMPARAM$addTraitA(nQtlPerChr = 100,
mean = 100,
var = 10)
SIMPARAM$setGender("yes_sys") SIMPARAM$setVarE(h2 = 0.3)

## Random mating for 9 more generations
generations <- vector(mode = "list", length = 10)
generations[[1]] <- newPop(FOUNDERPOP,
simParam = SIMPARAM)

for (gen in 2:10) {

generations[[gen]] <- randCross(generations[[gen - 1]],
nCrosses = 10,
nProgeny = 10,
simParam = SIMPARAM)

}

## Put them all together
combined <- Reduce(c, generations)

## Extract phentoypes
pheno <- data.frame(animal = combined@id,
pheno = combined@pheno[,1])

## Extract pedigree
ped <- data.frame(id = combined@id,
dam = combined@mother,
sire =combined@father)
ped$dam[ped$dam == 0] <- NA
ped$sire[ped$sire == 0] <- NA

## Write out the files
write.csv(pheno,
file = "sim_pheno.csv",
row.names = FALSE,
quote = FALSE)

write.csv(ped,
file = "sim_ped.csv",
row.names = FALSE,
quote = FALSE)

In turn, we:

1. Set up a founder population with a AlphaSimR’s generic livestock-like population history, and 20 chromosomes.
2. Choose simulation parameters: we have an organism with separate sexes, a quantitative trait with an additive polygenic architecture, and we want an environmental variance to give us a heritability of 0.3.
3. We store away the founders as the first generation, then run a loop to give us nine additional generations of random mating.
4. Combine the resulting generations into one population.
5. Extract phenotypes and pedigree into their own data frames.
6. Optionally, save the latter data frames to files (for the last post).

Now that we have some data, we can fit a quantitative genetic pedigree model (”animal model”) to estimate genetic parameters. We’re going to try two methods to fit it: Markov Chain Monte Carlo and (the unfortunately named) Integrated Nested Laplace Approximation. MCMC explores the posterior distribution by sampling; I’m not sure where I heard it described as ”exploring a mountain by random teleportation”. INLA makes approximations to the posterior that can be integrated numerically; I guess it’s more like building a sculpture of the mountain.

First, a Gaussian animal model in MCMCglmm:

library(MCMCglmm)

## Gamma priors for variances
prior_gamma <- list(R = list(V = 1, nu = 1),
G = list(G1 = list(V = 1, nu = 1)))

## Fit the model
model_mcmc  <- MCMCglmm(scaled ~ 1,
random = ~ animal,
family = "gaussian",
prior = prior_gamma,
pedigree = ped,
data = pheno,
nitt = 100000,
burnin = 10000,
thin = 10)

## Calculate heritability for heritability from variance components
h2_mcmc_object  <- model_mcmc$VCV[, "animal"] / (model_mcmc$VCV[, "animal"] + model_mcmc$VCV[, "units"]) ## Summarise results from that posterior h2_mcmc <- data.frame(mean = mean(h2_mcmc_object), lower = quantile(h2_mcmc_object, 0.025), upper = quantile(h2_mcmc_object, 0.975), method = "MCMC", stringsAsFactors = FALSE) And here is a similar animal model in AnimalINLA: library(AnimalINLA) ## Format pedigree to AnimalINLA's tastes ped_inla <- ped ped_inla$id  <- as.numeric(ped_inla$id) ped_inla$dam  <- as.numeric(ped_inla$dam) ped_inla$dam[is.na(ped_inla$dam)] <- 0 ped_inla$sire  <- as.numeric(ped_inla$sire) ped_inla$sire[is.na(ped_inla$sire)] <- 0 ## Turn to relationship matrix A_inv <- compute.Ainverse(ped_inla) ## Fit the model model_inla <- animal.inla(response = scaled, genetic = "animal", Ainverse = A_inv, type.data = "gaussian", data = pheno, verbose = TRUE) ## Pull out summaries from the model object summary_inla <- summary(model_inla) ## Summarise results h2_inla <- data.frame(mean = summary_inla$summary.hyperparam["Heritability", "mean"],
lower = summary_inla$summary.hyperparam["Heritability", "0.025quant"], upper = summary_inla$summary.hyperparam["Heritability", "0.975quant"],
method = "INLA",
stringsAsFactors = FALSE)

If we wrap this all in a loop, we can see how the estimation methods do on replicate data (full script on GitHub). Here are estimates and intervals from ten replicates (black dots show the actual heritability in the first generation):

As you can see, the MCMC and INLA estimates agree pretty well and mostly hit the mark. In the one replicate dataset where they falter, they falter together.

# Using R: When weird errors occur in packages that used to work, check that you’re not feeding them a tibble

There are some things that are great about the tidyverse family of R packages and the style they encourage. There are also a few gotchas. Here’s a reminder to myself about this phenomenon: tidyverse-style data frames (”tibbles”) do not simplify to vectors upon extracting a single column with hard bracket indexing.

Because some packages rely on specific data.frame behaviours that tibbles don’t show, functions that work nicely with data frames, and normally have nice interpretable error messages, may mysteriously collapse in all kinds of ways when fed a tibble.

Here’s an example with MCMCglmm. This is not to pick on MCMCglmm; it just happened to be one of the handful of packages where I’ve run into this issue. Here, we use readr, the tidyverse alternative to the read.table family of functions to read some simulated data. The base function is called read.csv, and the readr alternative is read_csv.

Reading in tabular data is a surprisingly hard problem: tables can be formatted in any variety of obnoxious ways, and the reading function also needs to be fast enough to deal with large files. Using readr certainly isn’t always painless, but it reduces the friction a lot compared to read.table. One of the improvements is that read_csv will return a data.frame with the class tbl_df, affectionately called ”tibble

After reading the data, we centre and scale the trait, set up some priors and run an animal model. Unfortunately, MCMCglmm will choke on the tibble, and deliver a confusing error message.

library(MCMCglmm)

pheno$scaled <- scale(pheno$pheno)

prior_gamma <- list(R = list(V = 1, nu = 1),
G = list(G1 = list(V = 1, nu = 1)))

model <- MCMCglmm(scaled ~ 1,
random = ~ animal,
family = "gaussian",
prior = prior_gamma,
pedigree = ped,
data = pheno,
nitt = 100000,
burnin = 10000,
thin = 10)

Error in inverseA(pedigree = pedigree, scale = scale, nodes = nodes) :
individuals appearing as dams but not in pedigree
In if (attr(pedigree, "class") == "phylo") { :
the condition has length > 1 and only the first element will be used

In this pedigree, it is not the case that there are individuals appearing as dams but not listed. If we turn the data and pedigree into vanilla data frames instead, it will work:

ped <- as.data.frame(ped)
pheno <- as.data.frame(pheno)

model <- MCMCglmm(scaled ~ 1,
random = ~ animal,
family = "gaussian",
prior = prior_gamma,
pedigree = ped,
data = pheno,
nitt = 100000,
burnin = 10000,
thin = 10)

MCMC iteration = 0

MCMC iteration = 1000

MCMC iteration = 2000

# ‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

A few weeks ago, I gave a talk at the Edinburgh R users group EdinbR on the RAGE paper. Since this is an R meetup, the talk concentrated on the mechanics of genetic data simulation and with the paper as a case study. I showed off some of what Chris Gaynor’s AlphaSimR can do, and how we built on that to make the specifics of this simulation study. The slides are on the EdinbR Github.

Genetic simulation is useful for all kinds of things. Sure, they’re only as good as the theory that underpins them, but the willingness to try things out in simulations is one of the things I always liked about breeding research.

This is my description of the logic of genetic simulation: we think of the genome as a large table of genotypes, drawn from some distribution of allele frequencies.

To make an utterly minimal simulation, we could draw allele frequencies from some distribution (like a Beta distribution), and then draw the genotypes from a binomial distribution. Done!

However, there is a ton of nuance we would like to have: chromosomes, linkage between variants, sexes, mating, selection …

AlphaSimR addresses all of this, and allows you to throw individuals and populations around to build pretty complicated designs. Here is the small example simulation I used in the talk.

library(AlphaSimR)
library(ggplot2)

## Generate founder chromsomes

FOUNDERPOP <- runMacs(nInd = 1000,
nChr = 10,
segSites = 5000,
inbred = FALSE,
species = "GENERIC")

## Simulation parameters

SIMPARAM <- SimParam$new(FOUNDERPOP) SIMPARAM$addTraitA(nQtlPerChr = 100,
mean = 100,
var = 10)
SIMPARAM$addSnpChip(nSnpPerChr = 1000) SIMPARAM$setGender("yes_sys")

## Founding population

pop <- newPop(FOUNDERPOP,
simParam = SIMPARAM)

pop <- setPheno(pop,
varE = 20,
simParam = SIMPARAM)

## Breeding

print("Breeding")
breeding <- vector(length = 11, mode = "list")
breeding[[1]] <- pop

for (i in 2:11) {
print(i)
sires <- selectInd(pop = breeding[[i - 1]],
gender = "M",
nInd = 25,
trait = 1,
use = "pheno",
simParam = SIMPARAM)

dams <- selectInd(pop = breeding[[i - 1]],
nInd = 500,
gender = "F",
trait = 1,
use = "pheno",
simParam = SIMPARAM)

breeding[[i]] <- randCross2(males = sires,
females = dams,
nCrosses = 500,
nProgeny = 10,
simParam = SIMPARAM)
breeding[[i]] <- setPheno(breeding[[i]],
varE = 20,
simParam = SIMPARAM)
}

## Look at genetic gain and shift in causative variant allele frequency

mean_g <- unlist(lapply(breeding, meanG))
sd_g <- sqrt(unlist(lapply(breeding, varG)))

plot_gain <- qplot(x = 1:11,
y = mean_g,
ymin = mean_g - sd_g,
ymax = mean_g + sd_g,
geom = "pointrange",
main = "Genetic mean and standard deviation",
xlab = "Generation", ylab = "Genetic mean")

start_geno <- pullQtlGeno(breeding[[1]], simParam = SIMPARAM)
start_freq <- colSums(start_geno)/(2 * nrow(start_geno))

end_geno <- pullQtlGeno(breeding[[11]], simParam = SIMPARAM)
end_freq <- colSums(end_geno)/(2 * nrow(end_geno))

plot_freq_before <- qplot(start_freq, main = "Causative variant frequency before")
plot_freq_after <- qplot(end_freq, main = "Causative variant frequency after")

This code builds a small livestock population, breeds it for ten generations, and looks at the resulting selection response in the form of a shift of the genetic mean, and the changes in the underlying distribution of causative variants. Here are the resulting plots:

# Using R: plotting the genome on a line

Imagine you want to make a Manhattan-style plot or anything else where you want a series of intervals laid out on one axis after one another. If it’s actually a Manhattan plot you may have a friendly R package that does it for you, but here is how to cobble the plot together ourselves with ggplot2.

We start by making some fake data. Here, we have three contigs (this could be your chromosomes, your genomic intervals or whatever) divided into one, two and three windows, respectively. Each window has a value that we’ll put on the y-axis.

library(dplyr)
library(ggplot2)

data <- data_frame(contig = c("a", "a", "a", "b", "b", "c"),
start = c(0, 500, 1000, 0, 500, 0),
end = c(500, 1000, 1500, 500, 1000, 200),
value = c(0.5, 0.2, 0.4, 0.5, 0.3, 0.1))

We will need to know how long each contig is. In this case, if we assume that the windows cover the whole thing, we can get this from the data. If not, say if the windows don’t go up to the end of the chromosome, we will have to get this data from elsewhere (often some genome assembly metadata). This is also where we can decide in what order we want the contigs.

contig_lengths <- summarise(group_by(data, contig), length = max(end))

Now, we need to transform the coordinates on each contig to coordinates on our new axis, where we lay the contings after one another. What we need to do is to add an offset to each point, where the offset is the sum of the lengths of the contigs we’ve layed down before this one. We make a function that takes three arguments: two vectors containing the contig of each point and the position of each point, and also the table of lengths we just made.

flatten_coordinates <- function(contig, coord, contig_lengths) {
coord_flat <- coord
offset <- 0

for (contig_ix in 1:nrow(contig_lengths)) {
on_contig <- contig == contig_lengths$contig[contig_ix] coord_flat[on_contig] <- coord[on_contig] + offset offset <- offset + contig_lengths$length[contig_ix]
}

coord_flat
}

Now, we use this to transform the start and end of each window. We also transform the vector of the length of the contigs, so we can use it to add vertical lines between the contigs.

data$start_flat <- flatten_coordinates(data$contig,
data$start, contig_lengths) data$end_flat <- flatten_coordinates(data$contig, data$end,
contig_lengths)
contig_lengths$length_flat <- flatten_coordinates(contig_lengths$contig,
contig_lengths$length, contig_lengths) It would be nice to label the x-axis with contig names. One way to do this is to take the coordinates we just made for the vertical lines, add a zero, and shift them one position, like so: axis_coord <- c(0, contig_lengths$length_flat[-nrow(contig_lengths)])

Now it’s time to plot! We add one layer of points for the values on the y-axis, where each point is centered on the middle of the window, followed by a layer of vertical lines at the borders between contigs. Finally, we add our custom x-axis, and also some window dressing.

plot_genome <- ggplot() +
geom_point(aes(x = (start_flat + end_flat)/2,
y = value),
data = data) +
geom_vline(aes(xintercept = length_flat),
data = contig_lengths) +
scale_x_continuous(breaks = axis_coord,
labels = contig_lengths$contig, limits = c(0, max(contig_lengths$length_flat))) +
xlab("Contig") + ylim(0, 1) + theme_bw()

And this is what we get:

I’m sure your plot will look more impressive, but you get the idea.

# Showing a difference in means between two groups

Visualising a difference in mean between two groups isn’t as straightforward as it should. After all, it’s probably the most common quantitative analysis in science. There are two obvious options: we can either plot the data from the two groups separately, or we can show the estimate of the difference with an interval around it.

A swarm of dots is good because it shows the data, but it obscures the difference, and has no easy way to show the uncertainty in the difference. And, unfortunately, the uncertainty of the means within groups is not the same thing as the uncertainty of the difference between means.

An interval around the difference is good because it makes the plausible range of the difference very clear, but it obscures the range and distribution of the data.

Let’s simulate some fake data and look at these plots:

library(broom)
library(egg)
library(ggplot2)

data <- data.frame(group = rep(0:1, 20))
data$response <- 4 + data$group * 2 + rnorm(20)

We start by making two clouds of dots. Then we estimate the difference with a simple linear model, and plot the difference surrounded by an approximate confidence interval. We can plot them separately or the egg package to put them together in two neat panels:

plot_points <- ggplot() +
geom_jitter(aes(x = factor(group), y = response),
data = data,
width = 0.1) +
xlab("Group") +
ylab("Response") +
theme_bw()

model <- lm(response ~ factor(group), data = data)
result <- tidy(model)

plot_difference <- ggplot() +
geom_pointrange(aes(x = term, y = estimate,
ymin = estimate - 2 * std.error,
ymax = estimate + 2 * std.error),
data = result) +
ylim(-5, 5) +
ylab("Value") +
xlab("Coefficient") +
coord_flip() +
theme_bw()

plot_combined <- ggarrange(plot_points,
plot_difference,
heights = c(2, 1))

Here it is:

But I had another idea. I am not sure whether it’s a good idea or not, but here it is: We put in the dots, and then we put in two lines that represent the smallest and the greatest difference from the approximate confidence interval:

offset <- (2 * result$estimate[1] + result$estimate[2])/2
shortest <- result$estimate[2] - 2 * result$std.error[2]
longest <- result$estimate[2] + 2 * result$std.error[2]

plot_both <- plot_points +
geom_linerange(aes(ymin = offset - shortest/2,
ymax= offset + shortest/2,
x = 1.25)) +
geom_linerange(aes(ymin = offset - longest/2,
ymax= offset + longest/2,
x = 1.75)) +
theme_bw()

I think it looks pretty good, but it’s not self-explanatory, and I’m not sure whether it is misleading in any way.