Using R: Animal model with simulated data

Last week’s post just happened to use MCMCglmm as an example of an R package that can get confused by tibble-style data frames. To make that example, I simulated some pedigree and trait data. Just for fun, let’s look at the simulation code, and use MCMCglmm and AnimalINLA to get heritability estimates.

First, here is some AlphaSimR code that creates a small random mating population and collects trait and pedigree data:

library(AlphaSimR)

## Founder population
FOUNDERPOP <- runMacs(nInd = 100,
                      nChr = 20,
                      inbred = FALSE,
                      species = "GENERIC")

## Simulation parameters 
SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100,
                   mean = 100,
                   var = 10)
SIMPARAM$setGender("yes_sys")
SIMPARAM$setVarE(h2 = 0.3)
 
## Random mating for 9 more generations
generations <- vector(mode = "list", length = 10) 
generations[[1]] <- newPop(FOUNDERPOP,
                           simParam = SIMPARAM)


for (gen in 2:10) {

    generations[[gen]] <- randCross(generations[[gen - 1]],
                                    nCrosses = 10,
                                    nProgeny = 10,
                                    simParam = SIMPARAM)

}

## Put them all together
combined <- Reduce(c, generations)


## Extract phenotypes
pheno <- data.frame(animal = combined@id,
                    pheno = combined@pheno[,1])

## Extract pedigree
ped <- data.frame(id = combined@id,
                  dam = combined@mother,
                  sire = combined@father)
ped$dam[ped$dam == 0] <- NA
ped$sire[ped$sire == 0] <- NA

## Write out the files
write.csv(pheno,
          file = "sim_pheno.csv",
          row.names = FALSE,
          quote = FALSE)

write.csv(ped,
          file = "sim_ped.csv",
          row.names = FALSE,
          quote = FALSE)

In turn, we:

  1. Set up a founder population with AlphaSimR’s generic livestock-like population history, and 20 chromosomes.
  2. Choose simulation parameters: we have an organism with separate sexes, a quantitative trait with an additive polygenic architecture, and we want an environmental variance to give us a heritability of 0.3.
  3. Store away the founders as the first generation, then run a loop to give us nine additional generations of random mating.
  4. Combine the resulting generations into one population.
  5. Extract phenotypes and pedigree into their own data frames.
  6. Optionally, save the latter data frames to files (for the last post).

Now that we have some data, we can fit a quantitative genetic pedigree model (”animal model”) to estimate genetic parameters. We’re going to try two methods to fit it: Markov Chain Monte Carlo and (the unfortunately named) Integrated Nested Laplace Approximation. MCMC explores the posterior distribution by sampling; I’m not sure where I heard it described as ”exploring a mountain by random teleportation”. INLA makes approximations to the posterior that can be integrated numerically; I guess it’s more like building a sculpture of the mountain.

First, a Gaussian animal model in MCMCglmm:

library(MCMCglmm)

## Inverse-Gamma priors for variances
prior_gamma <- list(R = list(V = 1, nu = 1),
                    G = list(G1 = list(V = 1, nu = 1)))
    
## Centre and scale the trait
pheno$scaled <- scale(pheno$pheno)

## Fit the model
model_mcmc  <- MCMCglmm(scaled ~ 1,
                        random = ~ animal,
                        family = "gaussian",
                        prior = prior_gamma,
                        pedigree = ped,
                        data = pheno,
                        nitt = 100000,
                        burnin = 10000,
                        thin = 10)

## Calculate heritability from variance components
h2_mcmc_object  <- model_mcmc$VCV[, "animal"] /
    (model_mcmc$VCV[, "animal"] + model_mcmc$VCV[, "units"])
 
## Summarise results from that posterior
h2_mcmc  <- data.frame(mean = mean(h2_mcmc_object),
                       lower = quantile(h2_mcmc_object, 0.025),
                       upper = quantile(h2_mcmc_object, 0.975),
                       method = "MCMC",
                       stringsAsFactors = FALSE)

And here is a similar animal model in AnimalINLA:

library(AnimalINLA)

## Format pedigree to AnimalINLA's tastes
ped_inla <- ped
ped_inla$id  <- as.numeric(ped_inla$id)
ped_inla$dam  <- as.numeric(ped_inla$dam)
ped_inla$dam[is.na(ped_inla$dam)] <- 0
ped_inla$sire  <- as.numeric(ped_inla$sire)
ped_inla$sire[is.na(ped_inla$sire)] <- 0
    
## Compute the inverse of the relationship matrix
A_inv <- compute.Ainverse(ped_inla)
    
## Fit the model
model_inla  <- animal.inla(response = scaled,
                           genetic = "animal",
                           Ainverse = A_inv,
                           type.data = "gaussian",
                           data = pheno,
                           verbose = TRUE)

## Pull out summaries from the model object
summary_inla  <- summary(model_inla)

## Summarise results
h2_inla  <- data.frame(mean = summary_inla$summary.hyperparam["Heritability", "mean"],
                       lower = summary_inla$summary.hyperparam["Heritability", "0.025quant"],
                       upper = summary_inla$summary.hyperparam["Heritability", "0.975quant"],
                       method = "INLA",
                       stringsAsFactors = FALSE)

If we wrap this all in a loop, we can see how the estimation methods do on replicate data (full script on GitHub). A sketch of what that loop might look like follows below.
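Something like this, where simulate_data, fit_mcmc and fit_inla are hypothetical wrappers around the simulation and model-fitting code above (the names are mine, not from the actual script):

## Hypothetical sketch of the replication loop: simulate_data(),
## fit_mcmc() and fit_inla() stand in for the code above wrapped
## in functions, and are not part of the original script
results <- vector(mode = "list", length = 10)

for (replicate_ix in 1:10) {
    sim <- simulate_data()
    results[[replicate_ix]] <- rbind(fit_mcmc(sim$pheno, sim$ped),
                                     fit_inla(sim$pheno, sim$ped))
}

h2_estimates <- Reduce(rbind, results)

Here are estimates and intervals from ten replicates (black dots show the actual heritability in the first generation):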

As you can see, the MCMC and INLA estimates agree pretty well and mostly hit the mark. In the one replicate dataset where they falter, they falter together.

Using R: When weird errors occur in packages that used to work, check that you’re not feeding them a tibble

There are some things that are great about the tidyverse family of R packages and the style they encourage. There are also a few gotchas. Here’s a reminder to myself about this phenomenon: tidyverse-style data frames (”tibbles”) do not simplify to vectors upon extracting a single column with hard bracket indexing.

Because some packages rely on specific data.frame behaviours that tibbles don’t show, functions that work nicely with data frames, and normally have nice interpretable error messages, may mysteriously collapse in all kinds of ways when fed a tibble.
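To see the phenomenon in isolation (a quick demonstration, not part of the original example):

df <- data.frame(x = 1:3)
tbl <- tibble::as_tibble(df)

class(df[, "x"])   ## "integer": simplified to a vector
class(tbl[, "x"])  ## "tbl_df" "tbl" "data.frame": still a tibble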

Here’s an example with MCMCglmm. This is not to pick on MCMCglmm; it just happened to be one of the handful of packages where I’ve run into this issue. Here, we use readr, the tidyverse alternative to the read.table family of functions, to read some simulated data. The base function is called read.csv, and the readr alternative is read_csv.

Reading in tabular data is a surprisingly hard problem: tables can be formatted in any number of obnoxious ways, and the reading function also needs to be fast enough to deal with large files. Using readr certainly isn’t always painless, but it reduces the friction a lot compared to read.table. One of the improvements is that read_csv will return a data.frame with the class tbl_df, affectionately called a ”tibble”.

After reading the data, we centre and scale the trait, set up some priors and run an animal model. Unfortunately, MCMCglmm will choke on the tibble, and deliver a confusing error message.

library(MCMCglmm)
library(readr)

ped <- read_csv("sim_ped.csv")
pheno <- read_csv("sim_pheno.csv")

pheno$scaled <- scale(pheno$pheno)

prior_gamma <- list(R = list(V = 1, nu = 1),
                    G = list(G1 = list(V = 1, nu = 1)))

model <- MCMCglmm(scaled ~ 1,
                  random = ~ animal,
                  family = "gaussian",
                  prior = prior_gamma,
                  pedigree = ped,
                  data = pheno,
                  nitt = 100000,
                  burnin = 10000,
                  thin = 10)

Error in inverseA(pedigree = pedigree, scale = scale, nodes = nodes) : 
  individuals appearing as dams but not in pedigree
In addition: Warning message:
In if (attr(pedigree, "class") == "phylo") { :
  the condition has length > 1 and only the first element will be used

In this pedigree, it is not the case that there are individuals appearing as dams but not listed. If we turn the data and pedigree into vanilla data frames instead, it will work:

ped <- as.data.frame(ped)
pheno <- as.data.frame(pheno)

model <- MCMCglmm(scaled ~ 1,
                  random = ~ animal,
                  family = "gaussian",
                  prior = prior_gamma,
                  pedigree = ped,
                  data = pheno,
                  nitt = 100000,
                  burnin = 10000,
                  thin = 10)

                       MCMC iteration = 0

                       MCMC iteration = 1000

                       MCMC iteration = 2000

‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

A few weeks ago, I gave a talk at the Edinburgh R users group EdinbR on the RAGE paper. Since this is an R meetup, the talk concentrated on the mechanics of genetic data simulation, with the paper as a case study. I showed off some of what Chris Gaynor’s AlphaSimR can do, and how we built on that to make the specifics of this simulation study. The slides are on the EdinbR Github.

Genetic simulations are useful for all kinds of things. Sure, they’re only as good as the theory that underpins them, but the willingness to try things out in simulation is one of the things I always liked about breeding research.

This is my description of the logic of genetic simulation: we think of the genome as a large table of genotypes, drawn from some distribution of allele frequencies.

To make an utterly minimal simulation, we could draw allele frequencies from some distribution (like a Beta distribution), and then draw the genotypes from a binomial distribution. Done!
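For instance, a minimal sketch of that idea could look like this (the distribution parameters and sizes are arbitrary):

## Allele frequencies from a Beta distribution
n_ind <- 100
n_loci <- 1000
freq <- rbeta(n_loci, shape1 = 1/2, shape2 = 1/2)

## Genotypes (0, 1 or 2 copies) from a binomial distribution;
## rows are individuals and columns are loci
geno <- matrix(rbinom(n_ind * n_loci, size = 2, prob = rep(freq, each = n_ind)),
               nrow = n_ind)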

However, there is a ton of nuance we would like to have: chromosomes, linkage between variants, sexes, mating, selection …

AlphaSimR addresses all of this, and allows you to throw individuals and populations around to build pretty complicated designs. Here is the small example simulation I used in the talk.


library(AlphaSimR)
library(ggplot2)

## Generate founder chromosomes

FOUNDERPOP <- runMacs(nInd = 1000,
                      nChr = 10,
                      segSites = 5000,
                      inbred = FALSE,
                      species = "GENERIC")


## Simulation parameters

SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100,
                   mean = 100,
                   var = 10)
SIMPARAM$addSnpChip(nSnpPerChr = 1000)
SIMPARAM$setGender("yes_sys")


## Founding population

pop <- newPop(FOUNDERPOP,
              simParam = SIMPARAM)

pop <- setPheno(pop,
                varE = 20,
                simParam = SIMPARAM)


## Breeding

print("Breeding")
breeding <- vector(length = 11, mode = "list")
breeding[[1]] <- pop

for (i in 2:11) {
    print(i)
    sires <- selectInd(pop = breeding[[i - 1]],
                       gender = "M",
                       nInd = 25,
                       trait = 1,
                       use = "pheno",
                       simParam = SIMPARAM)

    dams <- selectInd(pop = breeding[[i - 1]],
                      nInd = 500,
                      gender = "F",
                      trait = 1,
                      use = "pheno",
                      simParam = SIMPARAM)

    breeding[[i]] <- randCross2(males = sires,
                                females = dams,
                                nCrosses = 500,
                                nProgeny = 10,
                                simParam = SIMPARAM)
    breeding[[i]] <- setPheno(breeding[[i]],
                              varE = 20,
                              simParam = SIMPARAM)
}



## Look at genetic gain and shift in causative variant allele frequency

mean_g <- unlist(lapply(breeding, meanG))
sd_g <- sqrt(unlist(lapply(breeding, varG)))

plot_gain <- qplot(x = 1:11,
                   y = mean_g,
                   ymin = mean_g - sd_g,
                   ymax = mean_g + sd_g,
                   geom = "pointrange",
                   main = "Genetic mean and standard deviation",
                   xlab = "Generation", ylab = "Genetic mean")

start_geno <- pullQtlGeno(breeding[[1]], simParam = SIMPARAM)
start_freq <- colSums(start_geno)/(2 * nrow(start_geno))

end_geno <- pullQtlGeno(breeding[[11]], simParam = SIMPARAM)
end_freq <- colSums(end_geno)/(2 * nrow(end_geno))

plot_freq_before <- qplot(start_freq, main = "Causative variant frequency before") 
plot_freq_after <- qplot(end_freq, main = "Causative variant frequency after") 

This code builds a small livestock population, breeds it for ten generations, and looks at the resulting selection response in the form of a shift of the genetic mean, and the changes in the underlying distribution of causative variants. Here are the resulting plots:

Using R: plotting the genome on a line

Imagine you want to make a Manhattan-style plot or anything else where you want a series of intervals laid out along one axis, one after another. If it’s actually a Manhattan plot you may have a friendly R package that does it for you, but here is how to cobble the plot together ourselves with ggplot2.

We start by making some fake data. Here, we have three contigs (this could be your chromosomes, your genomic intervals or whatever) divided into three, two and one windows, respectively. Each window has a value that we’ll put on the y-axis.

library(dplyr)
library(ggplot2)

data <- data_frame(contig = c("a", "a", "a", "b", "b", "c"),
                   start = c(0, 500, 1000, 0, 500, 0),
                   end = c(500, 1000, 1500, 500, 1000, 200),
                   value = c(0.5, 0.2, 0.4, 0.5, 0.3, 0.1))

We will need to know how long each contig is. In this case, if we assume that the windows cover the whole thing, we can get this from the data. If not, say if the windows don’t go up to the end of the chromosome, we will have to get this data from elsewhere (often some genome assembly metadata). This is also where we can decide in what order we want the contigs.

contig_lengths <- summarise(group_by(data, contig), length = max(end))
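If the lengths came from somewhere else, say assembly metadata, we could build the same kind of table by hand instead (these values are hypothetical):

contig_lengths <- data.frame(contig = c("a", "b", "c"),
                             length = c(1500, 1000, 200))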

Now, we need to transform the coordinates on each contig to coordinates on our new axis, where we lay the contigs after one another. What we need to do is to add an offset to each point, where the offset is the sum of the lengths of the contigs we’ve laid down before this one. We make a function that takes three arguments: two vectors containing the contig of each point and the position of each point, and also the table of lengths we just made.

flatten_coordinates <- function(contig, coord, contig_lengths) {
    coord_flat <- coord
    offset <- 0

    ## Walk through the contigs in order, shifting the points on each
    ## contig by the total length of the contigs laid down before it
    for (contig_ix in 1:nrow(contig_lengths)) {
        on_contig <- contig == contig_lengths$contig[contig_ix]
        coord_flat[on_contig] <- coord[on_contig] + offset
        offset <- offset + contig_lengths$length[contig_ix]
    }

    coord_flat
}

Now, we use this to transform the start and end of each window. We also transform the vector of contig lengths, so we can use it to add vertical lines between the contigs.

data$start_flat <- flatten_coordinates(data$contig,
                                       data$start,
                                       contig_lengths)
data$end_flat <- flatten_coordinates(data$contig,
                                     data$end,
                                     contig_lengths)
contig_lengths$length_flat <- flatten_coordinates(contig_lengths$contig,
                                                  contig_lengths$length,
                                                  contig_lengths)

It would be nice to label the x-axis with contig names. One way to do this is to take the coordinates we just made for the vertical lines, add a zero, and shift them one position, like so:

axis_coord <- c(0, contig_lengths$length_flat[-nrow(contig_lengths)])

Now it’s time to plot! We add one layer of points for the values on the y-axis, where each point is centered on the middle of the window, followed by a layer of vertical lines at the borders between contigs. Finally, we add our custom x-axis, and also some window dressing.

plot_genome <- ggplot() +
    geom_point(aes(x = (start_flat + end_flat)/2,
                   y = value),
               data = data) +
    geom_vline(aes(xintercept = length_flat),
               data = contig_lengths) +
    scale_x_continuous(breaks = axis_coord,
                       labels = contig_lengths$contig,
                       limits = c(0, max(contig_lengths$length_flat))) +
    xlab("Contig") + ylim(0, 1) + theme_bw()

And this is what we get:

I’m sure your plot will look more impressive, but you get the idea.

Showing a difference in means between two groups

Visualising a difference in mean between two groups isn’t as straightforward as it should be. After all, it’s probably the most common quantitative analysis in science. There are two obvious options: we can either plot the data from the two groups separately, or we can show the estimate of the difference with an interval around it.

A swarm of dots is good because it shows the data, but it obscures the difference, and has no easy way to show the uncertainty in the difference. And, unfortunately, the uncertainty of the means within groups is not the same thing as the uncertainty of the difference between means.
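As a quick arithmetic illustration of that last point (with made-up numbers): for independent groups, the standard error of the difference combines the standard errors of both means, and is therefore larger than either of them.

## Standard errors of two independent group means (made-up numbers)
se1 <- 0.3
se2 <- 0.4

## The standard error of their difference: sqrt(0.3^2 + 0.4^2) = 0.5
se_difference <- sqrt(se1^2 + se2^2)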

An interval around the difference is good because it makes the plausible range of the difference very clear, but it obscures the range and distribution of the data.

Let’s simulate some fake data and look at these plots:

library(broom)
library(egg)
library(ggplot2)

data <- data.frame(group = rep(0:1, 20))
data$response <- 4 + data$group * 2 + rnorm(nrow(data))

We start by making two clouds of dots. Then we estimate the difference with a simple linear model, and plot the difference surrounded by an approximate confidence interval. We can plot them separately, or use the egg package to put them together in two neat panels:

plot_points <- ggplot() +
    geom_jitter(aes(x = factor(group), y = response),
                data = data,
                width = 0.1) +
    xlab("Group") +
    ylab("Response") +
    theme_bw()

model <- lm(response ~ factor(group), data = data)
result <- tidy(model)

plot_difference <- ggplot() +
    geom_pointrange(aes(x = term, y = estimate,
                        ymin = estimate - 2 * std.error,
                        ymax = estimate + 2 * std.error),
                    data = result) +
    ylim(-5, 5) +
    ylab("Value") +
    xlab("Coefficient") +
    coord_flip() +
    theme_bw()

plot_combined <- ggarrange(plot_points,
                           plot_difference,
                           heights = c(2, 1))

Here it is:

But I had another idea. I am not sure whether it’s a good idea or not, but here it is: We put in the dots, and then we put in two lines that represent the smallest and the greatest difference from the approximate confidence interval:

offset <- (2 * result$estimate[1] + result$estimate[2])/2
shortest <- result$estimate[2] - 2 * result$std.error[2]
longest <- result$estimate[2] + 2 * result$std.error[2]

plot_both <- plot_points + 
    geom_linerange(aes(ymin = offset - shortest/2,
                       ymax = offset + shortest/2,
                       x = 1.25)) +
    geom_linerange(aes(ymin = offset - longest/2,
                       ymax = offset + longest/2,
                       x = 1.75)) +
    theme_bw()

I think it looks pretty good, but it’s not self-explanatory, and I’m not sure whether it is misleading in any way.

Using R: the best thing I’ve changed about my code in years

Hopefully, one’s coding habits are constantly improving. If you feel any doubt about yourself, I suggest looking back at something you wrote in 2011.

One thing I’ve changed recently that made my life so much better is a simple silly thing: meaningful names for index and counter variables.

Take a look at these two pieces of fake code, which both loop over a matrix of hypothetical data (say: genotypes) to pass each value to a function (that does something):

## First attempt
for (i in 1:ncol(geno)) {
   for (j in 1:nrow(geno)) {
        output[i,j] <- do_something(geno[j, i])
   }
}
## Second attempt
n_markers <- ncol(geno)
n_ind <- nrow(geno)
for (marker_ix in 1:n_markers) {
   for (individual_ix in 1:n_ind) {
        output[individual_ix, marker_ix] <-
            do_something(geno[individual_ix, marker_ix])
   }
}

Isn’t that much nicer? The second version explicitly states what is what: we are iterating over markers and individuals, where each row is an individual and each column a marker. It even helps us spot errors such as the one in the first version: there, geno[j, i] is indexed correctly, but the result is assigned to output[i, j], which is transposed. You would marvel at how many years it took me to realise that there is no law that says that the loop variable must be called i.

(Yes, nested for loops and hard bracket indexing look uglier than a split-apply-combine solution, and using an apply family function would do away with any risk of mixing up the indices. However, loops do look less arcane to the uninitiated, and sometimes, in more complicated cases, we really need that loop variable for something else.)
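For the record, the apply version might look something like this (a sketch, assuming do_something takes a single value):

## Apply do_something element-wise over both dimensions; the output
## keeps the same shape, individuals in rows and markers in columns
output <- apply(geno, c(1, 2), do_something)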

Using R: reshape2 to tidyr

Tidy data — it’s one of those terms that tend to confuse people, and certainly confused me. It’s Codd’s third normal form, but you can’t go around telling that to people and expect to be understood. One form is ”long”, the other is ”wide”. One form is ”melted”, another ”cast”. One form is ”gathered”, the other ”spread”. To make matters worse, I often botch the explanation and mix up at least two of the terms.

The word is also associated with the tidyverse suite of R packages in a somewhat loose way. But you don’t need to write in a tidyverse style (including the %>%s and all) to enjoy tidy data.

But Hadley Wickham’s definition is straightforward:

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In practice, I don’t think people always take their data frames all the way to tidy. For example, to make a scatterplot, it is convenient to keep a couple of variables as different columns. The key is that we need to move between different forms rapidly (brain time-rapidly, more than computer time-rapidly, I might add).

And not everything should be organized this way. If you’re a geneticist, genotypes are notoriously inconvenient in normalized form. Better keep that individual by marker matrix.

The first serious piece of R code I wrote for someone else was a function to turn data into long form for plotting. I suspect plotting is often the gateway to tidy data. The function was like what you’d expect from R code written by a beginner who comes from C-style languages: It reinvented the wheel, and I bet it had nested for loops, a bunch of hard bracket indices, and so on. Then I discovered reshape2.

library(reshape2)
fake_data <- data.frame(id = 1:20,
                        variable1 = runif(20, 0, 1),
                        variable2 = rnorm(20))
melted <- melt(fake_data, id.vars = "id")

The id.vars argument is to tell the function that the id column is the key, a column that tells us which individual each observation comes from. As the name suggests, id.vars can name multiple columns in a vector.
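For example, if the data also had a group column (hypothetical here, since fake_data only has id), we could keep both as keys:

## Hypothetical: keep both id and group with each observation
melted <- melt(fake_data, id.vars = c("id", "group"))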

So this is the data before:

  id   variable1    variable2
1  1 0.938173781  0.852098580
2  2 0.408216233  0.261269134
3  3 0.341325188  1.796235963
4  4 0.958889279 -0.356218000

And this is after. We go from 20 rows to 40: two variables times 20 individuals.

  id  variable       value
1  1 variable1 0.938173781
2  2 variable1 0.408216233
3  3 variable1 0.341325188
4  4 variable1 0.958889279

And now: tidyr. tidyr is the new tidyverse package for rearranging data like this.

The tidyr equivalent of the melt function is called gather. There are two important differences that messed with my mind at first.

The melt and gather functions take the opposite default assumption about what columns should be treated as keys and what columns should be treated as containing values. In melt, as we saw above, we need to list the keys to keep them with each observation. In gather, we need to list the value columns, and the rest will be treated as keys.

Also, the second and third arguments (and they would be the first and second if you piped something into it) are the variable names that will be used in the long form data. In this case, to get a data frame that looks exactly the same as the first, we will stick with ”variable” and ”value”.

Here are five different ways to get the same long form data frame as above:

library(tidyr)
melted <- gather(fake_data, variable, value, 2:3)

## Column names instead of indices
melted <- gather(fake_data, variable, value, variable1, variable2)

## Excluding instead of including
melted <- gather(fake_data, variable, value, -1)

## Excluding using column name
melted <- gather(fake_data, variable, value, -id)

## With pipe
melted <- fake_data %>% gather(variable, value, -id)

Usually, this is the transformation we need: wide to long. If we need to go the other way, we can use reshape2’s cast functions, or tidyr’s spread. This code recovers the original data frame:

## reshape2
dcast(melted, id ~ variable)

## tidyr
spread(melted, variable, value)