Virtual animal breeding journal club: ”Structural equation models to disentangle the biological relationship between microbiota and complex traits …”

The other day was the first Virtual breeding and genetics journal club organised by John Cole. This was the first online journal club I’ve attended (shocking, given how many video calls I’ve been on for other sciencey reasons), so I thought I’d write a little about it: both the format and the paper. You can look the slide deck from the journal club here (pptx file).

The medium

We used Zoom, and that seemed to work, as I’m sure anything else would, if everyone just mute their microphone when they aren’t speaking. As John said, the key feature of Zoom seems to be the ability for the host to mute everyone else. During the call, I think we were at most 29 or so people, but only a handful spoke. It will probably get more intense with the turn taking if more people want to speak.

The format

John started the journal club with a code of conduct, which I expect helped to set what I felt was a good atmosphere. In most journal clubs I’ve been in, I feel like the atmosphere has been pretty good, but I think we’ve all heard stories about hyper-critical and hostile journal clubs, and that doesn’t sound particularly fun or useful. On that note, one of the authors, Oscar González-Recio, was on the call and answered some questions.

The paper

Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

The authors measured methane emissions (by analysing breath with with an infrared gas monitor) and abundance of different microbes in the rumen (with Nanopore sequencing) from dairy cows. They genotyped the animals for relatedness.

They analysed the genetic relationship between breath methane and abundance of each taxon of microbe, individually, with either:

  • a bivariate animal model;
  • a structural equations model that allows for a causal effect of abundance on methane, capturing the assumption that the abundance of a taxon can affect the methane emission, but not the other way around.

They used them to estimate heritabilities of abundances and genetic correlations between methane and abundances, and in the case of the structural model: conditional on the assumed causal model, the effect of that taxon’s abundance on methane.

My thoughts

It’s cool how there’s a literature building up on genetic influences on the microbiome, with some consistency across studies. These intense high-tech studies on relatively few cattle might build up to finding new traits and proxies that can go into larger scale phenotyping for breeding.

As the title suggests, the paper advocates for using the structural equations model: ”Genetic correlation estimates revealed differences according to the usage of non‐recursive and recursive models, with a more biologically supported result for the recursive model estimation.” (Conclusions)

While I agree that a priori, it makes sense to assume a structural equations model with a causal structure, I don’t think the results provide much evidence that it’s better. The estimates of heritabilities and genetic correlations from the two models are near indistinguishable. Here is the key figure 4, comparing genetic correlation estimates:


As you can see, there are a couple of examples of genetic correlations where the point estimate switches sign, and one of them (Succinivibrio sp.) where the credible intervals don’t overlap. ”Recursive” is the structural equations model. The error bars are 95% credible intervals. This is not strong evidence of anything; the authors are responsible about it and don’t go into interpreting this difference. But let us speculate! They write:

All genera in this case, excepting Succinivibrio sp. from the Proteobacteria phylum, resulted in overlapped genetic cor- relations between the non‐recursive bivariate model and the recursive model. However, high differences were observed. Succinivibrio sp. showed the largest disagreement changing from positively correlated (0.08) in the non‐recursive bivariate model to negatively correlated (−0.20) in the recursive model.

Succinivibrio are also the taxon with the estimated largest inhibitory effect on methane (from the structural equations model).

While some taxa, such as ciliate protozoa or Methanobrevibacter sp., increased the CH4 emissions …, others such as Succinivibrio sp. from Proteobacteria phylum decreased it

Looking at the paper that first described these bacteria (Bryan & Small 1955),  Succinivibrio were originally isolated from the cattle rumen, and their name is because ”they ferment glucose with the production of a large amount of succinic acid”. Bryant & Small made a fermentation experiment to see what came out, and it seems that the bacteria don’t produce methane:


This is also in line with a rRNA sequencing study of high and low methane emitting cows (Wallace & al 2015) that found lower Succinivibrio abundance in high methane emitters.

We may speculate that Succinivibrio species could be involved in diverting energy from methanogens, and thus reducing methane emissions. If that is true, then the structural equations model estimate (larger genetic negative correlation between Succinivibrio abundance and methane) might be better than one from the animal model.

Finally, while I’m on board with the a priori argument for using a structural equations model, as with other applications of causal modelling (gene networks, Mendelian randomisation etc), it might be dangerous to consider only parts of the system independently, where the microbes are likely to have causal effects on each other.


Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

Wallace, R. John, et al. ”The rumen microbial metagenome associated with high methane production in cattle.” BMC genomics 16.1 (2015): 839.

Bryant, Marvin P., and Nola Small. ”Characteristics of two new genera of anaerobic curved rods isolated from the rumen of cattle.” Journal of bacteriology 72.1 (1956): 22.

Preprint: ”Genetics of recombination rate variation in the pig”

We have a new preprint posted, showing that recombination rate in the pig is lowly heritable and associated with alleles at RNF212.

We developed a new method to estimate recombinations in 150,000 pigs, and used that to estimate heritability and perform genome-wide association studies in 23,000.

Here is the preprint:

Johnsson M*, Whalen A*, Ros-Freixedes R, Gorjanc G, Chen C-Y, Herring WO, de Koning D-J, Hickey JM. (2020) Genetics of recombination rate variation in the pig. BioRxiv preprint. (* equal contribution)

Here is the abstract:

Background In this paper, we estimated recombination rate variation within the genome and between individuals in the pig for 150,000 pigs across nine genotyped pedigrees. We used this to estimate the heritability of recombination and perform a genome-wide association study of recombination in the pig.

Results Our results confirmed known features of the pig recombination landscape, including differences in chromosome length, and marked sex differences. The recombination landscape was repeatable between lines, but at the same time, the lines also showed differences in average genome-wide recombination rate. The heritability of genome-wide recombination was low but non-zero (on average 0.07 for females and 0.05 for males). We found three genomic regions associated with recombination rate, one of them harbouring the RNF212 gene, previously associated with recombination rate in several other species.

Conclusion Our results from the pig agree with the picture of recombination rate variation in vertebrates, with low but nonzero heritability, and a major locus that is homologous to one detected in several other species. This work also highlights the utility of using large-scale livestock data to understand biological processes.

Using R: simple Gantt chart with ggplot2

Jeremy Yoder’s code for a simple Gantt chart on the Molecular Ecologist blog uses geom_line and gather to prepare the data structure. I like using geom_linerange and a coord_flip, which lets you use start and end columns directly without pivoting.

Here is a very serious data frame of activities:

# A tibble: 6 x 4
  activity       category        start               end                
1 Clean house    preparations    2020-07-01 00:00:00 2020-07-03 00:00:00
2 Pack bags      preparations    2020-07-05 10:00:00 2020-07-05 17:00:00
3 Run to train   travel          2020-07-05 17:00:00 2020-07-05 17:15:00
4 Sleep on train travel          2020-07-05 17:15:00 2020-07-06 08:00:00
5 Procrastinate  procrastination 2020-07-01 00:00:00 2020-07-05 00:00:00
6 Sleep          vacation        2020-07-06 08:00:00 2020-07-09 00:00:00

And here is the code:


activities <- read_csv("activities.csv")

## Set factor level to order the activities on the plot
activities$activity <- factor(activities$activity,
                              levels = activities$activity[nrow(activities):1])
plot_gantt <- qplot(ymin = start,
                    ymax = end,
                    x = activity,
                    colour = category,
                    geom = "linerange",
                    data = activities,
                    size = I(5)) +
    scale_colour_manual(values = c("black", "grey", "purple", "yellow")) +
    coord_flip() +
    theme_bw() +
    theme(panel.grid = element_blank()) +
    xlab("") +
    ylab("") +
    ggtitle("Vacation planning")

Using R: 10 years with R

Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day:


cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")

R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again.

The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!

Using R: from plyr to purrr, part 0 out of however many

This post is me thinking out loud about applying functions to vectors or lists and getting data frames back.

Using R is an ongoing process of finding nice ways to throw data frames, lists and model objects around. While tidyr has arrived at a comfortable way to reshape dataframes with pivot_longer and pivot_wider, I don’t always find the replacements for the good old plyr package as satisfying.

Here is an example of something I used to like to do with plyr. Don’t laugh!

Assume we have a number of text files, all in the same format, that we need to read and combine. This arises naturally if you run some kind of analysis where the dataset gets split into chunks, like in genetics, where chunks might be chromosomes.

## Generate vector of file names
files <- paste("data/chromosome", 1:20, ".txt", sep = "")

genome <- ldply(files, read_tsv)

This gives us one big data frame, containing the rows from all those files.

If we want to move on from plyr, what are our options?

We can go old school with base R functions lapply and Reduce.


chromosomes <- lapply(files, read_tsv)
genome <- Reduce(rbind, chromosomes)

Here, we first let lapply read each file and store it in a list. Then we let Reduce fold the list with rbind, which binds the data frames in the list together, one below the other.

If that didn’t make sense, here it is again: lapply maps a function to each element of a vector or list, collecting the results in a list. Reduce folds the elements in a list together, using a function that takes in two arguments. The first argument will be the results it’s accumulated so far, and the second argument will be the next element of the list.

In the end, this leaves us, as with ldply, with one big data frame.

We can also use purrr‘s map_dfr. This seems to be the contemporary most elegant solution:


genome <- map_dfr(files, read_tsv)

map_dfr, like good old ldply will map over a vector or list, and collect resulting data frames. The ”r” in the name means adding the next data frame as rows. There is also a ”c” version (map_dfc) for adding as columns.

Things that really don’t matter: megabase or megabasepair

Should we talk about physical distance in genetics as number of base pairs (kbp, Mbp, and so on) or bases (kb, Mb)?

I got into a discussion about this recently, and I said I’d continue the struggle on my blog. Here it is. Let me first say that I don’t think this matters at all, and if you make a big deal out of this (or whether ”data” can be singular, or any of those inconsequential matters of taste we argue about for amusement), you shouldn’t. See this blog post as an exorcism, helping me not to trouble my colleagues with my issues.

What I’m objecting to is mostly the inconsistency of talking about long stretches of nucleotides as ”kilobase” and ”megabase” but talking about short stretches as ”base pairs”. I don’t think it’s very common to call a 100 nucleotide stretch ”a 100 b sequence”; I would expect ”100 bp”. For example, if we look at Ensembl, they might describe a large region as 1 Mb, but if you zoom in a lot, they give length in bp. My impression is that this is a common practice. However, if you consistently use ”bases” and ”megabase”, more power to you.

Unless you’re writing a very specific kind of bioinformatics paper, the risk of confusion with the computer storage unit isn’t a problem. But there are some biological arguments.

A biological argument for ”base”, might be that we care about the identity of the base, not the base pairing. We note only one nucleotide down when we write a nucleic acid sequence. The base pair is a different thing: that base bound to the one on the other strand that it’s paired with, or, if the DNA or RNA is single-stranded, it’s not even paired at all.

Conversely, a biochemical argument for ”base pair” might be that in a double-stranded molecule, the base pair is the relevant informational unit. We may only write one base in our nucleotide sequence for convenience, but because of the rules of base pairing, we know the complementing pair. In this case, maybe we should use ”base” for single-stranded molecules.

If we consult two more or less trustworthy sources, The Encylopedia of Life Sciences and Wiktionary, they both seem to take this view.

eLS says:

A megabase pair, abbreviated Mbp, is a unit of length of nucleic acids, equal to one million base pairs. The term ‘megabase’ (or Mb) is commonly used inter-changeably, although strictly this would refer to a single-stranded nucleic acid.

Wiktionary says:

A length of nucleic acid containing one million nucleotides (bases if single-stranded, base pairs if double-stranded)

Please return next week for the correct pronunciation of ”loci”.


Dear, P.H. (2006). Megabase Pair (Mbp). eLS.

If research is learning, how should researchers learn?

I’m taking a course on university pedagogy to, hopefully, become a better teacher. While reading about students’ learning and what teachers ought to do to facilitate it, I couldn’t help thinking about researchers’ learning, and what we ought to do to give ourselves a good learning environment.

Research is, largely, learning. First, a large part of any research work is learning what is already known, not just by me in particular; it’s a direct continuation of learning that takes place in courses. While doing any research project, we learn the concepts other researchers use in this specific sub-subfield, and the relations between them. First to the extent that we can orient ourselves, and eventually to be able to make a contribution that is intelligible to others who work there. We also learn their priorities, attitudes and platitudes. (Seriously, I suspect you learn a lot about a sub-subfield by trying to make jokes about it.) We also learn to do something new: perform a laboratory procedure, a calculation, or something like that.

But more importantly, research is learning about things no-one knows yet. The idea of constructivist learning theory seems apt: We are constructing new knowledge, building on pre-existing structures. We don’t go out and read the book of nature; we take the concepts and relations of our sub-subfield of choice, and graft, modify and rearrange them into our new model of the subject.

If there is something to this, it means that old clichéd phrases like ”institution of higher learning”, scientists as ”students of X”, and so on, name a deeper analogy than it might seem. It also suggests that innovations in student learning might also be good building blocks for research group management. Should we be concept mapping with our colleagues to figure out where we disagree about the definition of ”developmental pleiotropy”? It also makes one wonder why meetings and departmental seminars often take the form of sage on the stage lectures.

Two distinguishing traits of science are that there are errors all the time and that almost no-one can reproduce anything

I got annoyed and tweeted:

”If you can’t reproduce a result, it isn’t science” … so we’re at that stage now, when we write things that sound righteous but are nonsense.

Hashtag subtweet, I guess. But it doesn’t matter who first wrote the sentence I was complaining about; they won’t care what I think, and I’m not out to debate them. I only think the quoted sentence makes sense if you take ”science” to mean ”the truth”. The relationship between science and reproducibility is messier than that.

The first clause could mean a few different things:

You have previously produced a result, but now, you can’t reproduce it when you try …

Then you might have done something wrong the first time, or the second time. This is an everyday occurrence of any type of research, that probably happens to every postdoc every week. Not even purely theoretical results are safe. If the simulation is stochastic, one might have been interpreting noise. If there is an analytical result, one might have made an odd number of sign errors. In fact, it is a distinguishing trait of science that when we try to learn new things, there are errors all the time.

If that previous result is something that has been published, circulated to peers, and interpreted as if it was a useful finding, then that is unfortunate. The hypothetical you should probably make some effort to figure out why, and communicate that to peers. But it seems like a bad idea to suggest that because there was an error, you’re not doing science.

You personally can’t reproduce a results because you don’t have the expertise or resources …

Science takes a lot of skill and a lot of specialised technical stuff. I probably can’t reproduce even a simple organic chemistry experiment. In fact, it is a distinguishing trait of science that almost no-one can reproduce any of it, because it takes both expertise and special equipment.

No-one can ever reproduce a certain result even in principle …

It might still be science. The 1918 influenza epidemic will by the nature of time never happen again. Still, there is science about it.

You can’t reproduce someone else’s results when you try with a reasonably similar setup …

Of course, this is what the original authors of the sentence meant. When this turns out to be a common occurrence, as people systematically try to reproduce findings, there is clearly something wrong with the research methods scientists use: The original report may be the outcome of a meandering process of researcher degrees of freedom that produced a striking result that is unlikely to happen when the procedure is repeated, even with high fidelity. However, I would say that we’re dealing with bad science, rather than non-science. Reproducibility is not a demarcation criterion.

(Note: Some people reserve ”reproducibility” for the computational reproducibility of re-running someone’s analysis code and getting the same results. This was not the case with the sentence quoted above.)

Self-indulgent meta-post of the year

Dear diary,

Time for a recap of the On unicorns and genes blogging year. During 2019, your friendly neighbourhood genetics blog mostly kept to its schedule of four posts per month with some blog vacation in summer and in December.

This added up to a total of 43 posts (including this one), only one of them in Swedish (Gener påverkar ditt och datt, reflecting on how genome-wide association is often reported as if it was something else), and three of them posts about three first-author papers that came out in 2019:

Paper: ”Integrating selection mapping with genetic mapping and functional genomics”
Paper: ”Sequence variation, evolutionary constraint, and selection at the CD163 gene in pigs”
Paper: ”Removal of alleles by genome editing (RAGE) against deleterious load”

Now, let’s pick one post per month to represent the blogging year of 2019:

January: Showing a difference in means between two groups. This is one of those hard easy problems: Imagine we have an experiment comparing the means of two groups; how do we show both the data and the estimate of the difference, with uncertainty, in the same plot? In this little post, I try a more and a less radical version. I’ve since used the less radical version a couple of times.

February: ”We have reached peak gene and passed it”. Comment on an opinion piece that argued that revisions to the gene concept have important implications for modern genetics, and people need to be told. I agreed about a lot of the criticisms, but thought they should have nothing to do with gene concepts.

March: Journal club of one: ”Biological relevance of computationally predicted pathogenicity of noncoding variants”. Journal club post about a then recent paper about how variant effect prediction, especially for noncoding variants, is really hard, and not easy to evaluate fairly either.

April: Greek in biology. Comment on a stimulating essay about multilingualism in biology.

May: What single step does with relationship. A simulation and a couple of heatmaps to try to understand how the single step method describes relationship between individuals by blending genomic and pedigree relatedness.

June: Simulating genetic data with R: an example with deleterious variants Post from my talk at the Edinburgh R users group.

July: Using R: Correlation heatmap with ggplot2. I thought I’d update one of my most viewed posts, which was becoming too embarrassingly outdated. Unfortunately, I also broke its old address, and that wasn’t so smart.

August: Blog vacation.

September: Genes do not form networks. They just don’t. It seems DiFrisco & Jaeger (2019) don’t think so either. I would like to return to this later.

October: Using R: Animal model with simulated data. What it says on the tin: example of simulating a pedigree with AlphaSimR and fitting an animal model.

November: Using R: from gather to pivot. Introduction to, and celebration of, one of the best changes in the tidyverse.

December: Interpreting genome scans, with wisdom. Opinions about genome-wide association, precipitated by Eric Faumann’s Twitter account.

This is also the year I moved from the Roslin in Scotland to Uppsala, Sweden, for the second phase of the mobility grant-supported postdoc. Here are many of my worldly possessions, being hauled:

What will happen on here in 2020? The ambition is to keep a reasonably regular schedule, post about papers as they come out, and maybe write some more in Swedish.

I’ll also have a blog anniversary. The blog, admittedly rather different the first years, started in earnest in June 2010. I’m not sure how to celebrate, but I feel like I should follow up on one of the first posts (about linkage mapping and the probably spurious ”Gay Gene” at Xq28), now that Ganna et al. (2019) put genetic mapping of sexual orientation is in the news again.

Using R: Animal model with hglm and Stan (with Cholesky trick)

A few weeks ago I posted about fitting the quantitative genetic animal model with MCMCglmm and R-INLA. Since then, I listened to a talk by Lars Rönnegård, one of the creators of the hglm package, and this paper was published in GSE about animal models in Stan.


The hglm package fits hierarchical generalised linear models. That includes the animal model with pedigree or genomic relatedness. Hierarchical generalised linear models also allow you to model the dispersion of random effects, which lets you do tricks like variance QTL mapping (Rönnegård & Valdar 2011), breeding values for variances (Rönnegård et al. 2010) or genomic prediction models with predictors of marker variance (Mouresan, Selle & Rönnegård 2019). But let’s not get ahead of ourselves. How do we fit an animal model?

Here is the matrix formulation of the animal model that we skim through in every paper. It’s in this post because we will use the design matrix interface to hglm, which needs us to give it these matrices (this is not a paper, so we’re not legally obliged to include it):

\mathbf{y} = \mu + \mathbf{X} \mathbf{b} + \mathbf{Z} \mathbf{a} + \mathbf{e}

The terms are the the trait value, intercept, fixed coefficients and their design matrix, genetic coefficients and their design matrix, and the residual. The design matrix Z will contain one row and column for each individual, with a 1 to indicate its position in the phenotype table and pedigree and the rest zeros. If we sort our files, it’s an identity matrix.

The trick with the genetic coefficients is that they’re correlated, with a specific known correlation structure that we know from the pedigree (or in genomic models, from markers). It turns out (Lee, Nelder & Pawitan 2017, chapter 8) that you can change the Z matrix around so that it lets you fit the model with an identity covariance matrix, while still accounting for the correlations between relatives. You replace the random effects for relatedness with some transformed random effects that capture the same structure. One way to do this is with Cholesky decomposition.

\mathbf{Z_{fudged}} = \mathbf{Z_0} \mathbf{L}

As an example of what the Cholesky decomposition does, here is slice of the additive relationship matrix of 100 simulated individuals (the last generation of one replicate of these simulations) and the resulting matrix from Cholesky decomposition.

So instead of

\mathbf{a} \sim N(0, \mathbf{A} \sigma)

We can fit

\mathbf{a_{fudged}} \sim N(0, \mathbf{I} \sigma)

This lets us fit the animal model with hglm, by putting in a modified Z matrix.

Assuming we have data frames with a pedigree and a phenotype (like, again, from these simulations):


A  <- Amatrix(ped)

Z0  <- diag(1000)
L <- t(chol(A))
Z  <- Z0 %*% L
X <- model.matrix(~1, pheno)

model <- hglm(y = pheno$pheno,
              X = X,
              Z = Z,
              conv = 1e-8)

est_h2  <- model$varRanef / (model$varRanef + model$varFix)

(I found the recommendation to decrease the convergence criterion from the default for animal models in a YouTube video by Xia Chen.)


When we turn to Stan, we will meet the Cholesky trick again. Stan is a software for Markov Chain Monte Carlo, built to fit hierarchical linear models, and related high-dimensional models, more effectively than other sampling strategies (like Gibbs). rstan is a helpful package for running Stan from within R.

Nishio & Arakawa (2019) recently published a Stan script to fit an animal model, comparing Stan to a Gibbs sampler (and a related MCMC sampler that they also didn’t publish the code for). If we look into their Stan model code, they also do a Cholesky decomposition to be able to use an identity matrix for the variance.

First, they decompose the additive relationship matrix that the program takes in:

transformed data{
  matrix[K,K] LA;
  LA = cholesky_decompose(A);

And then, they express the model like this:

vector[N] mu;
vector[K] a;
a_decompose ~ normal(0, 1);
a = sigma_G * (LA * a_decompose);
mu = X * b + Z * a;
Y ~ normal(mu, sigma_R);

We can add this line to the generated quantities block of the Stan program to get heritability estimates directly:

real h2;
h2 = sigma_U / (sigma_U + sigma_E)

Here, we’ve saved their model to a stan file, and now we can run it from R:

pheno$scaled_pheno <- as.vector(scale(pheno$pheno))

model_stan <- stan(file = "nishio_arakawa.stan",
                   data = list(Y = pheno$scaled_pheno,
                               X = X,
                               A = A,
                               Z = Z0,
                               J = 1,
                               K = 1000,
                               N = 1000))

est_h2_stan <- summary(model_stan, pars = "h2")$summary

Important note that I always forget: It's important to scale your traits before you run this model. If not, the priors might be all wrong.

The last line pulls out the summary for the heritability parameter (that we added above). This gives us an estimate and an interval.

The paper also contains this entertaining passage about performance, which reads as if it was a response to a comment, actual or anticipated:

R language is highly extensible and provides a myriad of statistical and graphical techniques. However, R language has poor computation time compared to Fortran, which is especially well suited to numeric computation and scientific computing. In the present study, we developed the programs for GS and HMC in R but did not examine computation time; instead, we focused on examining the performance of estimating genetic parameters and breeding values.

Yes, two of their samplers (Gibbs and HMC) were written in R, but the one they end up advocating (and the one used above), is in Stan. Stan code gets translated into C++ and then compiled to machine code.

Stan with brms

If rstan lets us run Stan code from R and examine the output, brms lets us write down models in relatively straightforward R syntax. It’s like the MCMCglmm of the Stan world. We can fit an animal model with brms too, by directly plugging in the relationship matrix:

model_brms <- brm(scaled_pheno ~ 1 + (1|animal),
                  data = pheno,
                  family = gaussian(),
                  cov_ranef = list(animal = A),
                  chains = 4,
                  cores = 1,
                  iter = 2000)

Then, we can pull out the posterior samples for the variability, here expressed as standard errors, compute the heritability and then get the estimates (and interval, if we want):

posterior_brms <- posterior_samples(model_brms,
                                    pars = c("sd_animal", "sigma"))

h2_brms  <- posterior_brms[,1]^2 /
    (posterior_brms[,1]^2 + posterior_brms[,2]^2)

est_h2_brms <- mean(h2_brms)

(Code is on GitHub: both for the graphs above, and the models.)