# Virtual breeding and genetics journal club: ”Structural equation models to disentangle the biological relationship between microbiota and complex traits …”

The other day was the first Virtual breeding and genetics journal club organised by John Cole. This was the first online journal club I’ve attended (shocking, given how many video calls I’ve been on for other sciencey reasons), so I thought I’d write a little about it: both the format and the paper. You can look the slide deck from the journal club here (pptx file).

## The medium

We used Zoom, and that seemed to work, as I’m sure anything else would, if everyone just mute their microphone when they aren’t speaking. As John said, the key feature of Zoom seems to be the ability for the host to mute everyone else. During the call, I think we were at most 29 or so people, but only a handful spoke. It will probably get more intense with the turn taking if more people want to speak.

## The format

John started the journal club with a code of conduct, which I expect helped to set what I felt was a good atmosphere. In most journal clubs I’ve been in, I feel like the atmosphere has been pretty good, but I think we’ve all heard stories about hyper-critical and hostile journal clubs, and that doesn’t sound particularly fun or useful. On that note, one of the authors, Oscar González-Recio, was on the call and answered some questions.

## The paper

Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

The authors measured methane emissions (by analysing breath with with an infrared gas monitor) and abundance of different microbes in the rumen (with Nanopore sequencing) from dairy cows. They genotyped the animals for relatedness.

They analysed the genetic relationship between breath methane and abundance of each taxon of microbe, individually, with either:

• a bivariate animal model;
• a structural equations model that allows for a causal effect of abundance on methane, capturing the assumption that the abundance of a taxon can affect the methane emission, but not the other way around.

They used them to estimate heritabilities of abundances and genetic correlations between methane and abundances, and in the case of the structural model: conditional on the assumed causal model, the effect of that taxon’s abundance on methane.

## My thoughts

It’s cool how there’s a literature building up on genetic influences on the microbiome, with some consistency across studies. These intense high-tech studies on relatively few cattle might build up to finding new traits and proxies that can go into larger scale phenotyping for breeding.

As the title suggests, the paper advocates for using the structural equations model: ”Genetic correlation estimates revealed differences according to the usage of non‐recursive and recursive models, with a more biologically supported result for the recursive model estimation.” (Conclusions)

While I agree that a priori, it makes sense to assume a structural equations model with a causal structure, I don’t think the results provide much evidence that it’s better. The estimates of heritabilities and genetic correlations from the two models are near indistinguishable. Here is the key figure 4, comparing genetic correlation estimates:

As you can see, there are a couple of examples of genetic correlations where the point estimate switches sign, and one of them (Succinivibrio sp.) where the credible intervals don’t overlap. ”Recursive” is the structural equations model. The error bars are 95% credible intervals. This is not strong evidence of anything; the authors are responsible about it and don’t go into interpreting this difference. But let us speculate! They write:

All genera in this case, excepting Succinivibrio sp. from the Proteobacteria phylum, resulted in overlapped genetic cor- relations between the non‐recursive bivariate model and the recursive model. However, high differences were observed. Succinivibrio sp. showed the largest disagreement changing from positively correlated (0.08) in the non‐recursive bivariate model to negatively correlated (−0.20) in the recursive model.

Succinivibrio are also the taxon with the estimated largest inhibitory effect on methane (from the structural equations model).

While some taxa, such as ciliate protozoa or Methanobrevibacter sp., increased the CH4 emissions …, others such as Succinivibrio sp. from Proteobacteria phylum decreased it

Looking at the paper that first described these bacteria (Bryan & Small 1955),  Succinivibrio were originally isolated from the cattle rumen, and their name is because ”they ferment glucose with the production of a large amount of succinic acid”. Bryant & Small made a fermentation experiment to see what came out, and it seems that the bacteria don’t produce methane:

This is also in line with a rRNA sequencing study of high and low methane emitting cows (Wallace & al 2015) that found lower Succinivibrio abundance in high methane emitters.

We may speculate that Succinivibrio species could be involved in diverting energy from methanogens, and thus reducing methane emissions. If that is true, then the structural equations model estimate (larger genetic negative correlation between Succinivibrio abundance and methane) might be better than one from the animal model.

Finally, while I’m on board with the a priori argument for using a structural equations model, as with other applications of causal modelling (gene networks, Mendelian randomisation etc), it might be dangerous to consider only parts of the system independently, where the microbes are likely to have causal effects on each other.

## Literature

Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

Wallace, R. John, et al. ”The rumen microbial metagenome associated with high methane production in cattle.” BMC genomics 16.1 (2015): 839.

Bryant, Marvin P., and Nola Small. ”Characteristics of two new genera of anaerobic curved rods isolated from the rumen of cattle.” Journal of bacteriology 72.1 (1956): 22.

# Preprint: ”Genetics of recombination rate variation in the pig”

We have a new preprint posted, showing that recombination rate in the pig is lowly heritable and associated with alleles at RNF212.

We developed a new method to estimate recombinations in 150,000 pigs, and used that to estimate heritability and perform genome-wide association studies in 23,000.

Here is the preprint:

Johnsson M*, Whalen A*, Ros-Freixedes R, Gorjanc G, Chen C-Y, Herring WO, de Koning D-J, Hickey JM. (2020) Genetics of recombination rate variation in the pig. BioRxiv preprint. https://doi.org/10.1101/2020.03.17.995969 (* equal contribution)

Here is the abstract:

Background In this paper, we estimated recombination rate variation within the genome and between individuals in the pig for 150,000 pigs across nine genotyped pedigrees. We used this to estimate the heritability of recombination and perform a genome-wide association study of recombination in the pig.

Results Our results confirmed known features of the pig recombination landscape, including differences in chromosome length, and marked sex differences. The recombination landscape was repeatable between lines, but at the same time, the lines also showed differences in average genome-wide recombination rate. The heritability of genome-wide recombination was low but non-zero (on average 0.07 for females and 0.05 for males). We found three genomic regions associated with recombination rate, one of them harbouring the RNF212 gene, previously associated with recombination rate in several other species.

Conclusion Our results from the pig agree with the picture of recombination rate variation in vertebrates, with low but nonzero heritability, and a major locus that is homologous to one detected in several other species. This work also highlights the utility of using large-scale livestock data to understand biological processes.

# Things that really don’t matter: megabase or megabasepair

Should we talk about physical distance in genetics as number of base pairs (kbp, Mbp, and so on) or bases (kb, Mb)?

I got into a discussion about this recently, and I said I’d continue the struggle on my blog. Here it is. Let me first say that I don’t think this matters at all, and if you make a big deal out of this (or whether ”data” can be singular, or any of those inconsequential matters of taste we argue about for amusement), you shouldn’t. See this blog post as an exorcism, helping me not to trouble my colleagues with my issues.

What I’m objecting to is mostly the inconsistency of talking about long stretches of nucleotides as ”kilobase” and ”megabase” but talking about short stretches as ”base pairs”. I don’t think it’s very common to call a 100 nucleotide stretch ”a 100 b sequence”; I would expect ”100 bp”. For example, if we look at Ensembl, they might describe a large region as 1 Mb, but if you zoom in a lot, they give length in bp. My impression is that this is a common practice. However, if you consistently use ”bases” and ”megabase”, more power to you.

Unless you’re writing a very specific kind of bioinformatics paper, the risk of confusion with the computer storage unit isn’t a problem. But there are some biological arguments.

A biological argument for ”base”, might be that we care about the identity of the base, not the base pairing. We note only one nucleotide down when we write a nucleic acid sequence. The base pair is a different thing: that base bound to the one on the other strand that it’s paired with, or, if the DNA or RNA is single-stranded, it’s not even paired at all.

Conversely, a biochemical argument for ”base pair” might be that in a double-stranded molecule, the base pair is the relevant informational unit. We may only write one base in our nucleotide sequence for convenience, but because of the rules of base pairing, we know the complementing pair. In this case, maybe we should use ”base” for single-stranded molecules.

If we consult two more or less trustworthy sources, The Encylopedia of Life Sciences and Wiktionary, they both seem to take this view.

eLS says:

A megabase pair, abbreviated Mbp, is a unit of length of nucleic acids, equal to one million base pairs. The term ‘megabase’ (or Mb) is commonly used inter-changeably, although strictly this would refer to a single-stranded nucleic acid.

Wiktionary says:

A length of nucleic acid containing one million nucleotides (bases if single-stranded, base pairs if double-stranded)

Please return next week for the correct pronunciation of ”loci”.

Literature

Dear, P.H. (2006). Megabase Pair (Mbp). eLS.

# Using R: Animal model with hglm and Stan (with Cholesky trick)

A few weeks ago I posted about fitting the quantitative genetic animal model with MCMCglmm and R-INLA. Since then, I listened to a talk by Lars Rönnegård, one of the creators of the hglm package, and this paper was published in GSE about animal models in Stan.

# hglm

The hglm package fits hierarchical generalised linear models. That includes the animal model with pedigree or genomic relatedness. Hierarchical generalised linear models also allow you to model the dispersion of random effects, which lets you do tricks like variance QTL mapping (Rönnegård & Valdar 2011), breeding values for variances (Rönnegård et al. 2010) or genomic prediction models with predictors of marker variance (Mouresan, Selle & Rönnegård 2019). But let’s not get ahead of ourselves. How do we fit an animal model?

Here is the matrix formulation of the animal model that we skim through in every paper. It’s in this post because we will use the design matrix interface to hglm, which needs us to give it these matrices (this is not a paper, so we’re not legally obliged to include it):

$\mathbf{y} = \mu + \mathbf{X} \mathbf{b} + \mathbf{Z} \mathbf{a} + \mathbf{e}$

The terms are the the trait value, intercept, fixed coefficients and their design matrix, genetic coefficients and their design matrix, and the residual. The design matrix Z will contain one row and column for each individual, with a 1 to indicate its position in the phenotype table and pedigree and the rest zeros. If we sort our files, it’s an identity matrix.

The trick with the genetic coefficients is that they’re correlated, with a specific known correlation structure that we know from the pedigree (or in genomic models, from markers). It turns out (Lee, Nelder & Pawitan 2017, chapter 8) that you can change the Z matrix around so that it lets you fit the model with an identity covariance matrix, while still accounting for the correlations between relatives. You replace the random effects for relatedness with some transformed random effects that capture the same structure. One way to do this is with Cholesky decomposition.

$\mathbf{Z_{fudged}} = \mathbf{Z_0} \mathbf{L}$

As an example of what the Cholesky decomposition does, here is slice of the additive relationship matrix of 100 simulated individuals (the last generation of one replicate of these simulations) and the resulting matrix from Cholesky decomposition.

$\mathbf{a} \sim N(0, \mathbf{A} \sigma)$

We can fit

$\mathbf{a_{fudged}} \sim N(0, \mathbf{I} \sigma)$

This lets us fit the animal model with hglm, by putting in a modified Z matrix.

Assuming we have data frames with a pedigree and a phenotype (like, again, from these simulations):

library(AGHmatrix)
library(hglm)

A  <- Amatrix(ped)

Z0  <- diag(1000)
L <- t(chol(A))
Z  <- Z0 %*% L
X <- model.matrix(~1, pheno)

model <- hglm(y = pheno$pheno, X = X, Z = Z, conv = 1e-8) est_h2 <- model$varRanef / (model$varRanef + model$varFix)



(I found the recommendation to decrease the convergence criterion from the default for animal models in a YouTube video by Xia Chen.)

# Stan

When we turn to Stan, we will meet the Cholesky trick again. Stan is a software for Markov Chain Monte Carlo, built to fit hierarchical linear models, and related high-dimensional models, more effectively than other sampling strategies (like Gibbs). rstan is a helpful package for running Stan from within R.

Nishio & Arakawa (2019) recently published a Stan script to fit an animal model, comparing Stan to a Gibbs sampler (and a related MCMC sampler that they also didn’t publish the code for). If we look into their Stan model code, they also do a Cholesky decomposition to be able to use an identity matrix for the variance.

First, they decompose the additive relationship matrix that the program takes in:

transformed data{
matrix[K,K] LA;
LA = cholesky_decompose(A);
}


And then, they express the model like this:

vector[N] mu;
vector[K] a;
a_decompose ~ normal(0, 1);
a = sigma_G * (LA * a_decompose);
mu = X * b + Z * a;
Y ~ normal(mu, sigma_R);


We can add this line to the generated quantities block of the Stan program to get heritability estimates directly:

real h2;
h2 = sigma_U / (sigma_U + sigma_E)


Here, we’ve saved their model to a stan file, and now we can run it from R:

pheno$scaled_pheno <- as.vector(scale(pheno$pheno))

model_stan <- stan(file = "nishio_arakawa.stan",
data = list(Y = pheno$scaled_pheno, X = X, A = A, Z = Z0, J = 1, K = 1000, N = 1000)) est_h2_stan <- summary(model_stan, pars = "h2")$summary


Important note that I always forget: It's important to scale your traits before you run this model. If not, the priors might be all wrong.

The last line pulls out the summary for the heritability parameter (that we added above). This gives us an estimate and an interval.

The paper also contains this entertaining passage about performance, which reads as if it was a response to a comment, actual or anticipated:

R language is highly extensible and provides a myriad of statistical and graphical techniques. However, R language has poor computation time compared to Fortran, which is especially well suited to numeric computation and scientific computing. In the present study, we developed the programs for GS and HMC in R but did not examine computation time; instead, we focused on examining the performance of estimating genetic parameters and breeding values.

Yes, two of their samplers (Gibbs and HMC) were written in R, but the one they end up advocating (and the one used above), is in Stan. Stan code gets translated into C++ and then compiled to machine code.

# Stan with brms

If rstan lets us run Stan code from R and examine the output, brms lets us write down models in relatively straightforward R syntax. It’s like the MCMCglmm of the Stan world. We can fit an animal model with brms too, by directly plugging in the relationship matrix:

model_brms <- brm(scaled_pheno ~ 1 + (1|animal),
data = pheno,
family = gaussian(),
cov_ranef = list(animal = A),
chains = 4,
cores = 1,
iter = 2000)


Then, we can pull out the posterior samples for the variability, here expressed as standard errors, compute the heritability and then get the estimates (and interval, if we want):

posterior_brms <- posterior_samples(model_brms,
pars = c("sd_animal", "sigma"))

h2_brms  <- posterior_brms[,1]^2 /
(posterior_brms[,1]^2 + posterior_brms[,2]^2)

est_h2_brms <- mean(h2_brms)


(Code is on GitHub: both for the graphs above, and the models.)

# Interpreting genome scans, with wisdom

Eric Fauman is a scientist at Pfizer who also tweets out interpretations of genome-wide association scans.

Background: There is a GWASbot twitter account which posts Manhattan plots with links for various traits from the UK Biobank. The bot was made by the Genetic Epidemiology lab at the Finnish Institute for Molecular Medicine and Harvard. The source of the results is these genome scans (probably; it’s little bit opaque); the bot also links to heritability and genetic correlation databases. There is also an EnrichrBot that replies with enrichment of chromatin marks (Chen et al. 2013). Fauman’s comments on some of the genome scans on his Twitter account.

Here are a couple of recent ones:

And here is his list of these threads as a Google Document.

This makes me thing of three things, two good, and one bad.

# 1. The ephemeral nature of genome scans

Isn’t it great that we’re now at a stage where a genome scan can be something to be tweeted or put en masse in a database, instead of published one paper per scan with lots of boilerplate. The researchers behind the genome scans say as much in their 2017 blog post on the first release:

To further enhance the value of this resource, we have performed a basic association test on ~337,000 unrelated individuals of British ancestry for over 2,000 of the available phenotypes. We’re making these results available for browsing through several portals, including the Global Biobank Engine where they will appear soon. They are also available for download here.

We have decided not to write a scientific article for publication based on these analyses. Rather, we have described the data processing in a detailed blog post linked to the underlying code repositories. The decision to eschew scientific publication for the basic association analysis is rooted in our view that we will continue to work on and analyze these data and, as a result, writing a paper would not reflect the current state of the scientific work we are performing. Our goal here is to make these results available as quickly as possible, for any geneticist, biologist or curious citizen to explore. This is not to suggest that we will not write any papers on these data, but rather only write papers for those activities that involve novel method development or more complex analytic approaches. A univariate genome-wide association analysis is now a relatively well-established activity, and while the scale of this is a bit grander than before, that in and of itself is a relatively perfunctory activity. [emphasis mine] Simply put, let the data be free.

That being said, when starting to write this post, first I missed a paper. It was pretty frustrating to find a detailed description of the methods: after circling back and forth between the different pages that link to each other, I landed on the original methods post, which is informative, and written in a light conversational style. On the internet, one would fear that this links may rot and die eventually, and a paper would probably (but not necessarily …) be longer-lasting.

# 2. Everything is a genome scan, if you’re brave enough

Another thing that the GWAS bot drives home is that you can map anything that you can measure. The results are not always straightforward. On the other hand, even if the trait in question seems a bit silly, the results are not necessarily nonsense either.

There is a risk, for geneticists and non-geneticists alike, to reify traits based on their genetic parameters. If we can measure the heritability coefficient of something, and localise it in the genome with a genome-wide association study, it better be a real and important thing, right? No. The truth is that geneticists choose traits to measure the same way all researchers choose things to measure. Sometimes for great reasons with serious validation and considerations about usefulness. Sometimes just because. The GWAS bot also helpfully links to the UK Biobank website that describes the traits.

Now substitute bread intake for some more socially relevant trait, also imperfectly measured.

# 3. Lost, like tweets in rain

Genome scan interpretation is just that: interpretation. It means pulling together quantitative data, a knowledge of biology, previous literature, and writing an unstructured text, such as a Discussion section or a Twitter thread. This makes them harder to organise, store and build on than the genome scans themselves. Sure, Fauman’s Twitter threads are linked from the above Google Document, and our Discussion sections are available from the library. But they’re spread out in different places, they mix (as they should) evidence with evaluation and speculation, and it’s not like we have a structured vocabulary for describing genetic mechanisms of quantitative trait loci, and the levels of evidence for them. Maybe we could, with genome-wide association study ontologies and wikis.

# X-related genes

It is hard to interpret gene lists. But before we would even get into the statistical properties of annotation term enrichment, or whether network models are appropriate, or anything like that, we have the simpler problem of how to talk, colloquially, about genes connected with a biological process. In particular, there is a weak way to describe gene function one ought to avoid.

What is, for example, an immune-related gene? Why, it’s a gene that is important to immune function, of course! Is beta-catenin an immune-related gene? Wnt signalling is certainly important to immune cell differentiation (Chae & Bothwell 2018), and beta-catenin is certainly important to Wnt signalling function.

Similarly, Paris is a city in France. Therefore, all cities in France are Paris-related.

The thing is, any indirect mechanism can be a mechanism of genuine genetic causation, and this one isn’t even very roundabout. I couldn’t find a known Mendelian disorder with a mechanism that fit the above story, but I don’t think it’s out of the question. At the same time, labeling everything Wnt ”immune-related” would be a little silly, because those genes also do all sorts of other things. If the omnigenenic hypothesis of near-universal pleiotropy is correct, we should expect a lot of genetic causation to be like that: indirect, based on common pathways that do many kinds of different work in different parts of the organism.

That leaves X-related genes a vague notion that can contract or expand at will. From now on, I will think twice before using it.

# Using R: Animal model with simulated data

Last week’s post just happened to use MCMCglmm as an example of an R package that can get confused by tibble-style data frames. To make that example, I simulated some pedigree and trait data. Just for fun, let’s look at the simulation code, and use MCMCglmm and AnimalINLA to get heritability estimates.

First, here is some AlphaSimR code that creates a small random mating population, and collects trait and pedigree:

library(AlphaSimR)

## Founder population
FOUNDERPOP <- runMacs(nInd = 100,
nChr = 20,
inbred = FALSE,
species = "GENERIC")

## Simulation parameters
SIMPARAM <- SimParam$new(FOUNDERPOP) SIMPARAM$addTraitA(nQtlPerChr = 100,
mean = 100,
var = 10)
SIMPARAM$setGender("yes_sys") SIMPARAM$setVarE(h2 = 0.3)

## Random mating for 9 more generations
generations <- vector(mode = "list", length = 10)
generations[[1]] <- newPop(FOUNDERPOP,
simParam = SIMPARAM)

for (gen in 2:10) {

generations[[gen]] <- randCross(generations[[gen - 1]],
nCrosses = 10,
nProgeny = 10,
simParam = SIMPARAM)

}

## Put them all together
combined <- Reduce(c, generations)

## Extract phentoypes
pheno <- data.frame(animal = combined@id,
pheno = combined@pheno[,1])

## Extract pedigree
ped <- data.frame(id = combined@id,
dam = combined@mother,
sire =combined@father)
ped$dam[ped$dam == 0] <- NA
ped$sire[ped$sire == 0] <- NA

## Write out the files
write.csv(pheno,
file = "sim_pheno.csv",
row.names = FALSE,
quote = FALSE)

write.csv(ped,
file = "sim_ped.csv",
row.names = FALSE,
quote = FALSE)


In turn, we:

1. Set up a founder population with a AlphaSimR’s generic livestock-like population history, and 20 chromosomes.
2. Choose simulation parameters: we have an organism with separate sexes, a quantitative trait with an additive polygenic architecture, and we want an environmental variance to give us a heritability of 0.3.
3. We store away the founders as the first generation, then run a loop to give us nine additional generations of random mating.
4. Combine the resulting generations into one population.
5. Extract phenotypes and pedigree into their own data frames.
6. Optionally, save the latter data frames to files (for the last post).

Now that we have some data, we can fit a quantitative genetic pedigree model (”animal model”) to estimate genetic parameters. We’re going to try two methods to fit it: Markov Chain Monte Carlo and (the unfortunately named) Integrated Nested Laplace Approximation. MCMC explores the posterior distribution by sampling; I’m not sure where I heard it described as ”exploring a mountain by random teleportation”. INLA makes approximations to the posterior that can be integrated numerically; I guess it’s more like building a sculpture of the mountain.

First, a Gaussian animal model in MCMCglmm:

library(MCMCglmm)

## Gamma priors for variances
prior_gamma <- list(R = list(V = 1, nu = 1),
G = list(G1 = list(V = 1, nu = 1)))

## Fit the model
model_mcmc  <- MCMCglmm(scaled ~ 1,
random = ~ animal,
family = "gaussian",
prior = prior_gamma,
pedigree = ped,
data = pheno,
nitt = 100000,
burnin = 10000,
thin = 10)

## Calculate heritability for heritability from variance components
h2_mcmc_object  <- model_mcmc$VCV[, "animal"] / (model_mcmc$VCV[, "animal"] + model_mcmc$VCV[, "units"]) ## Summarise results from that posterior h2_mcmc <- data.frame(mean = mean(h2_mcmc_object), lower = quantile(h2_mcmc_object, 0.025), upper = quantile(h2_mcmc_object, 0.975), method = "MCMC", stringsAsFactors = FALSE)  And here is a similar animal model in AnimalINLA: library(AnimalINLA) ## Format pedigree to AnimalINLA's tastes ped_inla <- ped ped_inla$id  <- as.numeric(ped_inla$id) ped_inla$dam  <- as.numeric(ped_inla$dam) ped_inla$dam[is.na(ped_inla$dam)] <- 0 ped_inla$sire  <- as.numeric(ped_inla$sire) ped_inla$sire[is.na(ped_inla$sire)] <- 0 ## Turn to relationship matrix A_inv <- compute.Ainverse(ped_inla) ## Fit the model model_inla <- animal.inla(response = scaled, genetic = "animal", Ainverse = A_inv, type.data = "gaussian", data = pheno, verbose = TRUE) ## Pull out summaries from the model object summary_inla <- summary(model_inla) ## Summarise results h2_inla <- data.frame(mean = summary_inla$summary.hyperparam["Heritability", "mean"],
lower = summary_inla$summary.hyperparam["Heritability", "0.025quant"], upper = summary_inla$summary.hyperparam["Heritability", "0.975quant"],
method = "INLA",
stringsAsFactors = FALSE)


If we wrap this all in a loop, we can see how the estimation methods do on replicate data (full script on GitHub). Here are estimates and intervals from ten replicates (black dots show the actual heritability in the first generation):

As you can see, the MCMC and INLA estimates agree pretty well and mostly hit the mark. In the one replicate dataset where they falter, they falter together.