Robertson on genetic correlation and loss of variation

It’s not too uncommon to see animal breeding papers citing a paper by Alan Robertson (1959) to support a genetic correlation of 0.8 as a cut-off point for what is a meaningful difference. What is that based on?

The paper is called ”The sampling variance of the genetic correlation coefficient” and, as the name suggests, it is about methods for estimating genetic correlations. It contains a section about the genetic correlation between environments as a way to measure gene-by-environment interaction. There, Robertson discusses experimental designs for detecting gene-by-environment interaction–that is, estimating whether a genetic correlation between different environments is less than one. He finds that you need much larger samples than for estimating heritabilities. It is in this context that the 0.8 number comes up. Here is the whole paragraph:

No interaction means a genetic correlation of unity. How much must the correlation fall before it has biological or agricultural importance? I would suggest that this figure is around 0.8 and that no experiment on genotype-environment interaction would have been worth doing unless it could have detected, as a significant deviation from unity, a genetic correlation of 0.6. In the first instance, I propose to argue from the standpoint of a standard error of 0.2 as an absolute minimum.

That is, in the context of trying to make study design recommendations for detecting genotype-by-environment interactions, Robertson suggests that a genetic correlation of 0.8 might be a meaningful difference from 1. The paper does not deal with designing breeding programs for multiple environments or the definition of traits, and it has no data on any of that. It seems to be a little bit like Fisher’s p < 0.05: Suggest a rule of thumb, and risk it having a life of its own in the future.
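Robertson's often-quoted approximation for the sampling standard error of a genetic correlation (reproduced in, for example, Falconer & Mackay's textbook) makes it easy to play with numbers like these. Here is a minimal sketch in R; the heritabilities and standard errors are made up for illustration, not taken from the paper:

```r
## Robertson's (1959) approximation for the sampling standard error of a
## genetic correlation, in terms of the two heritability estimates and
## their standard errors. Input values below are made up for illustration.
se_genetic_correlation <- function(rg, h2_x, h2_y, se_h2_x, se_h2_y) {
  (1 - rg^2) / sqrt(2) * sqrt((se_h2_x * se_h2_y) / (h2_x * h2_y))
}

se_genetic_correlation(rg = 0.8,
                       h2_x = 0.3, h2_y = 0.3,
                       se_h2_x = 0.1, se_h2_y = 0.1)
```

Note how the factor (1 − rg²) shrinks the standard error as the correlation approaches unity, so how well an experiment can separate 0.8 from 1 depends heavily on how precisely the two heritabilities are estimated.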

In the process of looking up this quote, I also found this little gem, from ”The effect of selection on the estimation of genetic parameters” (Robertson 1977). It talks about the problems that arise with estimating genetic parameters in populations under selection, when many quantitative genetic results, in one way or another, depend on random mating. Here is how it ends:

This perhaps points the moral of this paper. The individuals of one generation are the parents of the next — if they are accurately evaluated and selected in the first generation, the variation between families will be reduced in the next. You cannot have your cake and eat it.


Robertson, A. ”The sampling variance of the genetic correlation coefficient.” Biometrics 15.3 (1959): 469-485.

Robertson, A. ”The effect of selection on the estimation of genetic parameters.” Zeitschrift für Tierzüchtung und Züchtungsbiologie 94.1‐4 (1977): 131-135.

Using R: setting a colour scheme in ggplot2

Note to self: How to quickly set a colour scheme in ggplot2.

Imagine we have a series of plots that all need a uniform colour scale. The same category needs to have the same colour in all graphics, possibly made with different packages and by different people. Instead of hard-coding the colours and the order of categories, we can put them in a file, like so:

library(readr)

colours <- read_csv("scale_colours.csv")
# A tibble: 5 x 2
  name   colour 
1 blue   #d4b9da
2 red    #c994c7
3 purple #df65b0
4 green  #dd1c77
5 orange #980043

Now a plot with default colours, using some made-up data:

library(ggplot2)
library(tidyr)

x <- 1:100

beta <- rnorm(5, 1, 0.5)

stroop <- data.frame(x,
                     sapply(beta, function(b) x * b + rnorm(100, 1, 10)))
colnames(stroop)[2:6] <- c("orange", "blue", "red", "purple", "green") 

data_long <- pivot_longer(stroop, -x)

plot_y <- qplot(x = x,
                y = value,
                colour = name,
                data = data_long) +
  theme_minimal() +
  theme(panel.grid = element_blank())

Now we can add the custom scale like this:

plot_y_colours <- plot_y + 
  scale_colour_manual(limits = colours$name,
                      values = colours$colour)
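The same two-column file can also drive a fill scale, for plots in the series that use fills instead of colours. A sketch, with the colour key built inline instead of read from file:

```r
library(ggplot2)

## Same colour key as in scale_colours.csv, built inline for this example
colours <- data.frame(name = c("blue", "red", "purple", "green", "orange"),
                      colour = c("#d4b9da", "#c994c7", "#df65b0",
                                 "#dd1c77", "#980043"))

## Made-up data for a bar chart
bar_data <- data.frame(name = colours$name, value = 1:5)

plot_bars <- ggplot(bar_data, aes(x = name, y = value, fill = name)) +
  geom_col() +
  scale_fill_manual(limits = colours$name,
                    values = colours$colour)
```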

Virtual animal breeding journal club: ”An eQTL in the cystathionine beta synthase gene is linked to osteoporosis in laying hens”

The other day the International Virtual Animal Breeding Journal Club, organised by John Cole, had its second meeting. I presented a recent paper about using genetic mapping and gene expression to find a putative causative gene for a region associated with bone strength in layer chickens. It is from colleagues I know and work with, but I wasn't involved in this work myself.

Here is the paper:

De Koning, Dirk-Jan, et al. ”An eQTL in the cystathionine beta synthase gene is linked to osteoporosis in laying hens.” Genetics Selection Evolution 52.1 (2020): 1-17.

Here are my slides:

Ian Dunn and DJ de Koning were both on the call to answer some questions and give the authors’ perspective, which, again, I thought was very useful. I hope this becomes a recurring theme of the journal club.

I chose the paper because I think it's a good example of the QTL–eQTL paradigm of causative gene identification. It generated some discussion. Conclusions: you never really know whether an association with gene expression is causal or reactive, unless there's some kind of experimental manipulation. We all want more annotation, more functional genomics and more genome sequences. I can't argue with that.

Here is the review of layer chicken bone biology referred to in the slides, if you want to look into that:

Whitehead, C. C. ”Overview of bone biology in the egg-laying hen.” Poultry science 83.2 (2004): 193-199.

If you want to follow the journal club, see the Google group and Twitter account for announcements.

Virtual animal breeding journal club: ”Structural equation models to disentangle the biological relationship between microbiota and complex traits …”

The other day was the first Virtual breeding and genetics journal club organised by John Cole. This was the first online journal club I've attended (shocking, given how many video calls I've been on for other sciencey reasons), so I thought I'd write a little about it: both the format and the paper. You can look at the slide deck from the journal club here (pptx file).

The medium

We used Zoom, and that seemed to work, as I'm sure anything else would, if everyone just mutes their microphone when they aren't speaking. As John said, the key feature of Zoom seems to be the ability for the host to mute everyone else. During the call, I think we were at most 29 or so people, but only a handful spoke. Turn-taking will probably get more intense if more people want to speak.

The format

John started the journal club with a code of conduct, which I expect helped to set what I felt was a good atmosphere. In most journal clubs I’ve been in, I feel like the atmosphere has been pretty good, but I think we’ve all heard stories about hyper-critical and hostile journal clubs, and that doesn’t sound particularly fun or useful. On that note, one of the authors, Oscar González-Recio, was on the call and answered some questions.

The paper

Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

The authors measured methane emissions (by analysing breath with an infrared gas monitor) and abundance of different microbes in the rumen (with Nanopore sequencing) from dairy cows. They also genotyped the animals to estimate relatedness.

They analysed the genetic relationship between breath methane and abundance of each taxon of microbe, individually, with either:

  • a bivariate animal model;
  • a structural equations model that allows for a causal effect of abundance on methane, capturing the assumption that the abundance of a taxon can affect the methane emission, but not the other way around.

They used these models to estimate heritabilities of abundances and genetic correlations between methane and abundances, and, in the case of the structural model, the effect of each taxon's abundance on methane, conditional on the assumed causal model.

My thoughts

It’s cool how there’s a literature building up on genetic influences on the microbiome, with some consistency across studies. These intense high-tech studies on relatively few cattle might build up to finding new traits and proxies that can go into larger scale phenotyping for breeding.

As the title suggests, the paper advocates for using the structural equations model: ”Genetic correlation estimates revealed differences according to the usage of non‐recursive and recursive models, with a more biologically supported result for the recursive model estimation.” (Conclusions)

While I agree that a priori, it makes sense to assume a structural equations model with a causal structure, I don’t think the results provide much evidence that it’s better. The estimates of heritabilities and genetic correlations from the two models are near indistinguishable. Here is the key figure 4, comparing genetic correlation estimates:


As you can see, there are a couple of examples of genetic correlations where the point estimate switches sign, and one of them (Succinivibrio sp.) where the credible intervals don’t overlap. ”Recursive” is the structural equations model. The error bars are 95% credible intervals. This is not strong evidence of anything; the authors are responsible about it and don’t go into interpreting this difference. But let us speculate! They write:

All genera in this case, excepting Succinivibrio sp. from the Proteobacteria phylum, resulted in overlapped genetic correlations between the non‐recursive bivariate model and the recursive model. However, high differences were observed. Succinivibrio sp. showed the largest disagreement changing from positively correlated (0.08) in the non‐recursive bivariate model to negatively correlated (−0.20) in the recursive model.

Succinivibrio are also the taxon with the largest estimated inhibitory effect on methane (from the structural equations model).

While some taxa, such as ciliate protozoa or Methanobrevibacter sp., increased the CH4 emissions …, others such as Succinivibrio sp. from Proteobacteria phylum decreased it

Looking at the paper that first described these bacteria (Bryant & Small 1956), Succinivibrio were originally isolated from the cattle rumen, and they got their name because ”they ferment glucose with the production of a large amount of succinic acid”. Bryant & Small ran a fermentation experiment to see what came out, and it seems that the bacteria don't produce methane:


This is also in line with an rRNA sequencing study of high and low methane emitting cows (Wallace et al. 2015) that found lower Succinivibrio abundance in high methane emitters.

We may speculate that Succinivibrio species could be involved in diverting energy from methanogens, and thus reducing methane emissions. If that is true, then the structural equations model estimate (a larger negative genetic correlation between Succinivibrio abundance and methane) might be better than the one from the animal model.

Finally, while I'm on board with the a priori argument for using a structural equations model, as with other applications of causal modelling (gene networks, Mendelian randomisation etc.), considering only parts of the system independently might be dangerous when the microbes are likely to have causal effects on each other.


Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

Wallace, R. John, et al. ”The rumen microbial metagenome associated with high methane production in cattle.” BMC genomics 16.1 (2015): 839.

Bryant, Marvin P., and Nola Small. ”Characteristics of two new genera of anaerobic curved rods isolated from the rumen of cattle.” Journal of bacteriology 72.1 (1956): 22.

Preprint: ”Genetics of recombination rate variation in the pig”

We have a new preprint posted, showing that recombination rate in the pig is lowly heritable and associated with alleles at RNF212.

We developed a new method to estimate recombinations in 150,000 pigs, and used that to estimate heritability and perform genome-wide association studies in 23,000.

Here is the preprint:

Johnsson M*, Whalen A*, Ros-Freixedes R, Gorjanc G, Chen C-Y, Herring WO, de Koning D-J, Hickey JM. (2020) Genetics of recombination rate variation in the pig. BioRxiv preprint. (* equal contribution)

Here is the abstract:

Background In this paper, we estimated recombination rate variation within the genome and between individuals in the pig for 150,000 pigs across nine genotyped pedigrees. We used this to estimate the heritability of recombination and perform a genome-wide association study of recombination in the pig.

Results Our results confirmed known features of the pig recombination landscape, including differences in chromosome length, and marked sex differences. The recombination landscape was repeatable between lines, but at the same time, the lines also showed differences in average genome-wide recombination rate. The heritability of genome-wide recombination was low but non-zero (on average 0.07 for females and 0.05 for males). We found three genomic regions associated with recombination rate, one of them harbouring the RNF212 gene, previously associated with recombination rate in several other species.

Conclusion Our results from the pig agree with the picture of recombination rate variation in vertebrates, with low but nonzero heritability, and a major locus that is homologous to one detected in several other species. This work also highlights the utility of using large-scale livestock data to understand biological processes.

Using R: simple Gantt chart with ggplot2

Jeremy Yoder’s code for a simple Gantt chart on the Molecular Ecologist blog uses geom_line and gather to prepare the data structure. I like using geom_linerange and a coord_flip, which lets you use start and end columns directly without pivoting.

Here is a very serious data frame of activities:

# A tibble: 6 x 4
  activity       category        start               end                
1 Clean house    preparations    2020-07-01 00:00:00 2020-07-03 00:00:00
2 Pack bags      preparations    2020-07-05 10:00:00 2020-07-05 17:00:00
3 Run to train   travel          2020-07-05 17:00:00 2020-07-05 17:15:00
4 Sleep on train travel          2020-07-05 17:15:00 2020-07-06 08:00:00
5 Procrastinate  procrastination 2020-07-01 00:00:00 2020-07-05 00:00:00
6 Sleep          vacation        2020-07-06 08:00:00 2020-07-09 00:00:00

And here is the code:


library(readr)
library(ggplot2)

activities <- read_csv("activities.csv")

## Set factor level to order the activities on the plot
activities$activity <- factor(activities$activity,
                              levels = activities$activity[nrow(activities):1])
plot_gantt <- qplot(ymin = start,
                    ymax = end,
                    x = activity,
                    colour = category,
                    geom = "linerange",
                    data = activities,
                    size = I(5)) +
    scale_colour_manual(values = c("black", "grey", "purple", "yellow")) +
    coord_flip() +
    theme_bw() +
    theme(panel.grid = element_blank()) +
    xlab("") +
    ylab("") +
    ggtitle("Vacation planning")

Using R: 10 years with R

Yesterday, 29 February 2020, was the 20th anniversary of the release of R 1.0.0. Jozef Hajnala's blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, which I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote this every day:


cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")

R/qtl is one of my favourite pieces of research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in a course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own, given in 2017. It would be fun to update it and give it again.

The last few years, I've been using R for reasonably large genome datasets in an HPC environment, and have gotten back to the beginnings, I guess, by using Bioconductor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It's great to be able to throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with Rcpp and RcppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!

Reflections on higher education pedagogy, take 2

Dear diary,

More thoughts from the continuation course in higher education pedagogy.

At the second course meeting, we spent quite a lot of time on a romantic ideal of the university: both the session on pedagogical development in a larger context and the session on research-linked teaching drew much inspiration from Humboldt's ideal of higher education as a unity of education and research, which should impart to students liberal education (”bildning”) and a general ability to think independently, rather than subject-specific and professional knowledge.

It's in the Swedish Higher Education Act and everything:

The activities shall be conducted so that there is a close connection between research and education.

First-cycle education shall develop the students'
– ability to make independent and critical judgements,
– ability to independently identify, formulate and solve problems, and
– preparedness to deal with changes in working life.

Within the field covered by the education, students shall, beyond knowledge and skills, develop the ability to
– search for and evaluate knowledge at a scholarly level,
– follow the development of knowledge, and
– exchange knowledge with others, including people without specialist knowledge of the field.

But the law also writes about working life, professional practice and so on. It sounds like a mixture of the vision above and other considerations, which perhaps matches the history of the university, containing both holistic Humboldt and a start as a glorified seminary for priests.

That is the kind of thing we talked about. What is a university, really? What is the point of the kind of education we run? A few times I got the impression that this part of the course was about convincing us that the ideal of liberal education is good and important.

But the local environment is probably even more important than the larger framing. A good environment with colleagues who care about and support each other obviously helps make teaching better. Mårtensson & Roxå (2011) looked at four ”strong academic microcultures”, that is, places in universities where teaching and research were considered to work very well. As one might expect, these are groups that hold teaching to be very important, trust each other, have good relations with others, and experience a shared enterprise.

The enterprises in question seemed to be long-term goals directed outwards. The point was to shape the field, to educate students who will develop the profession, to influence the wider world. Not ”to become an excellent teaching environment”, or whatever it probably says in documents higher up in the organisation. The authors use the word ”enterprise”. It is reminiscent of the difference between external and internal motivation in the learner. Of course the teacher, too, is driven more by internal motivation, like a desire to change their field, than by external motivation, like the university having a strategy to become best at teaching.

It also came out that in these environments, teaching was seen not just as very important, but as inseparable from research. In that way, they are also in line with Humboldt.

The ways of doing research-linked teaching can (of course, like everything else in this world) be described with a two-by-two diagram:

That is, teaching can be oriented towards research results or towards research processes, and it can be the teacher who tells or the student who does. Research linkage can thus consist of updating the material with the latest from the research front (or at least something more recent than what is in the textbook), revealing insider information about how research is done, letting students read and analyse primary research literature themselves, or letting students practise research methods. I particularly liked a phrase Göran Hartman used about assignments where there is no ready-made solution ”that you teachers are sitting there hoarding”.

Have I done any work with research linkage? Yes, mostly in the teacher-led, content-oriented corner. That is one of the fun things about making a new lecture: trying to find some research article to fit in. Imagine my delight when I saw a Tinbergen quote in the middle of a heavy Drosophila genetics study (Hoopfer et al. 2015) and found an excuse to include it in a guest lecture on behavioural genetics. It was a small example, but still.

I also learned a fine new term. I had no idea that the rather prosaic Swedish ”forskningsanknuten undervisning” (”research-linked teaching”) is called ”the research–teaching nexus” in English. Fancy.

Using R: from plyr to purrr, part 0 out of however many

This post is me thinking out loud about applying functions to vectors or lists and getting data frames back.

Using R is an ongoing process of finding nice ways to throw data frames, lists and model objects around. While tidyr has arrived at a comfortable way to reshape data frames with pivot_longer and pivot_wider, I don't always find the replacements for the good old plyr package as satisfying.

Here is an example of something I used to like to do with plyr. Don’t laugh!

Assume we have a number of text files, all in the same format, that we need to read and combine. This arises naturally if you run some kind of analysis where the dataset gets split into chunks, like in genetics, where chunks might be chromosomes.

library(plyr)
library(readr)

## Generate vector of file names
files <- paste("data/chromosome", 1:20, ".txt", sep = "")

genome <- ldply(files, read_tsv)

This gives us one big data frame, containing the rows from all those files.

If we want to move on from plyr, what are our options?

We can go old school with base R functions lapply and Reduce.


chromosomes <- lapply(files, read_tsv)
genome <- Reduce(rbind, chromosomes)

Here, we first let lapply read each file and store it in a list. Then we let Reduce fold the list with rbind, which binds the data frames in the list together, one below the other.

If that didn’t make sense, here it is again: lapply maps a function to each element of a vector or list, collecting the results in a list. Reduce folds the elements in a list together, using a function that takes in two arguments. The first argument will be the results it’s accumulated so far, and the second argument will be the next element of the list.

In the end, this leaves us, as with ldply, with one big data frame.
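As an aside, base R's do.call accomplishes the same fold in one step, handing the whole list to rbind at once. A self-contained toy example, with made-up stand-ins for the data frames read from files:

```r
## Toy stand-ins for data frames read from files
chromosomes <- list(data.frame(chr = 1, pos = 1:3),
                    data.frame(chr = 2, pos = 1:3))

## Pairwise fold, as above ...
genome_reduce <- Reduce(rbind, chromosomes)

## ... or the whole list at once
genome_docall <- do.call(rbind, chromosomes)
```

For long lists, do.call is usually faster than folding pairwise, because rbind gets to see all the pieces and allocate the result once.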

We can also use purrr's map_dfr. This seems to be the most elegant contemporary solution:


library(purrr)

genome <- map_dfr(files, read_tsv)

map_dfr, like good old ldply, will map over a vector or list and collect the resulting data frames. The ”r” in the name means that each new data frame is added as rows. There is also a ”c” version (map_dfc) for adding them as columns.
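To see the difference between the two suffixes, here is a toy example (assuming the purrr package is installed):

```r
library(purrr)

## A function that returns a small data frame
make_chunk <- function(i) data.frame(value = i * 1:2)

## ”r”: results stacked as rows (6 rows, 1 column)
rows <- map_dfr(1:3, make_chunk)

## ”c”: results bound as columns (2 rows, 3 columns,
## with the duplicated names made unique)
cols <- map_dfc(1:3, make_chunk)
```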

Things that really don’t matter: megabase or megabasepair

Should we talk about physical distance in genetics as number of base pairs (kbp, Mbp, and so on) or bases (kb, Mb)?

I got into a discussion about this recently, and I said I’d continue the struggle on my blog. Here it is. Let me first say that I don’t think this matters at all, and if you make a big deal out of this (or whether ”data” can be singular, or any of those inconsequential matters of taste we argue about for amusement), you shouldn’t. See this blog post as an exorcism, helping me not to trouble my colleagues with my issues.

What I’m objecting to is mostly the inconsistency of talking about long stretches of nucleotides as ”kilobase” and ”megabase” but talking about short stretches as ”base pairs”. I don’t think it’s very common to call a 100 nucleotide stretch ”a 100 b sequence”; I would expect ”100 bp”. For example, if we look at Ensembl, they might describe a large region as 1 Mb, but if you zoom in a lot, they give length in bp. My impression is that this is a common practice. However, if you consistently use ”bases” and ”megabase”, more power to you.

Unless you’re writing a very specific kind of bioinformatics paper, the risk of confusion with the computer storage unit isn’t a problem. But there are some biological arguments.

A biological argument for ”base” might be that we care about the identity of the base, not the base pairing. We note only one nucleotide down when we write a nucleic acid sequence. The base pair is a different thing: the base bound to its partner on the other strand, and if the DNA or RNA is single-stranded, the base isn't paired at all.

Conversely, a biochemical argument for ”base pair” might be that in a double-stranded molecule, the base pair is the relevant informational unit. We may only write one base in our nucleotide sequence for convenience, but because of the rules of base pairing, we know the complementing pair. In this case, maybe we should use ”base” for single-stranded molecules.

If we consult two more or less trustworthy sources, The Encylopedia of Life Sciences and Wiktionary, they both seem to take this view.

eLS says:

A megabase pair, abbreviated Mbp, is a unit of length of nucleic acids, equal to one million base pairs. The term ‘megabase' (or Mb) is commonly used interchangeably, although strictly this would refer to a single-stranded nucleic acid.

Wiktionary says:

A length of nucleic acid containing one million nucleotides (bases if single-stranded, base pairs if double-stranded)

Please return next week for the correct pronunciation of ”loci”.


Dear, P.H. (2006). Megabase Pair (Mbp). eLS.