# Using R: Correlation heatmap with ggplot2

(This post was originally written on 2013-03-23. Since then, it has persistently remained one of my most visited posts, and I’ve decided to revisit and update it. I may do the same to some other old R-related posts that people still arrive on through search engines. There was also this follow-up, which I’ve now incorporated here.)

Just a short post to celebrate when I learned how incredibly easy it is to make a heatmap of correlations with ggplot2 (with some appropriate data preparation, of course). Here is a minimal example using the reshape2 package for preparation and the built-in attitude dataset:

```
library(ggplot2)
library(reshape2)

qplot(x = Var1, y = Var2,
      data = melt(cor(attitude)),
      fill = value,
      geom = "tile")
```

What is going on in that short passage?

- cor makes a correlation matrix with all the pairwise correlations between variables (each appearing twice, plus a diagonal of ones).
- melt takes the matrix and creates a data frame in long form, each row consisting of the id variables Var1 and Var2 and a single value.
- We then plot with the tile geometry, mapping the id variables to rows and columns, and value (i.e. the correlations) to the fill colour.

However, there is one more thing that is really needed, even if just for the first quick plot one makes for oneself: a better scale. The default scale is not the best for correlations, which range from -1 to 1, because it’s hard to tell where zero is. Let’s use the airquality dataset for illustration, as it actually has some negative correlations. In ggplot2, a scale that has a midpoint and a different colour in each direction is called scale_fill_gradient2 (for the fill aesthetic we’re using here; scale_colour_gradient2 is its counterpart for colour), and we just need to add it. I also set the limits to -1 and 1, which doesn’t change the colours but fills out the legend for completeness. Done!

```
data <- airquality[, 1:4]

qplot(x = Var1, y = Var2,
      data = melt(cor(data, use = "p")),
      fill = value,
      geom = "tile") +
  scale_fill_gradient2(limits = c(-1, 1))
```

Finally, if you’re anything like me, you may be phasing out reshape2 in favour of tidyr. If so, you’ll need another function call to turn the matrix into a data frame, like so:

```
library(tidyr)

correlations <- data.frame(cor(data, use = "p"))
correlations$Var1 <- rownames(correlations)
melted <- gather(correlations, "Var2", "value", -Var1)

qplot(x = Var1, y = Var2,
      data = melted,
      fill = value,
      geom = "tile") +
  scale_fill_gradient2(limits = c(-1, 1))
```

The data preparation is no longer a one-liner, but, honestly, it probably shouldn’t be.

Okay, you won’t stop reading until we’ve made a solution with pipes? Sure, we can do that! It will be pretty gratuitous and messy, though. From the top!

```
library(magrittr)

airquality %>%
  '['(1:4) %>%
  cor(use = "p") %>%
  data.frame %>%
  transform(Var1 = rownames(.)) %>%
  gather("Var2", "value", -Var1) %>%
  ggplot() +
    geom_tile(aes(x = Var1,
                  y = Var2,
                  fill = value)) +
    scale_fill_gradient2(limits = c(-1, 1))
```

# Jennifer Doudna & Samuel Sternberg "A Crack in Creation"

While the blog is on a relaxed summer schedule, you can read my book review of Jennifer Doudna’s and Samuel Sternberg’s CRISPR-related autobiography and discussion of the future of genome editing in University of Edinburgh’s Science Magazine EUSci issue #24.

The book is called A Crack in Creation, subtitled The New Power to Control Evolution, or depending on edition, Gene Editing and the Unthinkable Power to Control Evolution. There are a couple of dramatic titles for you. The book starts out similarly dramatic, with Jennifer Doudna dreaming of a beach in Hawaii, where she comes from, imagining a wave rising from the ocean to crash down on her. The wave is CRISPR/Cas9 genome editing, the technological force of nature that Doudna and colleagues have let loose on the world.

I like the book, talk about some notable omissions, and take issue with the bombastic and inaccurate title(s). Read the rest here.

- Pivotal CRISPR patent battle won by Broad Institute. Nature News. 2018.
- Sharon Begley & Andrew Joseph. The CRISPR shocker: How genome-editing scientist He Jiankui rose from obscurity to stun the world. Stat News. 2018.
- A Crack in Creation. The book’s website.

# ‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

A few weeks ago, I gave a talk at the Edinburgh R users group EdinbR on the RAGE paper. Since this is an R meetup, the talk concentrated on the mechanics of genetic data simulation, with the paper as a case study. I showed off some of what Chris Gaynor’s AlphaSimR can do, and how we built on that to make the specifics of this simulation study. The slides are on the EdinbR Github.

Genetic simulation is useful for all kinds of things. Sure, simulations are only as good as the theory that underpins them, but the willingness to try things out in simulation is one of the things I’ve always liked about breeding research.

This is my description of the logic of genetic simulation: we think of the genome as a large table of genotypes, drawn from some distribution of allele frequencies.

To make an utterly minimal simulation, we could draw allele frequencies from some distribution (like a Beta distribution), and then draw the genotypes from a binomial distribution. Done!
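In R, that utterly minimal simulation really is only a few lines. The numbers of individuals and loci and the Beta parameters below are arbitrary choices for the sketch:

```
## Minimal genetic simulation: allele frequencies from a Beta
## distribution, genotypes (0/1/2 allele counts) from a binomial
set.seed(1)
n_ind <- 1000
n_loci <- 5000

freq <- rbeta(n_loci, shape1 = 1/2, shape2 = 1/2)

## Column j holds the genotypes at locus j, drawn with frequency freq[j]
genotypes <- matrix(rbinom(n_ind * n_loci, size = 2,
                           prob = rep(freq, each = n_ind)),
                    nrow = n_ind, ncol = n_loci)
```

That’s a whole population in one matrix, individuals in rows and variants in columns.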

However, there is a ton of nuance we would like to have: chromosomes, linkage between variants, sexes, mating, selection …

AlphaSimR addresses all of this, and allows you to throw individuals and populations around to build pretty complicated designs. Here is the small example simulation I used in the talk.

```
library(AlphaSimR)
library(ggplot2)

## Generate founder chromosomes

FOUNDERPOP <- runMacs(nInd = 1000,
                      nChr = 10,
                      segSites = 5000,
                      inbred = FALSE,
                      species = "GENERIC")

## Simulation parameters

SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100, ## the number of QTLs per chromosome is an assumption; the original value was lost
                   mean = 100,
                   var = 10)
SIMPARAM$setGender("yes_sys")

## Founding population

pop <- newPop(FOUNDERPOP,
              simParam = SIMPARAM)

pop <- setPheno(pop,
                varE = 20,
                simParam = SIMPARAM)

## Breeding

print("Breeding")
breeding <- vector(length = 11, mode = "list")
breeding[[1]] <- pop

for (i in 2:11) {
    print(i)
    sires <- selectInd(pop = breeding[[i - 1]],
                       gender = "M",
                       nInd = 25,
                       trait = 1,
                       use = "pheno",
                       simParam = SIMPARAM)

    dams <- selectInd(pop = breeding[[i - 1]],
                      nInd = 500,
                      gender = "F",
                      trait = 1,
                      use = "pheno",
                      simParam = SIMPARAM)

    breeding[[i]] <- randCross2(males = sires,
                                females = dams,
                                nCrosses = 500,
                                nProgeny = 10,
                                simParam = SIMPARAM)

    breeding[[i]] <- setPheno(breeding[[i]],
                              varE = 20,
                              simParam = SIMPARAM)
}

## Look at genetic gain and shift in causative variant allele frequency

mean_g <- unlist(lapply(breeding, meanG))
sd_g <- sqrt(unlist(lapply(breeding, varG)))

plot_gain <- qplot(x = 1:11,
                   y = mean_g,
                   ymin = mean_g - sd_g,
                   ymax = mean_g + sd_g,
                   geom = "pointrange",
                   main = "Genetic mean and standard deviation",
                   xlab = "Generation",
                   ylab = "Genetic mean")

start_geno <- pullQtlGeno(breeding[[1]], simParam = SIMPARAM)
start_freq <- colSums(start_geno)/(2 * nrow(start_geno))

end_geno <- pullQtlGeno(breeding[[11]], simParam = SIMPARAM)
end_freq <- colSums(end_geno)/(2 * nrow(end_geno))

plot_freq_before <- qplot(start_freq, main = "Causative variant frequency before")
plot_freq_after <- qplot(end_freq, main = "Causative variant frequency after")
```

This code builds a small livestock population, breeds it for ten generations, and looks at the resulting selection response in the form of a shift of the genetic mean, and the changes in the underlying distribution of causative variants. Here are the resulting plots:

# ‘Approaches to genetics for livestock research’ at IASH, University of Edinburgh

A couple of weeks ago, I was at a symposium on the history of genetics in animal breeding at the Institute of Advanced Studies in the Humanities, organized by Cheryl Lancaster. There were talks by two geneticists and two historians, and ample time for discussion.

First geneticists:

Gregor Gorjanc presented the very essence of quantitative genetics: the pedigree-based model. He illustrated this with graphs (in the sense of edges and vertices) and by predicting his own breeding value for height from trait values, and from his personal genomics results.

Then, yours truly gave this talk: ‘Genomics in animal breeding from the perspectives of matrices and molecules’. Here are the slides (only slightly mangled by Slideshare). This is the talk I was preparing for when I collected the quotes I posted a couple of weeks ago.

I talked about how there are two perspectives on genomics: you can think of genomes either as large matrices of ancestry indicators (statistical perspective) or as long strings of bases (sequence perspective). Both are useful, and give animal breeders and breeding researchers different tools (genomic selection, reference genomes). I also talked about potential future breeding strategies that use causative variants, and how they’re not about stopping breeding and designing the perfect animal in a lab, but about supplementing genomic selection in different ways.

Then, historians:

Cheryl Lancaster told the story of how ABGRO, the Animal Breeding and Genetics Research Organisation in Edinburgh, lost its G. The organisation was split up in the 1950s, separating fundamental genetics research and animal breeding. She said that she had expected this split to be due to scientific, methodological or conceptual differences, but instead found, when going through the archives, that it was all due to personal conflicts. She also got into how the ABGRO researchers justified their work, framing it as "fundamental research", and aspired to do long-term research projects.

Jim Lowe talked about the pig genome sequencing and mapping efforts, how they differed from the human genome project in organisation, and how they relied heavily on comparisons to the human genome. He showed a photo of Alan Archibald using the gEVAL genome browser to quality-check the pig genome. He also argued that the infrastructural outcomes of a project like the human genome project, such as making it possible for pig genome scientists to use the human genome for comparisons, are more important and less predictable than usually assumed.

The discussion included comments by some of the people who were there (Chris Haley, Bill Hill), discussion about the breed concept, and what scientists can learn from history.

What is a breed? Is it a genetic thing, defined by grouping individuals based on their relatedness; a historical thing, based on what people think a certain kind of animal is supposed to look like; or a marketing tool, naming animals that come from a certain system? It is probably a bit of everything. (I talked with Jim Lowe during lunch; he had noticed how I referred to Griffiths & Stotz for gene concepts, but omitted the "post-genomic" gene concept they actually favour. This is because I didn’t find it useful for understanding how animal breeding researchers think. It is striking how comfortable biologists are with using fuzzy concepts that can’t be defined in a way that covers all corner cases, because biology doesn’t work that way. If the nominal gene concept is broken by trans-splicing, practicing genomicists will probably think of that more as a practical issue with designing gene databases than as something that invalidates talking about genes in principle.)

What would researchers like to learn from history? Probably how to succeed with large research endeavors and how to get funding for them. Can one learn that from history? Maybe not, but there might be lessons about thinking of research as ”basic”, ”fundamental”, ”applied” etc, and about what the long term effects of research might be.

# What single step does with relationship

We had a journal club about the single step GBLUP method for genomic evaluation a few weeks ago. In this post, we’ll make a few graphs of how the single step method models relatedness between individuals.

Imagine you want to use genomic selection in a breeding program that already has a bunch of historical pedigree and trait information. You could use some so-called multistep evaluation that uses one model for the classical pedigree + trait quantitative genetics and one model for the genotype + trait genomic evaluation, and then mix the predictions from them together. Or you could use the single-step method, which combines pedigree, genotypes and traits into one model. It does this by combining the relationship estimates from pedigree and genotypes. That matrix can then go into your mixed model.

We’ll illustrate this with a tiny simulated population: five generations of 100 individuals per generation, where ten random pairings produce the next generation, with families of ten individuals. (The R code is on Github and uses AlphaSimR for simulation and AGHmatrix for matrices). Here is a heatmap of the pedigree-based additive relationship matrix for the population:

What do we see? In the lower left corner are the founders, and not knowing anything about their heritage, the matrix has them down as unrelated. The squares of high relatedness along the diagonal are the families in each generation. As we go upwards and to the right, relationship is building up.

Now, imagine the last generation of the population also has been genotyped with a SNP chip. Here is a heatmap of their genomic relationship matrix:

Genomic relationship is more detailed. We can still discern the ten families within the last generation, but no longer are all the siblings equally related to each other and to their ancestors. The genotyping helps track segregation within families, pointing out to us when relatives are more or less related than the average that we get from the pedigree.
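As an aside on where such a matrix comes from: one common choice is VanRaden’s (2008) first method, which centres the genotypes by twice the allele frequency and cross-multiplies. The post’s actual matrices were made with AGHmatrix, so this base R version on a simulated genotype matrix is just an illustrative sketch:

```
## Sketch of a genomic relationship matrix (VanRaden 2008, method 1).
## Rows are individuals, columns are variants coded 0/1/2; the genotypes
## here are simulated just for illustration.
set.seed(1)
geno <- matrix(rbinom(100 * 500, size = 2, prob = 0.5),
               nrow = 100, ncol = 500)

p <- colMeans(geno) / 2               # observed allele frequencies
Z <- sweep(geno, 2, 2 * p)            # centre by twice the frequency
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))
```

The diagonal averages roughly one, like the pedigree-based matrix for non-inbred individuals, and the off-diagonal elements estimate realized relationships between pairs of individuals.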

Enter the single-step relationship matrix. The idea is to put in the genomic relationships for the genotyped individuals into the big pedigree-based relationship matrix, and then adjust the rest of the matrix to propagate that extra information we now have from the genotyped individuals to their ungenotyped relatives. Here is the resulting heatmap:

You can find the matrix equations in Legarra, Aguilar & Misztal (2009). The matrix, called H, is broken down into four partitions called H11, H12, H21, and H22. H22 is the part that pertains to the genotyped animals, and it’s equal to the genomic relationship matrix G (after some rescaling). The others are transformations of G and the corresponding parts of the additive relationship matrix, spreading G onto A.
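To make the construction concrete, here is a minimal sketch of those equations in base R, assuming the individuals are ordered so that the ungenotyped animals form partition 1 and the genotyped animals partition 2, and leaving out the rescaling of G and the blending used in practice:

```
## Sketch of the single-step H matrix (Legarra, Aguilar & Misztal 2009).
## A: pedigree relationship matrix for all individuals.
## G: genomic relationship matrix for the genotyped individuals.
## genotyped: indices of the genotyped individuals in A.
single_step_H <- function(A, G, genotyped) {
    ungenotyped <- setdiff(seq_len(nrow(A)), genotyped)
    A11 <- A[ungenotyped, ungenotyped, drop = FALSE]
    A12 <- A[ungenotyped, genotyped, drop = FALSE]
    A21 <- A[genotyped, ungenotyped, drop = FALSE]
    A22 <- A[genotyped, genotyped, drop = FALSE]
    A22_inv <- solve(A22)

    ## H22 is G itself; the other partitions spread G onto A
    H11 <- A11 + A12 %*% A22_inv %*% (G - A22) %*% A22_inv %*% A21
    H12 <- A12 %*% A22_inv %*% G
    H21 <- G %*% A22_inv %*% A21

    rbind(cbind(H11, H12),
          cbind(H21, G))
}

## Tiny example: two unrelated founders and their genotyped offspring
A <- matrix(c(1,   0,   0.5,
              0,   1,   0.5,
              0.5, 0.5, 1),
            nrow = 3, byrow = TRUE)
G <- matrix(1.05)
H <- single_step_H(A, G, genotyped = 3)
```

In the tiny example, H22 equals G exactly, and the founders’ rows pick up part of the genomic information through the A12 A22⁻¹ G transformation.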

To show what is going on, here is the difference between the additive relationship matrix and the single-step relationship matrix, with lines delineating the genotyped animals and breaking the matrix into the four partitions:

What do we see? In the top right corner, we have a lot of difference, where the genomic relationship matrix has been plugged in. Then, fading as we go from top to bottom and from right to left, we see the influence of the genomic relationship on relatives, diminishing the further we get from the genotyped individuals.

Literature

Legarra, Andres, I. Aguilar, and I. Misztal. "A relationship matrix including full pedigree and genomic information." Journal of Dairy Science 92.9 (2009): 4656-4663.

# Excerpts about genomics in animal breeding

Here are some good quotes I’ve come across while working on something.

Artificial selection on the phenotypes of domesticated species has been practiced consciously or unconsciously for millennia, with dramatic results. Recently, advances in molecular genetic engineering have promised to revolutionize agricultural practices. There are, however, several reasons why molecular genetics can never replace traditional methods of agricultural improvement, but instead they should be integrated to obtain the maximum improvement in economic value of domesticated populations.

Lande R & Thompson R (1990) Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics.

Smith and Smith suggested that the way to proceed is to map QTL to low resolution using standard mapping methods and then to increase the resolution of the map in these regions in order to locate more closely linked markers. In fact, future developments should make this approach unnecessary and make possible high resolution maps of the whole genome, even, perhaps, to the level of the DNA sequence. In addition to easing the application of selection on loci with appreciable individual effects, we argue further that the level of genomic information available will have an impact on infinitesimal models. Relationship information derived from marker information will replace the standard relationship matrix; thus, the average relationship coefficients that this represents will be replaced by actual relationships. Ultimately, we can envisage that current models combining few selected QTL with selection on polygenic or infinitesimal effects will be replaced with a unified model in which different regions of the genome are given weights appropriate to the variance they explain.

Haley C & Visscher P. (1998) Strategies to utilize marker–quantitative trait loci associations. Journal of Dairy Science.

Instead, since the late 1990s, DNA marker genotypes were included into the conventional BLUP analyses following Fernando and Grossman (1989): add the marker genotype (0, 1, or 2, for an animal) as a fixed effect to the statistical model for a trait, obtain the BLUP solutions for the additive polygenic effect as before, and also obtain the properly adjusted BLUE solution for the marker’s allele substitution effect; multiply this BLUE by 0, 1, or 2 (specific for the animal) and add the result to the animal’s BLUP to obtain its marker-enhanced EBV. A logical next step was to treat the marker genotypes as semi-random effects, making use of several different shrinkage strategies all based on the marker heritability (e.g., Tsuruta et al., 2001); by 2007, breeding value estimation packages such as PEST (Neumaier and Groeneveld, 1998) supported this strategy as part of their internal calculations. At that time, a typical genetic evaluation run for a production trait would involve up to 30 markers.

Knol EF, Nielsen B, Knap PW. (2016) Genomic selection in commercial pig breeding. Animal Frontiers.

Although it has not caught the media and public imagination as much as transgenics and cloning, genomics will, I believe, have just as great a long-term impact. Because of the availability of information from genetically well-researched species (humans and mice), genomics in farm animals has been established in an atypical way. We can now see it as progressing in four phases: (i) making a broad sweep map (~20 cM) with both highly informative (microsatellite) and evolutionary conserved (gene) markers; (ii) using the informative markers to identify regions of chromosomes containing quantitative trait loci (QTL) controlling commercially important traits–this requires complex pedigrees or crosses between phenotypically and genetically divergent strains; (iii) progressing from the informative markers into the QTL and identifying trait gene(s) themselves either by complex pedigrees or back-crossing experiments, and/or using the conserved markers to identify candidate genes from their position in the gene-rich species; (iv) functional analysis of the trait genes to link the genome through physiology to the trait–the ‘phenotype gap’.

Bulfield G. (2000) Biotechnology: advances and impact. Journal of the Science of Food and Agriculture.

I believe animal breeding in the post-genomic era will be dramatically different to what it is today. There will be massive research effort to discover the function of genes including the effect of DNA polymorphisms on phenotype. Breeding programmes will utilize a large number of DNA-based tests for specific genes combined with new reproductive techniques and transgenes to increase the rate of genetic improvement and to produce for, or allocate animals to, the product line to which they are best suited. However, this stage will not be reached for some years by which time many of the early investors will have given up, disappointed with the early benefits.

Goddard M. (2003). Animal breeding in the (post-) genomic era. Animal Science.

Genetics is a quantitative subject. It deals with ratios, with measurements, and with the geometrical relationships of chromosomes. Unlike most sciences that are based largely on mathematical techniques, it makes use of its own system of units. Physics, chemistry, astronomy, and physiology all deal with atoms, molecules, electrons, centimeters, seconds, grams–their measuring systems are all reducible to these common units. Genetics has none of these as a recognizable component in its fundamental units, yet it is a mathematically formulated subject that is logically complete and self-contained.

Sturtevant AH & Beadle GW. (1939) An introduction to genetics. W.B. Saunders company, Philadelphia & London.

We begin by asking why genes on nonhomologous chromosomes assort independently. The simple cytological story rehearsed above answers the questions. That story generates further questions. For example, we might ask why nonhomologous chromosomes are distributed independently at meiosis. To answer this question we could describe the formation of the spindle and the migration of chromosomes to the poles of the spindle just before meiotic division. Once again, the narrative would generate yet further questions. Why do the chromosomes ”condense” at prophase? How is the spindle formed? Perhaps in answering these questions we would begin to introduce the chemical details of the process. Yet simply plugging a molecular account into the narratives offered at the previous stages would decrease the explanatory power of those narratives.

Kitcher, P. (1984) 1953 and all that. A tale of two sciences. Philosophical Review.

And, of course, this great quote by Jay Lush.

# There is no breeder’s equation for environmental change

This post is about why heritability coefficients of human traits can’t tell us what to do. Yes, it is pretty much an elaborate subtweet.

Let us begin in a different place, where heritability coefficients are useful, if only a little. Imagine there is selection going on. It can be natural or artificial, but it’s selection the old-fashioned way: there is some trait of an individual that makes it more or less likely to successfully reproduce. We’re looking at one generation of selection: there is one parent generation, some of which reproduce and give rise to the offspring generation.

Then, if we have a well-behaved quantitative trait and no systematic difference between the environments that the two generations experience (also, no previous selection; this is one reason I said ‘if only a little’), we can get an estimate of the response to selection, that is, how the mean of the trait will change between the generations:

$R = h^2S$

R is the response. S, the selection differential, is the difference between the mean of the selected parents and the mean of the whole parental generation, and thus measures the strength of the selection. h² is the infamous heritability, which measures the accuracy of the selection.

That is, the heritability coefficient tells you how well the selection of parents reflect the offspring traits. A heritability coefficient of 1 would mean that selection is perfect; you can just look at the parental individuals, pick the ones you like, and get the whole selection differential as a response. A heritability coefficient of 0 means that looking at the parents tells you nothing about what their offspring will be like, and selection thus does nothing.

Conceptually, the power of the breeder’s equation comes from the mathematical properties of selection, and the quantitative genetic assumptions of a linear parent–offspring relationship. (If you’re a true connoisseur of theoretical genetics or a glutton for punishment, you can derive it from the Price equation; see Walsh & Lynch (2018).) It allows you to look (one generation at a time) into the future only because we understand what selection does and assume reasonable things about inheritance.
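As a sanity check, we can watch the breeder’s equation work in a toy simulation of one generation of truncation selection on an additive trait. The population size, heritability and selected proportion below are made-up numbers for the sketch:

```
## Toy check of the breeder's equation R = h^2 * S:
## one generation of truncation selection on an additive trait
set.seed(1)
n <- 100000
h2 <- 0.4

## Phenotype = breeding value + environmental deviation, total variance 1
breeding_value <- rnorm(n, mean = 0, sd = sqrt(h2))
phenotype <- breeding_value + rnorm(n, mean = 0, sd = sqrt(1 - h2))

## Select the top 20% on phenotype
selected <- phenotype >= quantile(phenotype, 0.8)

## Selection differential: selected parents versus the whole generation
S <- mean(phenotype[selected]) - mean(phenotype)

## With additive inheritance and random mating, the expected offspring
## mean is the selected parents' mean breeding value
R_observed <- mean(breeding_value[selected]) - mean(breeding_value)
R_predicted <- h2 * S
```

Running this, R_observed and R_predicted agree to within sampling error, which is the breeder’s equation doing its one-generation look into the future.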

We don’t have that kind of machinery for environmental change.

Now, another way to phrase the meaning of the heritability coefficient is that it is a ratio of variances, namely the additive genetic variance (which measures the trait variation that runs in families) divided by the total variance (which measures the total variation in the population, duh). This is equally valid, more confusing, and also more relevant when we’re talking about something like a population of humans, where no breeding program is going on.

Thus, the heritability coefficient is telling us, in a specific highly geeky sense, how much of trait variation is due to inheritance. Anything we can measure about a population will have a heritability coefficient associated with it. What does this tell us? Say, if drug-related crime has yay big heritability, does that tell us anything about preventing drug-related crime? If heritability is high, does that mean interventions are useless?

The answers should be evident from the way I phrased those rhetorical questions and from the above discussion: There is no theoretical genetics machinery that allows us to predict the future if the environment changes. We are not doing selection on environments, so the mathematics of selection don’t help us. Environments are not inherited according to the rules of quantitative genetics. Nothing prevents a trait from being eminently heritable and still responding even more strongly to changes in the environment, or vice versa.

(There is also the argument that quantitative genetic modelling of human traits matters because it helps control for genetic differences when estimating other factors. One has to be more sympathetic towards that, because who can argue against accurate measurement? But ought implies can. For quantitative genetic models to be better, they need to solve the problems of identifying variance components and overcoming population stratification.)

Much criticism of heritability in humans concern estimation problems. These criticisms may be valid (estimation is hard) or silly (of course, lots of human traits have substantial heritabilities), but I think they miss the point. Even if accurately estimated, heritabilities don’t do us much good. They don’t help us with the genetic component, because we’re not doing breeding. They don’t help us with the environmental component, because there is no breeder’s equation for environmental change.