On DNA day: DNA metaphors

There are different metaphors for deoxyribonucleic acid and what it means to us. DNA can be a blueprint, a recipe, a program, or writing.

It is almost impossible to say anything about molecular genetics without metaphors. Quantitative genetics is a little easier, at least until the statistical models and calculations come out. Quantitative genetics deals with things everyone can see in everyday life, like family resemblance and kinship. Molecular genetics deals with things that, while they exist in the public consciousness, are not visible around us.

But metaphors can be unhelpful and lead thought astray. The image of DNA as a blueprint of the organism can seem too simple and suggest genetic determinism. Now, even though I am supposed to be an engineer, I don't know much about technical drawings. In several ways, the metaphor is not so bad: a drawing represents the thing to be built in a specialised visual language, in a lower dimension. A house is in 3D, but a drawing is in 2D. Proteins are three-dimensional; the genetic code describes them in one dimension. But it may be true that the word ”drawing” (or ”blueprint”) suggests something too exact and too pictorial.

An alternative is that DNA is a recipe (many have suggested this, among them Richard Dawkins in The Blind Watchmaker, 1986). The recipe has the advantage that it describes a process, with both ingredients and instructions. That is a bit like the development of an organism from fertilised egg to adult. ”Add maternal bicoid at one end and nanos at the other; let the proteins mix freely”, and so on (Gilbert 2000). Another advantage is that it naturally reminds us that DNA is not everything. The same recipe, with local differences in ingredients and improvisations from the cook, turns into different dishes. On the other hand, the recipe exaggerates what is in DNA. Which genes are expressed where and when is an interplay between DNA and the proteins and RNA already present in a cell at a given time.

Or DNA is a program. Programs are instructions too, so on that point they have the same advantages and disadvantages as the recipe. On the other hand, programs are abstract, free from concrete ingredients and associations with cooking. A bit like a drawing, it sounds mechanical and exact. Clearly, it also matters what DNA is supposed to be a drawing of or a recipe for. There is some difference between calling DNA a blueprint of proteins and a recipe for an organism.

Finally, there are metaphors written into the terminology itself. When geneticists talk about DNA, how it is passed on and used, we talk about it as written language. It is called copying when DNA is reproduced before cell division. It is called transcription, that is, copying with a connotation of transfer to another form or medium, when RNA is produced from DNA. It is called translation when RNA in turn serves as template for protein synthesis. On top of it all, we write DNA with an alphabet of four letters: A, C, T, G. It is an image so fitting that it is almost true.
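The writing metaphor is so ingrained that it fits in a few lines of code. A toy sketch in R, using a made-up three-codon excerpt of the genetic code (the real table has 64 entries):

# Transcription: copy the coding strand, swapping T for U.
dna <- "ATGGCTTGA"
rna <- chartr("T", "U", dna)

# Translation: read the RNA three letters at a time and look each codon up.
codons <- substring(rna, seq(1, nchar(rna) - 2, by = 3),
                         seq(3, nchar(rna), by = 3))
genetic_code <- c(AUG = "Met", GCU = "Ala", UGA = "Stop")  # tiny excerpt
genetic_code[codons]
#   AUG   GCU   UGA
# "Met" "Ala" "Stop"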

(On 25 April 1953, the papers presenting the structure of the DNA molecule were published. Hence DNA day. Old DNA day posts: Genetik utan dna (2016), Gener, orsak och verkan (2015), På dna-dagen (2014).)

NASA and Orphan Black

A few months ago I wrote a post about the (fictitious, and also evil) clone experiment in Orphan Black. I said that comparison of complex traits between a handful of individuals isn’t, even in principle, a ”scientifically beautiful setup to learn myriad things”, but garbage. You can’t take two humans, even if they’re clones, put them in different environments, and expect to learn much of anything.

Funnily enough, it seems like NASA has been doing just that with the NASA twin study: there are two astronauts who are twins, and researchers have compared various things between them and before/after one of them went to space. Of course, those various things include headline-attracting assays like telomere length and DNA methylation (including ”epigenetic age” — something like Horvath 2013, I assume).

The news coverage has been confused — mixing up DNA methylation, gene expression and mutation. But can one blame news outlets for reporting about ”7% changes to his DNA” and ”space genes” when the press release said this:

Another interesting finding concerned what some call the “space gene”, which was alluded to in 2017. Researchers now know that 93% of Scott’s genes returned to normal after landing. However, the remaining 7% point to possible longer term changes in genes related to his immune system, DNA repair, bone formation networks, hypoxia, and hypercapnia.

Someone who knows some biology can guess that this doesn’t refer to mutation, but it’s not making things easy for the reader, and when put like that, the 7% could be DNA methylation, gene expression, or something else transient and genomic. (They’ve since clarified that it was gene expression — in some sample; my bet is on white blood cells.)

Now that we’ve made fun of NASA a little, there are some circumstances when we can learn useful things from studies of even a single individual. For example, if Chaser the Border Collie can learn the names of 1000 toys, and learn new toy names through reasoning by exclusion (Pilley & Reid 2011), then we can safely assume that this is within the realm of dog abilities. Another example is a reference genome, which in the best case is made from a single individual, ideally an individual who is as homozygous as possible. When comparing the reference genome to that of other species, we feel confident enough to publish genome papers with comparisons of gene content, gene family evolution, and selection on protein coding sequences over evolutionary timescales. But when it comes to functional genomics, many variable molecular trait measurements all along the genome? No.

The study is not out yet. It may be better than the advertisement. It seems they've compared the two men before and after, so they can get some handle on differences that came about in the years leading up to the study. And maybe they've run a crazy number of technical replicates to make sure that the value they get from each data point is as good a measurement as possible. And maybe there is data on what happens with these kinds of assays when people do other strenuous things, putting the differences into context. Maybe.

Literature

Pilley, John W., and Alliston K. Reid. ”Border collie comprehends object names as verbal referents.” Behavioural processes 86.2 (2011): 184-195.

Horvath, Steve. ”DNA methylation age of human tissues and cell types.” Genome biology 14.10 (2013): 3156.

Selected, causal, and relevant

What is ”function”? In discussions about junk DNA people often make the distinction between ”selected effects” and ”causal roles”. Doolittle & Brunet (2017) put it like this:

By the first (selected effect, or SE), the function(s) of trait T is that (those) of its effects E that was (were) selected for in previous generations. They explain why T is there. … [A]ny claim for an SE trait has an etiological justification, invoking a history of selection for its current effect.

/…/

ENCODE assumed that measurable effects of various kinds—being transcribed, having putative transcription factor binding sites, exhibiting (as chromatin) DNase hypersensitivity or histone modifications, being methylated or interacting three-dimensionally with other sites — are functions prima facie, thus embracing the second sort of definition of function, which philosophers call causal role …

In other words, their argument goes: a DNA sequence can be without a selected effect while it has, potentially several, causal roles. Therefore, junk DNA isn’t dead.

Two things about these ideas:

First, if we want to know the fraction of the genome that is functional, we'd like to talk about positions in some reference genome, but the selected effect definition really only works for alleles. Positions aren't adaptive, but alleles can be. They use the word ”trait”, but we can think of an allele as a trait (with really simple genetics — its genetic basis is its presence or absence in the genome).

Also, unfortunately for us, selection doesn’t act on alleles in isolation; there is linked selection, where alleles can be affected by selection without causally contributing anything to the adaptive trait. In fact, they may counteract the adaptive trait. It stands to reason that linked variants are not functional in the selected effect sense, but they complicate analysis of recent adaptation.
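To make linked selection concrete, here is a minimal sketch in R, under toy assumptions: haploid Wright–Fisher reproduction, complete linkage, and a neutral allele B that happens to start out on the same haplotype as a beneficial allele A.

set.seed(1)
N <- 1000                                        # population size
s <- 0.05                                        # selection coefficient for A
haplotypes <- c(rep("AB", 50), rep("ab", 950))   # B starts on the A background

freq_B <- numeric(100)
for (generation in 1:100) {
  fitness <- ifelse(grepl("A", haplotypes), 1 + s, 1)
  haplotypes <- sample(haplotypes, N, replace = TRUE, prob = fitness)
  freq_B[generation] <- mean(grepl("B", haplotypes))
}

plot(freq_B, type = "l", xlab = "generation",
     ylab = "frequency of the neutral B allele")

B rises towards fixation without contributing anything to fitness; a selected effect account would not call it functional.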

The authors note that there is a problem with alleles that have not seen positive selection, but only purifying selection (which could happen in constructive neutral evolution, where something becomes indispensable through a series of neutral or deleterious substitutions). Imagine a sequence where most mutations are neutral, but deleterious mutations can happen rarely. A realistic example could be the causal mutation for Friedreich's ataxia: microsatellite repeats in an intron that occasionally expand enough to prevent transcription (Bidichandani et al. 1998, Ohshima et al. 1998; I recently read about it in Nessa Carey's ”Junk DNA”). In such cases, selection does not preserve any function of the microsatellite. That a thing can break in a dangerous way is not enough to know that it was useful when whole.

Second, these distinctions may be relevant to the junk DNA debate, but for research into the genetic basis of traits, current or future, such as medical genetics or breeding, neither of these perspectives is what we need. The question is not what parts of the genome come from adaptive alleles, nor what parts of the genome have causal roles. The question is what parts of the genome have causal roles that are relevant to the traits we care about.

The same example is relevant. It seems like the Friedreich's ataxia-associated microsatellite does not fulfill the selected effect criterion. It does, however, have a causal role, and a causal role relevant to human disease, at that.

I do not dare to guess whether the set of sequences with causal roles relevant to human health is bigger or smaller than the set of sequences with selected effects. But they are not identical. And I will dare to guess that the relevant set, like the selected effect set, is a small fraction of the genome.

Literature

Doolittle, W. Ford, and Tyler DP Brunet. ”On causal roles and selected effects: our genome is mostly junk.” BMC biology 15.1 (2017): 116.

Nessa Carey ”Junk DNA”

I read two popular science books over Christmas. The other one was in Swedish, so I’ll do that in Swedish.

Nessa Carey’s ”Junk DNA: A Journey Through the Dark Matter of the Genome” is about noncoding DNA in the human genome. ”Coding” in this context means serving as a template for proteins. ”Noncoding” is all the rest of the genome, 98% or so.

The book is full of fun molecular genetics: X-inactivation, a rather in-depth discussion of telomeres and centromeres, the mechanism of noncoding microsatellite disease mutations, splicing — some of which is rarely discussed at such length and with such clarity. It gives the reader a good look at how messy genomics can be. It has wonderful metaphors — two baseball bats with magnetic paint and velcro, for example. It even has an amusing account of the ENCODE debate. I wonder if it’s true that evolutionary biologists are more emotional than other biologists?

But it really suffers from the framing as a story about how noncoding DNA used to be dismissed as pointless, and now, surprisingly, turns out to have regulatory functions. This makes me a bit hesitant to recommend the book; you may come away from reading it with a lot of neat details, but misled about the big picture. In particular, you may come away believing a false history (”all this was thought to be junk; look how wrong they were in the 70s”) and the very dubious view that most of the human genome is important for our health.

On the first page of the book, junk DNA is defined like this:

Anything that doesn’t code for protein will be described as junk, as it originally was in the old days (second half of the twentieth century). Purists will scream, and that’s OK.

We should scream, or at least shake our heads, because this definition leads, for example, to describing ribosomes and transfer RNA as ”junk” (chapter 11), even though both have been known to be noncoding and functional since at least the 60s. I guess the term ”junk” sticks, and that is why the book uses it, and why biologists love to argue about it. You couldn’t call the book something unspeakably dry like ”Noncoding DNA”.

So, this is a fun popular science book about genomics. Read it, but keep in mind that if you want to define ”junk DNA” for any other purpose than to immediately shoot it down, it should be something like this:

For most of the 50 years since Ohno’s article, many of us accepted that most of our genome is ”junk”, by which we would loosely have meant DNA that is neither protein-coding nor involved in regulating the expression of DNA that is. (Doolittle & Brunet 2017)

The point of the term is not to dismiss everything that is not coding for a protein. The point is that the bulk of DNA in the genome is neither protein coding nor regulatory. This is part of why molecular genetics is so tricky: it is hard to find the important parts among all the rest. Researchers have become much better at sifting through the noncoding parts of the genome to find the sequences that are interesting and useful. Think of lots of tricky puzzles being solved, rather than of a paradigm being overthrown by revolution.

Literature

Carey, Nessa. (2015) Junk DNA: A Journey Through the Dark Matter of the Genome. Icon Books, London.

Doolittle, W. Ford, and Tyler DP Brunet. (2017) ”On causal roles and selected effects: our genome is mostly junk.” BMC Biology.

Griffin & Nesseth ”The science of Orphan Black: the official companion”

I didn’t know that the science fiction series Orphan Black actually had a real Cosima: Cosima Herter, science consultant. After reading this interview and finishing season 5, I realised that there is also a new book I needed to read: The science of Orphan Black: The official companion, by Casey Griffin, a PhD candidate in development, stem cells and regenerative medicine, and Nina Nesseth, a science communicator, with a foreword by Cosima Herter.

(Warning: This post contains serious spoilers for Orphan Black, and a conceptual spoiler for GATTACA.)

One thing about science fiction struck me when I was watching the last episodes of Orphan Black: sometimes it makes a lot more sense if we don’t believe everything the fictional scientists tell us. Like real scientists, they may be wrong, or they may be exaggerating. The genetically segregated future of GATTACA becomes no less chilling when you realise that the implausibly high predictive accuracies claimed are likely just propaganda from an oppressive society. And as you realise that the dying P.T. Westmorland is an imposter, you can break your suspension of disbelief about LIN28A as a fountain-of-youth gene … Of course, genetics is a little more complicated than that, and he is just another rich dude who wants science to make him live forever.

However, it wouldn’t be Orphan Black if there weren’t a basis in reality: there are several single-gene mutations in model animals (e.g. Kenyon & al 1993) that can make them live a lot longer than normal, and LIN28A is involved in ageing (reviewed by Jun-Hao & al 2016). It’s not out of the question that an engineered single-gene disruption could substantially increase longevity in humans. Not practical, and not necessarily without unpleasant side effects, but not out of the question.

Orphan Black was part slightly scary adventure, part festival of ideas about science and society, part character-driven web of relationships, and part, sadly, bricolage of clichés. I found when watching season five that I’d forgotten most of the plots of seasons two through four, and I will probably never make the effort to sit through them again. The first and last seasons make up for it, though.

The series seems to have been set on squeezing as many different biological concepts as possible in there, so the book has to try to do the same. It has not just clones and transgenes, but also gene therapy, stem cells, prion disease, telomeres, dopamine, ancient DNA, stem cells in cosmetics and so on. Two chapters try valiantly to make sense of the clone disease and the cure. It shows that the authors have encyclopedic knowledge of life science, with a special interest in development and stem cells.

But I think they slightly oversell how accurate the show is. Like when Cosima tells Scott to ”run a PCR on these samples, see if there are any genetic markers” and ”can you sequence for cytochrome c?”, and Scott replies ”the barcode gene? that’s the one we use for species differentiation” … That’s what screen science is like. The right words, but not always in the right order.

Cosima and Scott sciencing at university, before everything went pear-shaped. One of the good things about Orphan Black was the scientist characters. There were a ton of them! The good ones, geniuses with sparse resources and self experimentation; the evil ones, well funded and deeply unethical; and Delphine. This scene is an exception in that it plays the cringe-inducing nerd angle. Cosima and Scott grew after this.

There are some scientific oddities. They must be impossible to avoid. For example, the section on epigenetics treats it as a completely new field, sort of missing the history of the subfield. DNA methylation research was going on already in the 1970s (Gitschier 2009). Genomic imprinting, arguably the only solid example of transgenerational epigenetic effects in humans, and X inactivation were both being discovered during the 70s and 80s (reviewed by Ferguson-Smith 2011). The book also makes a hash of genome sequencing, which is a shame but understandable. It would have taken a lot of effort to disentangle how sequencing worked when the fictional clone experiment started and how it got to how it works in season five, when Cosima runs Nanopore sequencing.

The idea of human cloning is evocative. Orphan Black flipped it on its head by making the main clone characters strikingly different. It also cleverly acknowledged that human cloning is a somewhat dated 20th century idea, and that the cutting edge of life science has moved on. But I wish the book had been harder on the premise of the clone experiment:

By cloning the human genome and fostering a set of experimental subjects from birth, the scientists behind the project would gain many insights into the inner workings of the human body, from the relay of genetic code into observable traits (called phenotypes), to the viability of manipulated DNA as a potential therapeutic tool, to the effects of environmental factors on genetics. It’s a scientifically beautiful setup to learn myriad things about ourselves as humans, and the doctors at Dyad were quick to jump at that opportunity. (Chapter 1)

This is the very problem. Of course, sometimes ethically atrocious fictional science would, in principle, generate useful knowledge. But when fictional science is near useless, let’s not pretend that it would produce a lot of valuable knowledge. When it comes to genetics and complex traits like human health, small-sample studies of this kind (even with clones) would be utterly useless. Worse than useless: they would likely be biased and misleading.

Researchers still float the idea of a ”baseline”, though, but in the form of a cell line, where it makes more sense. See the (Human) Genome Project-write (Boeke & al 2016), which suggests the construction of an ideal baseline cell line for understanding human genome function:

Additional pilot projects being considered include … developing a homozygous reference genome bearing the most common pan-human allele (or allele ancestral to a given human population) at each position to develop cells powered by ”baseline” human genomes. Comparison with this baseline will aid in dissecting complex phenotypes, such as disease susceptibility.

In the end, the most important part of science in science fiction isn’t to be factually correct, nor to be a coherent prediction about the future. If Orphan Black has raised interest in science, and I’m sure it has, that is great. And if it has stimulated discussions about the relationship between biological science, culture and ethics, that is even better.

The timeline of when relevant scientific discoveries happened in the real world and in Orphan Black is great. The book has a partial bibliography. The ”Clone Club Q&A” boxes range from silly fun to great open questions.

Orphan Black was probably the best genetics TV show around, and this book is a wonderful companion piece.

Plaque at the Roslin Institute to the sheep that haunts Orphan Black. ”Baa.”

Literature

Boeke, JD et al (2016) The genome project-write. Science.

Ferguson-Smith, AC (2011) Genomic imprinting: the emergence of an epigenetic paradigm. Nature reviews Genetics.

Gitschier, J. (2009). On the track of DNA methylation: An interview with Adrian Bird. PLOS Genetics.

Jun-Hao, E. T., Gupta, R. R., & Shyh-Chang, N. (2016). Lin28 and let-7 in the Metabolic Physiology of Aging. Trends in Endocrinology & Metabolism.

Kenyon, C., Chang, J., Gensch, E., Rudner, A., & Tabtiang, R. (1993). A C. elegans mutant that lives twice as long as wild type. Nature, 366(6454), 461-464.

”These are all fairly obvious” (says Sewall Wright)

I was checking a quote from Sewall Wright, and it turned out that the whole passage was delightful. Here it is, from volume 1 of Evolution and the Genetics of Populations (pages 59-60):

There are a number of broad generalizations that follow from this netlike relationship between genome and complex characters. These are all fairly obvious but it may be well to state them explicitly.

1) The variations of most characters are affected by a great many loci (the multiple factor hypothesis).

2) In general, each gene replacement has effects on many characters (the principle of universal pleiotropy).

3) Each of the innumerable possible alleles at any locus has a unique array of differential effects on taking account of pleiotropy (uniqueness of alleles).

4) The dominance relation of two alleles is not an attribute of them but of the whole genome and of the environment. Dominance may differ for each pleiotropic effect and is in general easily modifiable (relativity of dominance).

5) The effects of multiple loci on a character in general involve much nonadditive interaction (universality of interaction effects).

6) Both ontogenetic and phylogenetic homology depend on calling into play similar chains of gene-controlled reactions under similar developmental conditions (homology).

7) The contributions of measurable characters to overall selective value usually involve interaction effects of the most extreme sort because of the usually intermediate position of the optimum grade, a situation that implies the existence of innumerable different selective peaks (multiple selective peaks).

What can we say about this?

It seems point one is true. People may argue about whether the variants behind complex traits are many, relatively common, with tiny individual effects or many, relatively rare, and with larger effects that average out to tiny effects when measured in the whole population. In any case, there are many causative variants, alright.

Point two — now also known as the omnigenic model — hinges on how you read ”in general”, I guess. In some sense, universal pleiotropy follows from genome crowding. If there are enough causative variants and a limited number of genes, eventually every gene will be associated with every trait.
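A back-of-envelope version of the crowding argument, with entirely made-up numbers: if causative variants land roughly uniformly among G genes, the chance that a given gene is missed by all V of them shrinks quickly as V grows.

G <- 20000                       # genes, order of magnitude for a human genome
V <- c(100, 1000, 10000, 100000) # hypothetical numbers of causative variants
p_hit <- 1 - (1 - 1/G)^V         # probability a given gene harbours at least one
round(p_hit, 3)
# 0.005 0.049 0.393 0.993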

I don’t think that point three is true. I would assume that many loss of function mutations to protein coding genes, for example, would be interchangeable.

I don’t really understand points six and seven, about homology and fitness landscapes, that well. The later section about homology reads to me as if it could be part of a debate going on at the time. Number seven describes Wright’s view of natural selection as a kind of fitness whack-a-mole, where if a genotype is fit in one dimension, it probably loses in some other. The hypothesis and the metaphor have been extremely influential — I think largely because many people thought that it was wrong in many different ways.

Points four and five are related and, I imagine, the most controversial of the list. Why does Wright say that there is universal epistasis? Because of physiological genetics. Or, in modern parlance, maybe because of gene networks and systems biology. On page 71, he puts it like this:

Interaction effects necessarily occur with respect to the ultimate products of chains of metabolic processes in which each step is controlled by a different locus. This carries with it the implication that interaction effects are universal in the more complex characters that trace such processes.

The argument seems to persist to this day, and I think it is true. On the other hand, there is the question of how much this matters to the variants that actually segregate in a given population and affect a given trait.

This is often framed as a question of variance. It turns out that even with epistatic gene action, in many cases, most of the genetic variance is still additive (Mäki-Tanila & Hill 2014, Huang & Mackay 2016). But something similar must apply to the effects you will see from a locus: they also depend on the allele frequencies at other loci. An interaction does nothing when one of the interaction partners is fixed. If it is nearly fixed, the interaction will do nearly nothing. If the loci are at intermediate frequencies, things become more interesting.
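Here is a small R simulation of that allele frequency argument, assuming the most extreme case: a purely epistatic two-locus model where the genotypic value is the product of allele counts, so locus 1 only matters through locus 2.

set.seed(1)
additive_variance <- function(p2, n = 10000, p1 = 0.5) {
  x1 <- rbinom(n, 2, p1)                 # allele counts at locus 1
  x2 <- rbinom(n, 2, p2)                 # allele counts at locus 2
  value <- x1 * x2                       # purely epistatic gene action
  fit <- lm(value ~ x1 + x2)             # average effects by regression
  c(additive = var(fitted(fit)), total = var(value))
}

sapply(c(0.01, 0.1, 0.5), additive_variance)

When locus 2 is nearly fixed for the zero allele, there is hardly any genetic variance to speak of and locus 1 is all but invisible; as the partner allele becomes more common, locus 1 picks up an average effect, with most of the variance showing up as additive.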

Wright’s principle of universal interaction is also grounded in his empirical work. A lot of space in this book is devoted to results from pigmentation genetics in guinea pigs, which includes lots of dominance and interaction. It could be that Wright was too quick to generalise from guinea pig coat colours to other traits. It could be that working in a system consisting of inbred lines draws your attention to nonlinearities that are rare and marginal in the source populations. On the other hand, it’s in these systems we can get a good handle on the dominance and interaction that may be missed elsewhere.

Study of effects in combination indicates a complicated network of interacting processes with numerous pleiotropic effects. There is no reason to suppose that a similar analysis of any character as complicated as melanin pigmentation would reveal a simpler genetic system. The inadequacy of any evolutionary theory that treats genes as if they had constant effects, favourable or unfavourable, irrespective of the rest of the genome, seems clear. (p. 88)

I’m not that well versed in pigmentation genetics, but I hope that someone is working on this. In an era where we can identify the molecular basis of classical genetic variants, I hope that someone keeps track of all these A, C, P, Q etc, and to what extent they’ve been mapped.

Literature

Wright, Sewall. ”Evolution and the Genetics of Populations” Volume 1 (1968).

Mäki-Tanila, Asko, and William G. Hill. ”Influence of gene interaction on complex trait variation with multilocus models.” Genetics 198.1 (2014): 355-367.

Huang, Wen, and Trudy FC Mackay. ”The genetic architecture of quantitative traits cannot be inferred from variance component analysis.” PLoS genetics 12.11 (2016): e1006421.

Yours truly outside the library on Thomas Bayes’ road, incredibly happy with having found the book.

Summer of data science 1: Genomic prediction machines #SoDS17

Genetics is a data science, right?

One of my Summer of data science learning points was to play with out of the box prediction tools. So let’s try out a few genomic prediction methods. The code is on GitHub, and the simulated data are on Figshare.

Genomic selection is the happy melding of quantitative and molecular genetics. It means using genetic markers en masse to predict traits and make breeding decisions. It can give you better accuracy in choosing the right plants or animals to pair, and it can allow you to take shortcuts by DNA testing individuals instead of having to test them or their offspring for the trait. There are a bunch of statistical models that can be used for genomic prediction. Now, the choice of prediction algorithm is probably not the most important part of genomic selection, but bear with me.

First, we need some data. For this example, I used AlphaSim (Faux & al 2016), and the AlphaSim graphical user interface, to simulate a toy breeding population. We simulate 10 chromosomes of 100 cM each, with 100 additively acting causal variants and 2000 genetic markers per chromosome. The initial genotypes come from neutral simulations. We run one generation of random mating, then three generations of selection on trait values. Each generation has 1000 individuals, with 25 males and 500 females breeding.

So we’re talking a small-ish population with a lot of relatedness and reproductive skew on the male side. We will use the first two generations of selection (2000 individuals) to train, and try to predict the breeding values of the fourth generation (1000 individuals). Let’s use two of the typical mixed models used for genomic selection, and two tree methods.

We start by splitting the dataset and centring the genotypes by subtracting the mean of each column. Centring will not change predictions, but it may help with fitting the models (Strandén & Christensen 2011).
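In R, the splitting and centring could look something like this, assuming a genotype matrix geno (individuals in rows, markers in columns) and a vector generation labelling the individuals; the object names are mine, not anything AlphaSim hands you.

# Centre each marker by its column mean; scale = FALSE means we do not
# also divide by the standard deviation.
geno_centred <- scale(geno, center = TRUE, scale = FALSE)

# Train on the earlier generations, hold out the last one.
train <- generation < max(generation)
test <- generation == max(generation)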

Let’s begin with the workhorse of genomic prediction: the linear mixed model where all marker coefficients are drawn from a normal distribution. This works out to be the same as GBLUP, the GCTA model, GREML, … a beloved child has many names. We can fit it with the R package BGLR. If we predict values for the held-out testing generation and compare with the real (simulated) values, it looks like this. The first panel shows a comparison with phenotypes, and the second with breeding values.
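The BGLR call might look like this (a sketch: pheno, breeding_value, train and test are assumed to come from the data preparation above, and ”BRR”, Bayesian ridge regression, is the BGLR name for this model). Masking the test phenotypes with NA makes BGLR predict them.

library(BGLR)

# Hide the phenotypes of the test generation; BGLR predicts the NAs.
y_masked <- pheno
y_masked[test] <- NA

fit_brr <- BGLR(y = y_masked,
                ETA = list(list(X = geno_centred, model = "BRR")),
                nIter = 20000, burnIn = 5000, verbose = FALSE)

# Accuracy: correlation of predictions with phenotype and breeding value.
cor(fit_brr$yHat[test], pheno[test])
cor(fit_brr$yHat[test], breeding_value[test])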

This gives us correlations of 0.49 between prediction and phenotype, and 0.77 between prediction and breeding value.

This is a plot of the Markov chain Monte Carlo we use to sample from the model. If a chain behaves well, it is supposed to have converged on the target distribution, and there is supposed to be low autocorrelation. Here is a trace plot of four chains for the marker variance (with the coda package). We try to be responsible Bayesian citizens and run the analysis multiple times, and with four chains we get very similar results from each of them, and a potential scale reduction factor of 1.01 (it should be close to 1 when it works). But the autocorrelation is high, so the chains do not explore the posterior distribution very efficiently.
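The coda part could go like this, assuming the model above was run four times with different saveAt prefixes (saveAt = ”chain1_” and so on); with the BRR term first in ETA, BGLR writes its variance samples to files ending in ETA_1_varB.dat.

library(coda)

# Collect the marker variance samples from the four runs.
chains <- mcmc.list(lapply(1:4, function(chain) {
  mcmc(scan(paste0("chain", chain, "_ETA_1_varB.dat")))
}))

plot(chains)           # trace plots
gelman.diag(chains)    # potential scale reduction factor
autocorr.diag(chains)  # autocorrelation within chains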

BGLR can also fit a few of the ”Bayesian alphabet” variants of the mixed model. They put different priors on the distribution of marker coefficients that allow for large-effect variants. BayesB uses a mixture prior, where a lot of effects are assumed to be zero (Meuwissen, Hayes & Goddard 2001). The way we simulated the dataset is actually close to the BayesB model: a lot of variants have no effect. However, mixture models like BayesB are notoriously difficult to fit — and in this case, it clearly doesn’t work that well. The plots below show chains for two BayesB parameters, with potential scale reduction factors of 1.4 and 1.5. So, even if the model gives us the same accuracy as ridge regression (0.77), we can’t know if this reflects what BayesB could do.
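For completeness, the BayesB fit is the same BGLR call with a different model name:

# BayesB: a mixture prior where a proportion of marker effects is zero.
fit_bayesb <- BGLR(y = y_masked,
                   ETA = list(list(X = geno_centred, model = "BayesB")),
                   nIter = 20000, burnIn = 5000, verbose = FALSE)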

On to the trees! Let’s try random forest and Bayesian additive regression trees (BART). Regression trees make models as bifurcating trees. Something like the regression variant of: ”If the animal has a beak, check if it has a venomous spur. If it does, say that it’s a platypus. If it doesn’t, check whether it quacks like a duck …” The random forest makes a lot of trees on random subsets of the data, and combines the inferences from them. BART makes a sum of trees. Both a random forest (randomForest package) and a BART model (fit with the bartMachine package) give lower accuracy on this dataset — 0.66 for random forest and 0.72 for BART. This is not so unexpected, because the strength of tree models seems to lie in capturing non-additive effects, and this dataset, by construction, has purely additive inheritance. Both BART and random forest have hyperparameters that one needs to set. I used package defaults for random forest and, for BART, values that worked well for Waldmann (2016), but one should probably choose them by cross validation.
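The tree models can be fit along these lines, with the same assumed objects as above (note that bartMachine wants data frames, and needs Java via rJava):

library(randomForest)
library(bartMachine)

# Random forest: many trees on bootstrap samples, predictions averaged.
fit_rf <- randomForest(x = geno_centred[train, ], y = pheno[train])
pred_rf <- predict(fit_rf, geno_centred[test, ])

# BART: a sum of trees, with priors that keep each tree small.
fit_bart <- bartMachine(X = as.data.frame(geno_centred[train, ]),
                        y = pheno[train])
pred_bart <- predict(fit_bart, as.data.frame(geno_centred[test, ]))

cor(pred_rf, breeding_value[test])
cor(pred_bart, breeding_value[test])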

Finally, we can use classical quantitative genetics to estimate breeding values from the pedigree and relatives’ trait values. Fitting the so-called animal model in two ways (pedigree package and MCMCglmm) gives accuracies of 0.59 and 0.60.
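The MCMCglmm version could be sketched like this, assuming a data frame dat with columns pheno and animal, and a pedigree ped whose first column matches dat$animal:

library(MCMCglmm)

# Animal model: the pedigree defines the covariance between relatives.
# pr = TRUE stores the samples of the random effects (breeding values).
fit_animal <- MCMCglmm(pheno ~ 1,
                       random = ~ animal,
                       pedigree = ped,
                       data = dat,
                       pr = TRUE)

# Posterior mean breeding values sit among the random effect solutions.
ebv <- colMeans(fit_animal$Sol[, grep("animal", colnames(fit_animal$Sol))])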

So, in summary, we recover the common wisdom that the linear mixed model does the job well. It was more accurate than pedigree alone, and a bit better than BART. Of course, the point of this post is not to make a fair comparison of methods. Also, the real magic of genomic selection, presumably, happens at every step along the way: how do you get to that neat individual-by-marker matrix in the first place, how do you deal with missing data and data from different sources, what and when do you measure, what do you do with the predictions … But you knew that already.