What is a locus, anyway?

”Locus” is one of those confusing genetics terms (its meaning, not just its pronunciation). We can probably all agree with a dictionary and with Wikipedia that it means a place in the genome, but a place of what, and in what sense? We also use place-related words like ”site” and ”region” that one might think were synonymous, but they don’t seem to be.

As an example, we can look at this relatively recent preprint (Chebib & Guillaume 2020) about a model of the causes of genetic correlation. They have pairs of linked loci where each locus affects one trait (that’s the tight linkage condition), and also a set of loci that affect both traits (the pleiotropic condition), with correlated Gaussian stabilising selection and different levels of mutation, migration and recombination between the linked pairs. A mutation means adding a number to the effect of an allele.

This means that loci in this model can have a large number of alleles with quantitatively different effects. The alleles at a locus share a distribution of mutation effects, which can be either two-dimensional (with pleiotropy) or one-dimensional. They also share a recombination rate with all other loci, which is constant.
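
To make that concrete, here is a minimal sketch in Python of how I read the mutation model: an allele’s effect is a number (or a pair of numbers for a pleiotropic locus), and a mutation adds a Gaussian draw to it, correlated across the two traits in the pleiotropic case. The parameter values and function names are mine, for illustration, not taken from the preprint.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative mutational parameters (not the values used in the preprint):
# standard deviation of mutational effects, and the mutational correlation
# between the two traits for pleiotropic loci.
mut_sd = 0.1
mut_corr = 0.5
mut_cov = mut_sd ** 2 * np.array([[1.0, mut_corr],
                                  [mut_corr, 1.0]])

def mutate_pleiotropic(effects):
    """Mutate a pleiotropic allele: add a correlated bivariate Gaussian draw
    to its two-trait effect."""
    return effects + rng.multivariate_normal(np.zeros(2), mut_cov)

def mutate_linked_pair(effects):
    """Mutate one locus of a tightly linked pair, where each locus affects
    only one of the two traits."""
    new_effects = effects.copy()
    which = rng.integers(2)
    new_effects[which] += rng.normal(0.0, mut_sd)
    return new_effects

print(mutate_pleiotropic(np.zeros(2)))
print(mutate_linked_pair(np.zeros(2)))
```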

What kind of DNA sequences can have these properties? Single nucleotide sites are out of the question, as they can have four, or maybe five alleles if you count a deletion. Larger structural variants, such as inversions or allelic series of indels might work. A protein-coding gene taken as a unit could have a huge number of different alleles, but they would probably have different distributions of mutational effects in different sites, and (relatively small) differences in genetic distance to different sites.

It seems to me that we’re talking about an abstract group of potential alleles that have sufficiently similar effects and that are sufficiently closely linked. This is fine; I’m not saying this to criticise the model, but to explore how strange a locus really is.

They find that there is less genetic correlation with linkage than with pleiotropy, unless the mutation rate is high, which leads to a discussion about mutation rate. This reasoning about the mutation rate of a locus illustrates the issue:

A high rate of mutation (10^-3) allows for multiple mutations in both loci in a tightly linked pair to accumulate and maintain levels of genetic covariance near to that of mutations in a single pleiotropic locus, but empirical estimations of mutation rates from varied species like bacteria and humans suggests that per-nucleotide mutation rates are in the order of 10^-8 to 10^-9 … If a polygenic locus consists of hundreds or thousands of nucleotides, as in the case of many quantitative trait loci (QTLs), then per-locus mutation rates may be as high as 10^-5, but the larger the locus the higher the chance of recombination between within-locus variants that are contributing to genetic correlation. This leads us to believe that with empirically estimated levels of mutation and recombination, strong genetic correlation between traits are more likely to be maintained if there is an underlying pleiotropic architecture affecting them than will be maintained due to tight linkage.

I don’t know if it’s me or the authors who are conceptually confused here. If they are referring to QTL mapping, it is true that the quantitative trait loci that we detect in mapping studies often are huge. ”Thousands of nucleotides” is being generous to mapping studies: in many cases, we’re talking millions of them. But the size of a QTL region from a mapping experiment doesn’t tell us how many nucleotides in it matter to the trait. It reflects our poor resolution in delineating the (one or more) causative variants that give rise to the association signal. That being said, it might be possible to use tricks like saturation mutagenesis to figure out which mutations within a relevant region could affect a trait. Then, we could actually observe a locus in the above sense.
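
To put rough numbers on the mutation rate arithmetic in the quote: with a per-nucleotide rate of 10^-8 and a 1,000-nucleotide locus, the per-locus mutation rate comes out at 10^-5, and with a genome-average recombination rate of roughly 1 cM/Mb, the recombination fraction within such a locus is of the same order. The numbers below are round illustrative values, not estimates from the preprint.

```python
# Back-of-envelope arithmetic for the quoted argument; round illustrative
# numbers, not estimates from the preprint.
per_nucleotide_mu = 1e-8    # per-nucleotide mutation rate per generation
locus_length = 1_000        # nucleotides in a hypothetical "polygenic locus"
cm_per_mb = 1.0             # rough genome-average recombination rate

per_locus_mu = per_nucleotide_mu * locus_length
# Recombination fraction across the locus: length in Mb times cM/Mb,
# converted from centimorgans to a probability.
within_locus_rec = locus_length / 1e6 * cm_per_mb / 100

print(f"per-locus mutation rate: {per_locus_mu:.0e}")                   # 1e-05
print(f"within-locus recombination fraction: {within_locus_rec:.0e}")   # 1e-05
```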

Another recent theoretical preprint (Chantepie & Chevin 2020) phrases it like this:

[N]ote that the nature of loci is not explicit in this model, but in any case these do not represent single nucleotides or even genes. Rather, they represent large stretches of effectively non-recombining portions of the genome, which may influence the traits by mutation. Since free recombination is also assumed across these loci (consistent with most previous studies), the latter can even be thought of as small chromosomes, for which mutation rates of the order of 10^-2 seem reasonable.

Literature

Chebib and Guillaume. ”Pleiotropy or linkage? Their relative contributions to the genetic correlation of quantitative traits and detection by multi-trait GWA studies.” bioRxiv (2019): 656413.

Chantepie and Chevin. ”How does the strength of selection influence genetic correlations?” bioRxiv (2020).

Journal club of one: ”Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses”

This paper (Wallace 2020) is about improvements to the colocalisation method for genome-wide association studies called coloc. If you have an association to trait 1 in a region, and another association with trait 2, coloc investigates whether they are caused by the same variant or not. I’ve never used coloc, but I’m interested because setting reasonable priors is related to getting reasonable parameters for genetic architecture.

The paper also looks at how coloc is used in the literature (with default settings, unsurprisingly), and extends coloc to relax the assumption of only one causal variant per region. In that way, it’s a solid example of thoughtfully updating a popular method.

(A note about style: This isn’t the clearest paper, for a few reasons. The structure of the introduction is indirect, talking a lot about Mendelian randomisation before concluding that coloc isn’t Mendelian randomisation. The paper also uses numbered hypotheses (H0-H4) instead of spelling out what they mean … If you feel a little stupid reading it, it’s not just you.)

coloc is what we old QTL mappers call a pleiotropy versus linkage test. It tries to distinguish five scenarios: no association, trait 1 only, trait 2 only, both traits with linked variants, both traits with the same variant.

This paper deals with the priors: What is the prior probability of a causal association to trait 1 only (p_1), trait 2 only (p_2), or both traits (p_{12}), and are the defaults good?

They reparametrise the priors so that it becomes possible to get some estimates from the literature. They work with the probability that a SNP is causally associated with each trait (which means adding the probabilities of association: q_1 = p_1 + p_{12}) … This means that you can look at single-trait association data, and get an idea of the number of marginal associations, possibly dependent on allele frequency. The estimates from a gene expression dataset and a genome-wide association catalog work out to a prior around 10^-4, which is the coloc default. So far so good.

How about p_{12}?

If the traits were independent, you could just multiply q_1 and q_2. But not all of the genome is functional, and causal associations with both traits are concentrated in the functional fraction. If you could straightforwardly define a functional proportion, you could just divide by it.
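
Here is a minimal sketch of that arithmetic as I read it. The 10% functional fraction is a made-up number for illustration; q_1 and q_2 are set to the 10^-4 estimate mentioned above.

```python
# Per-SNP probabilities of causal association with each trait,
# set to the rough estimate / coloc default discussed above.
q1 = 1e-4
q2 = 1e-4

# If the traits were independent, the prior for a shared causal variant
# would just be the product.
p12_independent = q1 * q2                      # 1e-8

# Dividing by a (hypothetical) functional proportion of the genome,
# as in the reasoning above, pushes the prior up.
functional_fraction = 0.1                      # made-up illustrative value
p12_adjusted = q1 * q2 / functional_fraction   # 1e-7

# The single-trait-only priors then follow from q1 = p1 + p12.
p1 = q1 - p12_adjusted
p2 = q2 - p12_adjusted

print(p12_independent, p12_adjusted, p1, p2)
```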

You could also look at the genetic correlation between traits. It makes sense that the overall genetic relationship between two traits should inform the prior for seeing an overlap at this particular locus. This gives a lower limit for p_{12}. Unfortunately, this still leaves us dependent on what kinds of traits we’re analysing. Perhaps it’s not so surprising that there isn’t one prior that universally works for all kinds of pairs of traits:

Attempts to colocalise disease and eQTL signals have ranged from underwhelming to positive. One key difference between outcomes is the disease-specific relevance of the cell types considered, which is consistent with variable chromatin state enrichment in different GWAS according to cell type. For example, studies considering the overlap of open chromatin and GWAS signals have convincingly shown that tissue relevance varies by up to 10 fold, with pancreatic islets of greatest relevance for traits like insulin sensitivity and immune cells for immune-mediated diseases. This suggests that p_{12} should depend explicitly on the specific pair of traits under consideration, including cell type in the case of eQTL or chromatin mark studies. One avenue for future exploration is whether fold change in enrichment of open chromatin/GWAS signal overlap between cell types could be used to modulate p_{12} and select larger values for more a priori relevant tissues.

Literature

Wallace, Chris. ”Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses.” PLoS Genetics 16.4 (2020): e1008720.

Temple Grandin at Roslin: optimisation and overselection

A couple of months ago (16 May to be precise), I listened to a talk by Temple Grandin at the Roslin Institute.

Grandin is a captivating speaker, and as an animal scientist (of some kind), I’m happy to have heard her talk at least once. The lecture contained a mix of:

  • practical experiences from a career of troubleshooting livestock management and systems,
  • how thinking differently (visually) helps in working with animal behaviour,
  • terrific personal anecdotes, among other things about starting up her business as a livestock management consultant from a student room,
  • a recurring theme of unintended side-effects in animal breeding, framed as a risk of ”overselecting” for any one trait, uncertainty about ”what is optimal”, and the importance of measuring and soberly evaluating many different things about animals and systems.

This latter point interests me, because it concerns genetics and animal breeding. Judging by the questions in the Q&A, it also especially interested the rest of the audience, which was mostly composed of vet students.

Grandin repeatedly cautioned against ”overselecting”. She argued that if you take one trait, any trait, and apply strong directional selection, bad side-effects will emerge. As a loosely worded biological principle, and taken to extremes, this seems likely to be true. If we assume that traits are polygenic, that means both that variants are likely to be pleiotropic (because there are many causal variants and a limited number of genes; this is one argument for the omnigenic model) and that variants are likely to be linked to other variants that affect other traits. So changing one trait a lot is likely to affect other traits. And if we assume that the animal was in a pretty well-functioning state before selection, we should expect that if some trait that we’re not consciously selecting on changes far enough from that state, it is likely to cause problems.

We can also safely assume that there are always more traits that we care about than we can actually measure, either because they haven’t become a problem yet, or because we don’t have a good way to measure them. Taken together, this sounds like a case for being cautious, measuring a lot of things about animal performance and welfare, and continuously re-evaluating what one is doing. Grandin emphasised the importance of measurement, drumming in that ”you will manage what you measure”, that ”this happens gradually”, and that, therefore, there is a risk that ”the bad becomes the new normal” if one does not keep tabs on the situation by recording hard quantitative data.

Doesn’t this sound a lot like the conventional view of mainstream animal breeding? I guess it depends: breeding is a big field, covering a lot of experiences and views, from individual farmers’ decisions, through private and public breeding organisations, to the relative Castalia of academic research. However, my impression of the field is that Grandin and mainstream animal breeders are in agreement about the importance of:

  1. recording lots of traits about all aspects of the performance and functioning of the animal,
  2. optimising them with good performance on the farm as the goal,
  3. constantly re-evaluating practice and updating the breeding goals and management to keep everything on track.

To me, what Grandin presented as if it was a radical message (and maybe it was, some time ago, or maybe it still is, in some places) sounded much like singing the praises of economic selection indices. I had expected something more controversial. Then again, that depends on what assumptions are built into words like ”good performance”, ”on track”, ”functioning of the animal” etc. For example, she talked a bit about the strand of animal welfare research that aims to quantify positive emotions in animals; one could take the radical stance that we should measure positive emotions and include that in the breeding goal.

”Overselection” as a term also carries connotations that I don’t agree with, because I don’t think that the framing as biological overload is helpful. To talk about overload and ”overselection” makes one think of selection as a force that strains the animal in itself, and of the alternative as ”backing off” (an expression Grandin repeatedly used in the talk). But if the breeding goal is off the mark, in the sense that it doesn’t move towards what’s actually optimal for the animal on the farm, breeding less efficiently does not get you to a better outcome; it only gets to the same, suboptimal, outcome more slowly. The problem isn’t efficiency in itself, but misspecification, and uncertainty about what the goal should be.

Grandin expands on this idea in the introductory chapter to ”Are we pushing animals to their biological limits? Welfare and ethical applications” (Grandin & Whiting 2018, eds). (I don’t know much about the pig case used as illustration, but I can think of a couple of other examples that illustrate the same point.) It ends with this great metaphor about genomic power tools, that I will borrow for later:

We must be careful not to repeat the mistakes that were made with conventional breeding where bad traits were linked with desirable traits. One of the best ways to prevent this is for both animal and plant breeders to do what I did in the 1980s and 1990s: I observed many different pigs from many places and the behaviour problems became obvious. This enabled me to compare animals from different lines in the same environment. Today, both animal and plant breeders have ‘genomic power tools’ for changing an organism’s genetics. Power tools are good things, but they must be used carefully because changes can be made more quickly. A circular saw can chop your hand off much more easily than a hand saw. It has to be used with more care.

Different worlds

Some time ago, I gave a seminar about some work involving chicken combs, and I made some offhand remark about how I don’t think that the larger combs of modern layer chickens are the result of direct selection. I think it is more likely to be a correlated response to selection on reproductive traits. During question time, someone disagreed, proposing that ornamental traits should be very likely to have been under artificial selection.

I choose this example partly because the stakes are so low. I may very well be wrong, but it doesn’t matter for the work at hand. Clearly, I should be more careful to acknowledge all plausible possibilities, and not speculate so much for no reason. But I think this kind of thing is an example of something quite common.

That is: researchers, even those who fit snugly into the same rather narrow sub-field, may make quite different default assumptions about the world. I suspect, for instance, that we were both, in the absence of hard evidence, trying to be conservative in falling back on the most parsimonious explanation. I know that I think of a trait being under direct selection as a strong claim, and ”it may just be hitch-hiking on something else” as a conservative attitude. But on the other hand, one could think of direct artificial selection as a simpler explanation as opposed to a scenario that demands pleiotropy.

I can see a point to both attitudes, and in different contexts, I’d probably think of either direct selection or pleiotropy as the more far-fetched explanation. For example, I am hard pressed to believe that reductions in fearfulness and changes in pigmentation of domestic animals are generally explained by pleiotropic variants.

This is why I think that arguments about Occam’s razor, burdens of proof, and what the appropriate ”null” hypothesis for a certain field is supposed to be, while they may be appealing (especially so when they support your position), are fundamentally unhelpful. And this is why I think incommensurability is not that outlandish a notion. Sometimes, researchers might as well be living in different worlds.

”These are all fairly obvious” (says Sewall Wright)

I was checking a quote from Sewall Wright, and it turned out that the whole passage was delightful. Here it is, from volume 1 of Evolution and the Genetics of Populations (pages 59-60):

There are a number of broad generalizations that follow from this netlike relationship between genome and complex characters. These are all fairly obvious but it may be well to state them explicitly.

1) The variations of most characters are affected by a great many loci (the multiple factor hypothesis).

2) In general, each gene replacement has effects on many characters (the principle of universal pleiotropy).

3) Each of the innumerable possible alleles at any locus has a unique array of differential effects on taking account of pleiotropy (uniqueness of alleles).

4) The dominance relation of two alleles is not an attribute of them but of the whole genome and of the environment. Dominance may differ for each pleiotropic effect and is in general easily modifiable (relativity of dominance).

5) The effects of multiple loci on a character in general involve much nonadditive interaction (universality of interaction effects).

6) Both ontogenetic and phylogenetic homology depend on calling into play similar chains of gene-controlled reactions under similar developmental conditions (homology).

7) The contributions of measurable characters to overall selective value usually involve interaction effects of the most extreme sort because of the usually intermediate position of the optimum grade, a situation that implies the existence of innumerable different selective peaks (multiple selective peaks).

What can we say about this?

It seems point one is true. People may argue about whether the variants behind complex traits are many, relatively common, with tiny individual effects or many, relatively rare, and with larger effects that average out to tiny effects when measured in the whole population. In any case, there are many causative variants, alright.

Point two — now also known as the omnigenic model — hinges on how you read ”in general”, I guess. In some sense, universal pleiotropy follows from genome crowding. If there are enough causative variants and a limited number of genes, eventually every gene will be associated with every trait.
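
A crude way to put numbers on the crowding argument, with made-up counts and the obviously simplified assumption that causal variants land on genes uniformly at random:

```python
# Crude version of the crowding argument; the counts are made up and
# variants are assumed to land on genes uniformly at random.
n_genes = 20_000             # order of magnitude for protein-coding genes
n_causal_variants = 100_000  # hypothetical number of causal variants for one trait

# Probability that a particular gene harbours at least one causal variant.
p_hit = 1 - (1 - 1 / n_genes) ** n_causal_variants
print(round(p_hit, 3))       # about 0.99
```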

I don’t think that point three is true. I would assume that many loss of function mutations to protein coding genes, for example, would be interchangeable.

I don’t really understand points six and seven, about homology and fitness landscapes, that well. The later section about homology reads to me as if it could be part of a debate going on at the time. Number seven describes Wright’s view of natural selection as a kind of fitness whack-a-mole, where if a genotype is fit in one dimension, it probably loses in some other. The hypothesis and the metaphor have been extremely influential — I think largely because many people thought that it was wrong in many different ways.

Points four and five are related and, I imagine, the most controversial of the list. Why does Wright say that there is universal epistasis? Because of physiological genetics. Or, in modern parlance, maybe because of gene networks and systems biology. On page 71, he puts it like this:

Interaction effects necessarily occur with respect to the ultimate products of chains of metabolic processes in which each step is controlled by a different locus. This carries with it the implication that interaction effects are universal in the more complex characters that trace such processes.

The argument seems to persist to this day, and I think it is true. On the other hand, there is the question of how much this matters to the variants that actually segregate in a given population and affect a given trait.

This is often framed as a question of variance. It turns out that even with epistatic gene action, in many cases, most of the genetic variance is still additive (Mäki-Tanila & Hill 2014, Huang & Mackay 2016). But something similar must apply to the effects that you will see from a locus. They also depend on the allele frequencies at other loci. An interaction does nothing when one of the interaction partners is fixed. If it is nearly fixed, the interaction will do nearly nothing. If they’re all at intermediate frequency, things become more interesting.
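
Here is a toy two-locus example of that point, with made-up effect sizes: the average effect of substituting an allele at one locus depends on the allele frequency at its interaction partner, and the interaction contributes nothing when the partner is fixed at zero.

```python
# Toy two-locus model with made-up effect sizes. Haploid loci with allele
# indicators x_A, x_B in {0, 1}, and genotypic value
#   value = a * x_A + b * x_B + e * x_A * x_B
a, b, e = 0.5, 0.5, 1.0

def average_effect_A(p_B):
    """Average effect of substituting the A allele, given the frequency of B.

    E[value | x_A = 1] - E[value | x_A = 0] = a + e * p_B
    """
    return a + e * p_B

for p_B in [0.0, 0.01, 0.5, 1.0]:
    print(f"freq(B) = {p_B}: average effect of A = {average_effect_A(p_B)}")
```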

Wright’s principle of universal interaction is also grounded in his empirical work. A lot of space in this book is devoted to results from pigmentation genetics in guinea pigs, which includes lots of dominance and interaction. It could be that Wright was too quick to generalise from guinea pig coat colours to other traits. It could be that working in a system consisting of inbred lines draws your attention to nonlinearities that are rare and marginal in the source populations. On the other hand, it’s in these systems we can get a good handle on the dominance and interaction that may be missed elsewhere.

Study of effects in combination indicates a complicated network of interacting processes with numerous pleiotropic effects. There is no reason to suppose that a similar analysis of any character as complicated as melanin pigmentation would reveal a simpler genetic system. The inadequacy of any evolutionary theory that treats genes as if they had constant effects, favourable or unfavourable, irrespective of the rest of the genome, seems clear. (p. 88)

I’m not that well versed in pigmentation genetics, but I hope that someone is working on this. In an era where we can identify the molecular basis of classical genetic variants, I hope that someone keeps track of all these A, C, P, Q etc, and to what extent they’ve been mapped.

Literature

Wright, Sewall. ”Evolution and the Genetics of Populations” Volume 1 (1968).

Mäki-Tanila, Asko, and William G. Hill. ”Influence of gene interaction on complex trait variation with multilocus models.” Genetics 198.1 (2014): 355-367.

Huang, Wen, and Trudy FC Mackay. ”The genetic architecture of quantitative traits cannot be inferred from variance component analysis.” PLoS Genetics 12.11 (2016): e1006421.


Yours truly outside the library on Thomas Bayes’ road, incredibly happy with having found the book.

Morning coffee: null in genetics


Andrew Gelman sometimes writes that in genetics it might make sense to have a null hypothesis of zero effect, but in social science nothing is ever exactly zero (and interactions abound). I wonder whether that is actually true even for genetics. Think about pleiotropy. Be it universal or modular, I think the evidence still points in the direction that we should expect any genetic variant to affect lots of traits, albeit with often very small effects. And think of gene expression where genes always show lots of correlation structure: do we expect transcripts from the same cells to ever be independent of each other? It doesn’t seem to me that the null can be strictly true here. Most of these differences have to be too small for us to practically be able to model them, though — and maybe the small effects are so far below the detection limit that we can pretend that they could be zero. (Note: not trying to criticise anybody’s statistical method or view of effect sizes here, just thinking aloud about the ”no true null effect” argument.)

Morning coffee: pleiotropy


In the simplest terms, pleiotropy means genetic side-effects: a pleiotropic gene is a gene that does several things, and a pleiotropic variant is a variant that makes its carrier different from carriers of other variants in more than one trait. It’s just that the words ‘gene’, ‘trait’ and ‘different’ are somewhat ambiguous. Paaby & Rockman (2013) have written a nice analytical review about the meaning of pleiotropy. In their terminology, molecular gene pleiotropy is when the product of a gene is involved in more than one biological process. Developmental pleiotropy, on the other hand, deals with genetic variants: a variant is developmentally pleiotropic if it affects more than one trait. This is the sense of the word I’d normally think of. Third, selectional pleiotropy deals with variants that affect several aspects of fitness, possibly differently for different individuals.

Imagine that we have found a variant associated with two variables. Have we got a pleiotropic variant on our hands? If the variables are just different measures of the same thing, clearly we’re dealing with one trait. But imagine that the variables are actually driven by largely different factors. They might respond to different environmental stimuli and have mostly separate genetic architectures. If so, we have two different traits and a pleiotropic variant affecting both. My point is that it depends on the actual functional relationship between the traits. Without knowing something about how the organism works we can’t count traits. With that in mind, it seems very bold to say things about variants in general and traits in general. Paaby & Rockman’s conclusion seems to be that genetic mapping is not the way to go, because of low power to detect variants of small effect, and instead they bring up alternative statistical and quantitative genetics methods to demonstrate pleiotropy on a large scale. I agree that these results reinforce that pleiotropy must be important, in some sense of the word. But I think the opposite approach still has value: the way to figure out how important pleiotropy is for any given suite of traits is to study them mechanistically.

(Zombie kitty by Anna Nygren.)