Kauai field trip 2018

Let’s keep the tradition of delayed travel posts going!

In August last year, I joined Dom Wright, Rie Henriksen, and Robin Abbey-Lee, as part of Dom's FERALGEN project, on their field work on Kauai. I did some of my dissertation work on the Kauai feral chickens, but this was the first time I saw them live. Our collaborator Eben Gering was also on the islands, but the closest we got to each other was Skyping between the islands. It all went smoothly until the end of the trip, when a hurricane came uncomfortably close to the island for a while. Here are some pictures. In time, I promise to blog about the actual research too.

Look! Chickens by the sea, chickens on parking lots, a sign telling people not to feed the chickens on a sidewalk in central Kapaa! Lots of chickens.

I’m not kidding: lots of chickens.

Links

An old Nature News feature from a previous field trip (without me)

My post about our 2016 paper on Kauai feralisation genomics

‘Any distinction in principle between qualitative and quantitative characters disappeared long ago’

Any distinction in principle between qualitative and quantitative characters disappeared long ago, although in the early days of Mendelism it was often conjectured that they might be inherited according to fundamentally different laws.

If it is still convenient to call some characters qualitative and others quantitative, it is only to denote that the former naturally have a discontinuous and the latter a continuous distribution, or that the former are not easily measured on a familiar metrical scale. Colors are an example. Differences between colors can be measured in terms of length of light waves, hue, brilliance etc., but most of us find it difficult to compare those measurements with our own visual impressions.

Most quantitative characters are affected by many pairs of genes and also importantly by environmental variations. It is rarely possible to identify the pertinent genes in a Mendelian way or to map the chromosomal position of any of them. Fortunately this inability to identify and describe the genes individually is almost no handicap to the breeder of economic plants or animals. What he would actually do if he knew the details about all the genes which affect a quantitative character in that population differs little from what he will do if he merely knows how heritable it is and whether much of the hereditary variance comes from dominance or overdominance, and from epistatic interactions between the genes.

(That last part might not always be true anymore, but it remained on point for more than half of the time that genetics as a discipline has existed.)

Jay L Lush (1949) Heritability of quantitative characters in farm animals

Using R: plotting the genome on a line

Imagine you want to make a Manhattan-style plot or anything else where you want a series of intervals laid out on one axis after one another. If it’s actually a Manhattan plot you may have a friendly R package that does it for you, but here is how to cobble the plot together ourselves with ggplot2.

We start by making some fake data. Here, we have three contigs (these could be your chromosomes, your genomic intervals or whatever) divided into three, two and one windows, respectively. Each window has a value that we’ll put on the y-axis.

library(dplyr)
library(ggplot2)

data <- tibble(contig = c("a", "a", "a", "b", "b", "c"),
               start = c(0, 500, 1000, 0, 500, 0),
               end = c(500, 1000, 1500, 500, 1000, 200),
               value = c(0.5, 0.2, 0.4, 0.5, 0.3, 0.1))

We will need to know how long each contig is. In this case, if we assume that the windows cover the whole thing, we can get this from the data. If not, say if the windows don’t go up to the end of the chromosome, we will have to get this data from elsewhere (often some genome assembly metadata). This is also where we can decide in what order we want the contigs.

contig_lengths <- summarise(group_by(data, contig), length = max(end))

Now, we need to transform the coordinates on each contig to coordinates on our new axis, where we lay the contigs after one another. What we need to do is to add an offset to each point, where the offset is the sum of the lengths of the contigs we’ve laid down before this one. We make a function that takes three arguments: two vectors containing the contig and the position of each point, and the table of lengths we just made.

flatten_coordinates <- function(contig, coord, contig_lengths) {
    coord_flat <- coord
    offset <- 0

    for (contig_ix in 1:nrow(contig_lengths)) {
        on_contig <- contig == contig_lengths$contig[contig_ix]
        coord_flat[on_contig] <- coord[on_contig] + offset
        offset <- offset + contig_lengths$length[contig_ix]
    }

    coord_flat
}
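As a quick sanity check, here is the function applied to the toy data (with the data and function repeated, as plain data frames, so the chunk runs on its own): start positions on contig b should shift by the length of contig a (1500), and the start on contig c by the combined length of a and b (2500).

```r
# The same toy data and function as above, repeated so this chunk is standalone
toy_data <- data.frame(contig = c("a", "a", "a", "b", "b", "c"),
                       start = c(0, 500, 1000, 0, 500, 0))
toy_lengths <- data.frame(contig = c("a", "b", "c"),
                          length = c(1500, 1000, 200))

flatten_coordinates <- function(contig, coord, contig_lengths) {
    coord_flat <- coord
    offset <- 0

    for (contig_ix in 1:nrow(contig_lengths)) {
        on_contig <- contig == contig_lengths$contig[contig_ix]
        coord_flat[on_contig] <- coord[on_contig] + offset
        offset <- offset + contig_lengths$length[contig_ix]
    }

    coord_flat
}

flatten_coordinates(toy_data$contig, toy_data$start, toy_lengths)
# 0 500 1000 1500 2000 2500
```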

Now, we use this to transform the start and end of each window. We also transform the vector of the length of the contigs, so we can use it to add vertical lines between the contigs.

data$start_flat <- flatten_coordinates(data$contig,
                                       data$start,
                                       contig_lengths)
data$end_flat <- flatten_coordinates(data$contig,
                                     data$end,
                                     contig_lengths)
contig_lengths$length_flat <- flatten_coordinates(contig_lengths$contig,
                                                  contig_lengths$length,
                                                  contig_lengths)

It would be nice to label the x-axis with contig names. One way to do this is to take the coordinates we just made for the vertical lines, add a zero, and shift them one position, like so:

axis_coord <- c(0, contig_lengths$length_flat[-nrow(contig_lengths)])

Now it’s time to plot! We add one layer of points for the values on the y-axis, where each point is centered on the middle of the window, followed by a layer of vertical lines at the borders between contigs. Finally, we add our custom x-axis, and also some window dressing.

plot_genome <- ggplot() +
    geom_point(aes(x = (start_flat + end_flat)/2,
                   y = value),
               data = data) +
    geom_vline(aes(xintercept = length_flat),
               data = contig_lengths) +
    scale_x_continuous(breaks = axis_coord,
                       labels = contig_lengths$contig,
                       limits = c(0, max(contig_lengths$length_flat))) +
    xlab("Contig") + ylim(0, 1) + theme_bw()

And this is what we get:

I’m sure your plot will look more impressive, but you get the idea.

Neutral citation again

Here is a piece of advice about citation:

Rule 4: Cite transparently, not neutrally

Citing, even in accordance with content, requires context. This is especially important when it happens as part of the article’s argument. Not all citations are a part of an article’s argument. Citations to data, resources, materials, and established methods require less, if any, context. As part of the argument, however, the mere inclusion of a citation, even when in the right spot, does not convey the value of the reference and, accordingly, the rationale for including it. In a recent editorial, the Nature Genetics editors argued against so-called neutral citation. This citation practice, they argue, appears neutral or procedural yet lacks required displays of context of the cited source or rationale for including [11]. Rather, citations should mention assessments of value, worth, relevance, or significance in the context of whether findings support or oppose reported data or conclusions.

This flows from the realisation that citations are political, even though that term is rarely used in this context. Researchers can use them to accurately represent, inflate, or deflate contributions, based on (1) whether they are included and (2) whether their contributions are qualified. Context or rationale can be qualified by using the right verbs. The contribution of a specific reference can be inflated or deflated through the absence of or use of the wrong qualifying term (‘the authors suggest’ versus ‘the authors establish’; ‘this excellent study shows’ versus ‘this pilot study shows’). If intentional, it is a form of deception, rewriting the content of scientific canon. If unintentional, it is the result of sloppy writing. Ask yourself why you are citing prior work and which value you are attributing to it, and whether the answers to these questions are accessible to your readers.

When Nature Genetics had an editorial condemning neutral citation, I took it to be a demand that authors show that they’ve read and thought about the papers they cite.

This piece of advice seems to ask for something different: that authors be honest about their opinions of the work they cite. That is a radical suggestion, because if people were, I believe readers would be offended. That is, if the paper wasn’t held back by offended peer reviewers before it reached any readers. Honestly, as a reviewer, I would probably complain if I saw a value-laden and vacuous statement like ‘this excellent study’ in front of a citation. It would seem to me a rude attempt to tell the reader what to think.

So how are we to cite a study? On the one hand, we can’t just drop the citation in a sentence, but are obliged to ‘mention assessments of value, worth, relevance or significance’. On the other hand, we must make sure that they are ‘qualified by using the right verbs’. And if citation is political, then whether a study ‘suggests’ or ‘establishes’ conclusions is also political.

Disclaimer: I don’t like the 10 simple rules format at all. I find that they belong on someone’s personal blog and not in a scientific journal, given that their evidence for their assertions usually amounts to nothing more than my own meandering experience … This one is an exception, because Bart Penders does research on how scientists collaborate and communicate (even if he cites no research in this particular part of the text).

Penders B (2018) Ten simple rules for responsible referencing. PLoS Computational Biology

Journal club of one: ‘Biological relevance of computationally predicted pathogenicity of noncoding variants’

Wouldn’t it be great if we had a way to tell genetic variants that do something to gene function and regulation from those that don’t? This is a Really Hard Problem, especially for variants that fall outside of protein-coding regions, and thus may or may not do something to gene regulation.

There is a host of bioinformatic methods that tackle the problem, using different combinations of evolutionary analysis (looking at how often the position of the variant differs between or within species), functional genomics (what histone modifications, chromatin accessibility and so on look like at the location of the variant) and statistics (comparing known functional variants to other variants).

When a new method is published, it’s always accompanied by a receiver operating characteristic curve showing how well it predicts held-out data, and some combination of comparisons to other methods and analyses of other datasets of known or presumed functional variants. However, one wonders how these methods will do when we use them to evaluate unknown variants in the lab, or eventually in the clinic.

This is what this paper, Liu et al. (2019) ‘Biological relevance of computationally predicted pathogenicity of noncoding variants’, tries to do. The authors construct three test cases that are meant to be more realistic (that is, more pessimistic) test beds for six noncoding variant effect predictors.

The tasks are:

  1. Find out which allele of a variant is the deleterious one. The presumed deleterious test alleles here are ones that don’t occur in any species of a large multiple genome alignment.
  2. Find a causative variant among a set of linked variants. The test alleles are causative variants from the Human Gene Mutation Database and some variants close to them.
  3. Enrich for causative variants among increasingly bigger sets of non-functional variants.

In summary, the methods don’t do too well. The authors conclude that they show ‘underwhelming performance’. That isn’t happy news, but I don’t think it’s such a surprise. Noncoding variant prediction is universally acknowledged to be tricky. In particular, looking at Task 3, the predictors are bound to look much less impressive in the face of class imbalance than in those receiver operating characteristic curves. Then again, class imbalance is going to be a fact when we go out to apply these methods to our long lists of candidate variants.
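To see why class imbalance hurts, here is a back-of-the-envelope calculation with made-up numbers (not taken from the paper): a predictor with decent-looking sensitivity and false positive rate still produces mostly false positives when causative variants are rare.

```r
# Hypothetical predictor: 80% sensitivity, 5% false positive rate,
# applied to a candidate list where 1 variant in 1000 is causative
sensitivity <- 0.8
false_positive_rate <- 0.05
n_causative <- 1
n_neutral <- 999

true_positives <- sensitivity * n_causative
false_positives <- false_positive_rate * n_neutral
precision <- true_positives / (true_positives + false_positives)
precision
# about 0.016: only one or two in a hundred flagged variants are causative
```

The ROC curve never shows this, because both of its axes are conditional on the true class and therefore blind to how rare the positive class is.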

Task 1 isn’t that well suited to the tools, and the way it’s presented is a bit silly. After describing how they compiled their evolution-based test variant set, the authors write:

Our expectation was that a pathogenic allele would receive a significantly higher impact score (as defined for each of the six tested methods) than a non-pathogenic allele at the same position. Instead, we found that these methods were unsuccessful at this task. In fact, four of them (LINSIGHT, EIGEN, GWAVA, and CATO) reported identical scores for all alternative alleles at every position as they were not designed for allelic contrasts …

Sure, it’s hard to solve this problem with a program that only produces one score per site, but you knew that when you started writing this paragraph, didn’t you?

The whole paper is useful, but to me, the most interesting insight is that variants close to each other tend to have correlated features, meaning that there is little power to tell them apart (Task 2). This might be obvious if you think about it (e.g., if two variants fall in the same enhancer, how different can their chromatin state and histone modifications really be?), but I guess I haven’t thought that hard about it before. This high correlation is unfortunate, because that means that methods for finding causative variants (association and variant effect prediction) have poor spatial resolution. We might need something else to solve the fine mapping problem.

Figure 4 from Liu et al., showing correlation between features of linked variants.

Finally, shout-out to Reviewer 1 whose comment gave rise to these sentences:

An alternative approach is to develop a composite score that may improve upon individual methods. We examined one such method, namely PRVCS, which unfortunately had poor performance (Supplementary Figure 11).

I thought this read like something prompted by an eager beaver reviewer, and thanks to Nature Communications’ open review policy, we can confirm my suspicions. So don’t say that open review is useless.

Comment R1.d. Line 85: It would be interesting to see if a combination of the examined scores would better distinguish between pathogenic and non-pathogenic non-coding regions. Although we suspect there to be high correlation between features this will test the hypothesis that each score may not be sufficient on its own to make any distinction between pathogenic and non-pathogenic ncSNVs. However, a combined model might provide more discriminating power than individual scores, suggesting that each score captures part of the underlying information with regards to a region’s pathogenicity propensity.

Literature

Liu, L., Sanderford, M. D., Patel, R., Chandrashekar, P., Gibson, G., & Kumar, S. (2019). Biological relevance of computationally predicted pathogenicity of noncoding variants. Nature Communications, 10(1), 330.

Journal club of one: ‘The heritability fallacy’

Public debate about genetics often seems to centre on heritability and on psychiatric and mental traits, maybe because we really care about our minds, and because for a long time heritability was all human geneticists studying quantitative traits could estimate. Here is an anti-heritability paper that I think articulates many of the common grievances: Moore & Shenk (2016) The heritability fallacy. The abstract gives a snappy summary of the argument:

The term ‘heritability,’ as it is used today in human behavioral genetics, is one of the most misleading in the history of science. Contrary to popular belief, the measurable heritability of a trait does not tell us how ‘genetically inheritable’ that trait is. Further, it does not inform us about what causes a trait, the relative influence of genes in the development of a trait, or the relative influence of the environment in the development of a trait. Because we already know that genetic factors have significant influence on the development of all human traits, measures of heritability are of little value, except in very rare cases. We, therefore, suggest that continued use of the term does enormous damage to the public understanding of how human beings develop their individual traits and identities.

At first glance, this should be a paper for me. I tend to agree that heritability estimates of human traits aren’t very useful. I also agree that geneticists need to care about how their claims are interpreted beyond the purely scientific domain. But the more I read, the less excited I became. The paper is a list of complaints about heritability coefficients, some more convincing than others. For example, I find it hard to worry too much about the ‘equal environments assumption’ in twin studies. But sure, it’s hard to identify variance components, and in practice, researchers sometimes resort to designs that are a lot iffier than twin studies.

But I think the main thrust of the paper is this huge overstatement:

Most important of all is a deep flaw in an assumption that many people make about biology: That genetic influences on trait development can be separated from their environmental context. However, contemporary biology has demonstrated beyond any doubt that traits are produced by interactions between genetic and nongenetic factors that occur in each moment of developmental time … That is to say, there are simply no such things as gene-only influences.

There certainly is such a thing as additive genetic variance as well as additive gene action. This passage only makes sense to me if ‘interaction’ is interpreted not as a statistical term but as describing a causal interplay. If so, it is perfectly true that all traits are the outcomes of interplay between genes and environment. It doesn’t follow that genetic variants in populations will interact with variable environments to the degree that quantitative genetic models are ‘nonsensical in most circumstances’.
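To make that concrete, here is a toy simulation (invented numbers, no connection to any real trait) of a purely additive polygenic trait with plenty of environmental noise. Regressing offspring values on midparent values, the classic way of estimating heritability, recovers roughly the proportion of variance that is additive genetic, in this case about half.

```r
set.seed(1)
n_fam <- 2000    # number of families
n_loci <- 100    # biallelic loci with purely additive effects

# Parental genotypes: allele counts (0, 1 or 2) at each locus
geno_mother <- matrix(rbinom(n_fam * n_loci, 2, 0.5), nrow = n_fam)
geno_father <- matrix(rbinom(n_fam * n_loci, 2, 0.5), nrow = n_fam)

# Each parent passes on one allele per locus, drawn according to its genotype
transmit <- function(geno) {
    matrix(rbinom(length(geno), 1, geno / 2), nrow = nrow(geno))
}
geno_child <- transmit(geno_mother) + transmit(geno_father)

# Additive effects, plus an environmental term scaled so that
# the true heritability is about 0.5
effects <- rnorm(n_loci, 0, 0.1)
make_trait <- function(geno) {
    as.vector(geno %*% effects) + rnorm(nrow(geno), 0, sqrt(0.5))
}

midparent <- (make_trait(geno_mother) + make_trait(geno_father)) / 2
trait_child <- make_trait(geno_child)

# The slope of offspring on midparent estimates the narrow-sense heritability
h2_estimate <- unname(coef(lm(trait_child ~ midparent))[2])
h2_estimate
# somewhere around 0.5
```

With real data we don’t know the effects, or even which loci are involved; the point is only that additive genetic variance is a perfectly coherent quantity to estimate, interaction at the developmental level notwithstanding.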

They illustrate with this parable: Billy and Suzy are filling a bucket. Suzy is holding the hose and Billy turns on the tap. How much of the water is due to Billy and how much is due to Suzy? The answer is supposed to be that the question makes no sense, because they are both filling the bucket through a causal interplay. Well. If they’re filling a dozen buckets, and halfway through, Billy opens the tap half a turn more, and Suzy starts moving faster between buckets because she’s tired of this and wants lunch, then it suddenly makes sense to ask how much of the variation in water level between buckets is due to Billy and how much is due to Suzy. The correct level of analysis for the quantitative bucketist isn’t Billy, Suzy and the hose. It is the half-turn of the tap and Suzy’s moving of the nozzle.

The point is that quantitative genetic models describe variation between individuals. The authors know this, of course, but they write as if genetic analysis of variance is some kind of sleight of hand, as if quantitative genetics ought to be about development, and the fact that it isn’t is a deliberate obfuscation. Here is how they describe Jay Lush’s understanding of heritability:

The intention was ‘to quantify the level of predictability of passage of a biologically interesting phenotype from parent to offspring’. In this way, the new technical use of ‘heritability’ accurately reflected that period’s understanding of genetic determinism. Still, it was a curious appropriation of the term, because—even by the admission of its proponents—it was meant only to represent how variation in DNA relates to variation in traits across a population, not to be a measure of the actual influence of genes on the development of any given trait.

I have no idea what position Lush took on genetic determinism. But we can find the context of heritability by looking at the very page before in Animal breeding plans. The definition of the heritability coefficient occurs on page 87. This is how Lush starts the chapter on page 86:

In the strictest sense of the word, the question of whether a characteristic is hereditary or environmental has no meaning. Every characteristic is both hereditary and environmental, since it is the end result of a long chain of interactions of the genes with each other, with the environment and with the intermediate products at each stage of development. The genes cannot develop the characteristic unless they have the proper environment, and no amount of attention to the environment will cause the characteristic to develop unless the necessary genes are present. If either the genes or the environment are changed, the characteristic which results from their interactions may be changed.

I don’t know — maybe the way quantitative genetics has been used in human behavioural and psychiatric genetics invites genetic determinism. Or maybe genetic determinism is one of those false common-sense views that are really hard to unlearn. In any case, I don’t think it’s reasonable to put the blame on the concept of heritability for not being some general ‘measure of the biological inheritability of complex traits’ — something that it was never intended to be, and cannot possibly be.

My guess is that new debates will be about polygenic scores and genomic prediction. I hope that will be more useful.

Literature

David S. Moore & David Shenk (2016) The heritability fallacy

Jay Lush Animal breeding plans. Online at: https://archive.org/details/animalbreedingpl032391mbp/page/n99