Various positions II

Again, what good is a blog if you can’t post your arbitrary idiosyncratic opinions as if you were an authority?

Don’t make a conference app

I get it, you can’t print a full-blown paper program book: it is too much, no one reads it, and it feels wasteful. But please, please, for the love of everything holy, don’t make an app. Put the text, straight up, on a website in plain text. It loads quickly, it’s searchable, and it can be generated automatically. The conference app will be clunky, take up space on the phone, eat bandwidth on some strained mobile contract, and invariably freeze.

Posters, still bad in 2020

Don’t believe the lies: a once-folded canvas poster will never look good again. You haven’t had fun at a conference until you’ve tried ironing a poster on a hostel floor with an iron that belongs in a museum.

Poster sessions are bad by necessity. If they were to have the space and time to be anything other than a crowded mess, the conference would have to accept substantially fewer posters. That means fewer participants, especially early-career participants, and the value of having them outweighs the value of a somewhat better poster session.

Gene accession numbers

PLOS Genetics has a great policy in their submission guidelines that doesn’t seem to get followed very much in papers they actually publish. This should be the norm in every genetics paper. I feel bad that it’s not the case in all my papers.

As much as possible, please provide accession numbers or identifiers for all entities such as genes, proteins, mutants, diseases, etc., for which there is an entry in a public database, for example:

Ensembl
Entrez Gene
FlyBase
InterPro
Mouse Genome Database (MGD)
Online Mendelian Inheritance in Man (OMIM)
PubChem

Identifiers should be provided in parentheses after the entity on first use.

In the future, with the right ontologies and repositories in place, I hope this will be the case with traits, methods and so on as well.

UK Biobank and dbGAP are not open data

And that is fine.

Stop it with the work-life balance tweets

No one should tweet about work-life balance; whether you write about how much you work or how diligent you are about your hours, it comes off as bragging.

Tenses

Write your papers in the past or present tense, whichever you prefer. In the context of a scientific paper, the difference between past and present communicates nothing. I suppose you’re not supposed to mix tenses, but that doesn’t matter either. Most readers probably won’t notice. If you ask me about my stylistic opinion: present tense for everything. But again, it doesn’t matter.

Journal club of one: ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes”

Genome assembly researchers are still figuring out new wild ways of combining different kinds of data. For example, ”trio binning” took what used to be a problem — the genetic difference between the two genome copies that a diploid individual carries — and turned it into a feature: if you assemble a hybrid individual with genetically distant parents, you can separate the two copies and get two genomes in one. (I said that admixture was the future of every part of genetics, didn’t I?) This paper (Campoy et al. 2020) describes ”gamete binning”, which uses sequencing of gametes to perform a similar trick.

Expressed another way, gamete binning means building an individual-specific genetic map and then using it to order and phase the pieces of the assembly. This means taking two sequence datasets from the same individual — one single-cell short-read dataset from gametes (10X linked reads) and one long-read dataset from the parent (PacBio) — and creatively re-using them in different ways.

This is what they do:

1. Assemble the long reads into a preliminary assembly, which will be a mosaic of the two genome copies (barring regions so different that they assemble separately, ”haplotigs”, which can to some extent be removed).

2. Align the single cell short reads to the preliminary assembly and call SNPs. (They also did some tricks to deal with regions without SNPs, separating those that were not variable between genomes and those that were deleted in one genome.) Because the gametes are haploid, they get the phase of the parent’s genotype.

3. Align the long reads again. Now, based on the phased genotype, the long reads can be assigned to the genome copy they belong to. So they can partition the reads into one bin per genome copy and chromosome (see the sketch after this list).

4. Assemble those bins separately. They now get one assembly for each homologous chromosome.
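
To make step 3 concrete, here is a toy sketch in R (my own illustration, not the paper’s implementation) of how a long read could be assigned to a bin: compare the alleles the read carries at phased SNP positions to the two haplotypes, and pick the haplotype that matches best.

## Toy haplotype binning: "read" holds the alleles a long read carries
## at the phased SNP positions it covers; "haplotype1" and "haplotype2"
## hold the phased parental alleles at the same positions.
assign_read_to_bin <- function(read, haplotype1, haplotype2) {
    matches1 <- sum(read == haplotype1, na.rm = TRUE)
    matches2 <- sum(read == haplotype2, na.rm = TRUE)
    if (matches1 > matches2) {
        "haplotype1"
    } else if (matches2 > matches1) {
        "haplotype2"
    } else {
        "unassigned"
    }
}

## Made-up alleles at five SNPs covered by the read
assign_read_to_bin(read       = c("A", "T", "G", "G", "C"),
                   haplotype1 = c("A", "T", "G", "G", "C"),
                   haplotype2 = c("C", "A", "G", "T", "C"))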

They apply it to an apricot tree, which has a 250 Mbp genome. When they check the result against sequence data from the tree’s parents, the method seems to separate the genome copies well. The two genome copies have quite a bit of structural variation:

Despite high levels of synteny, the two assemblies revealed large-scale rearrangements (23 inversions, 1,132 translocation/transpositions and 2,477 distal duplications) between the haplotypes making up more than 15% of the assembled sequence (38.3 and 46.2 Mb in each of assemblies; Supplementary Table 1). /…/ Mirroring the huge differences in the sequences, we found the vast amount of 942 and 865 expressed, haplotype-specific genes in each of the haplotypes (Methods; Supplementary Tables 2-3).

They can then go back to the single cell data and look at the recombination landscape and at chromosomal arrangements during meiosis.

This is pretty elegant. I wonder how dependent it is on the level of variation within the individual, and how it compares in cost and finickiness to other assembly strategies.

Literature

Campoy, José A., et al. ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes.” bioRxiv (2020).

What is a locus, anyway?

”Locus” is one of those confusing genetics terms (its meaning, not just its pronunciation). We can probably all agree with a dictionary and with Wikipedia that it means a place in the genome, but a place of what, and in what sense? We also use place-related words like ”site” and ”region” that one might think were synonymous, but they don’t seem to be.

For an example, we can look at this relatively recent preprint (Chebib & Guillaume 2020) about a model of the causes of genetic correlation. They have pairs of linked loci that each affect one trait (that’s the tight linkage condition), a set of loci that affect both traits (the pleiotropic condition), correlated Gaussian stabilising selection, and different levels of mutation, migration and recombination between the linked pairs. A mutation means adding a number to the effect of an allele.

This means that loci in this model can have a large number of alleles with quantitatively different effects. The alleles at a locus share a distribution of mutational effects, which can be either two-dimensional (with pleiotropy) or one-dimensional. They also share a constant recombination rate with all other loci.
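
As a minimal sketch of how I read the mutation model (with a made-up mutational standard deviation, not a value from the paper): a mutation adds a Gaussian draw to the allelic effect, and with pleiotropy the allele has one effect per trait.

## Minimal sketch of the mutation model: a mutation adds a Gaussian
## draw to the allelic effect. With pleiotropy, the allele affects two
## traits, so the effect and the draw are two-dimensional.
mutate_allele <- function(allele_effect, mutation_sd = 0.1) {
    allele_effect + rnorm(length(allele_effect), mean = 0, sd = mutation_sd)
}

## A pleiotropic allele with made-up effects on two traits
pleiotropic_allele <- c(trait1 = 0.5, trait2 = -0.2)
mutate_allele(pleiotropic_allele)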

What kind of DNA sequences can have these properties? Single nucleotide sites are out of the question, as they can have at most four alleles, or maybe five if you count a deletion. Larger structural variants, such as inversions or allelic series of indels, might work. A protein-coding gene taken as a unit could have a huge number of different alleles, but they would probably have different distributions of mutational effects at different sites, and (relatively small) differences in genetic distance to other loci.

It seems to me that we’re talking about an abstract group of potential alleles that have sufficiently similar effects and that are sufficiently closely linked. This is fine; I’m not saying this to criticise the model, but to explore how strange a locus really is.

They find that there is less genetic correlation with linkage than with pleiotropy, unless the mutation rate is high, which leads to a discussion about mutation rate. This reasoning about the mutation rate of a locus illustrates the issue:

A high rate of mutation (10^−3) allows for multiple mutations in both loci in a tightly linked pair to accumulate and maintain levels of genetic covariance near to that of mutations in a single pleiotropic locus, but empirical estimations of mutation rates from varied species like bacteria and humans suggests that per-nucleotide mutation rates are in the order of 10^−8 to 10^−9 … If a polygenic locus consists of hundreds or thousands of nucleotides, as in the case of many quantitative trait loci (QTLs), then per-locus mutation rates may be as high as 10^−5, but the larger the locus the higher the chance of recombination between within-locus variants that are contributing to genetic correlation. This leads us to believe that with empirically estimated levels of mutation and recombination, strong genetic correlation between traits are more likely to be maintained if there is an underlying pleiotropic architecture affecting them than will be maintained due to tight linkage.

I don’t know if it’s me or the authors who are conceptually confused here. If they are referring to QTL mapping, it is true that the quantitative trait loci we detect in mapping studies are often huge. ”Thousands of nucleotides” is being generous to mapping studies: in many cases, we’re talking millions of them. But the size of a QTL region from a mapping experiment doesn’t tell us how many nucleotides in it matter to the trait. It reflects our poor resolution in delineating the causative variant or variants that give rise to the association signal. That being said, it might be possible to use tricks like saturation mutagenesis to figure out which mutations within a relevant region could affect a trait. Then, we could actually observe a locus in the above sense.

Another recent theoretical preprint (Chantepie & Chevin 2020) phrases it like this:

[N]ote that the nature of loci is not explicit in this model, but in any case these do not represent single nucleotides or even genes. Rather, they represent large stretches of effectively non-recombining portions of the genome, which may influence the traits by mutation. Since free recombination is also assumed across these loci (consistent with most previous studies), the latter can even be thought of as small chromosomes, for which mutation rates of the order of 10^−2 seem reasonable.

Literature

Chebib and Guillaume. ”Pleiotropy or linkage? Their relative contributions to the genetic correlation of quantitative traits and detection by multi-trait GWA studies.” bioRxiv (2019): 656413.

Chantepie and Chevin. ”How does the strength of selection influence genetic correlations?” bioRxiv (2020).

Journal club of one: ”Versatile simulations of admixture and accurate local ancestry inference with mixnmatch and ancestryinfer”

Admixture is the future of every sub-field of genetics, just in case you didn’t know. Both in the wild and among domestic animals, populations or even species sometimes cross. This causes different patterns of relatedness than in well-mixed populations. Often we want to estimate ”local ancestry”, that is: which source population a piece of chromosome in an individual originates from. It is one of those genetics problems that is made harder by the absence of any way to observe it directly.

This recent paper (Schumer et al. 2020; preprint version, which I read, here) presents a method for simulating admixed sequence data, and a method for inferring local ancestry from it. It does something I like, namely pairing analysis with fake-data simulation to check methods.

The simulation method is built from four different simulators:

1. macs (Chen, Marjoram & Wall 2009), which creates polymorphism data under neutral evolution from a given population history. They use macs to generate starting chromosomes from two ancestral populations.

2. Seq-Gen (Rambaut & Grassly 1997). Chromosomes from macs are strings of 0s and 1s representing the state at biallelic markers. If you want DNA-level realism, with base composition, nucleotide substitution models and so on, you need something else. I don’t really follow how they do this. You can tell from the source code that they use the local trees that macs spits out, which Seq-Gen can then simulate nucleotides from. As they put it, the resulting sequence ”lacks other complexities of real genome sequences such as repetitive elements and local variation in base composition”, but it is a step up from ”0000110100”.

3. SELAM (Corbett-Detig & Jones 2016), which simulates admixture between populations with population history and possibly selection. Here, SELAM‘s role is to simulate the actual recombination and interbreeding that create the patterns of local ancestry, which they will then fill in with the sequences they generated before (see the toy sketch after this list).

4. wgsim, which simulates short reads from a sequence. At this point, mixnmatch has turned a set of population genetic parameters into fasta files. That is pretty cool.
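
As a toy illustration of what the pipeline accomplishes, here is a sketch (my own, with made-up sequences and tract coordinates; the real software works at a completely different scale) of that filling-in step: stitching an admixed chromosome together from two source sequences according to a table of local ancestry tracts.

## Build an admixed chromosome by combining pieces of two source
## sequences according to local ancestry tracts.
source1 <- strsplit("AAAAAAAAAAAAAAAAAAAA", "")[[1]]
source2 <- strsplit("CCCCCCCCCCCCCCCCCCCC", "")[[1]]

tracts <- data.frame(start  = c(1, 8, 15),
                     end    = c(7, 14, 20),
                     source = c(1, 2, 1))

admixed <- character(length(source1))
for (tract_ix in seq_len(nrow(tracts))) {
    positions <- tracts$start[tract_ix]:tracts$end[tract_ix]
    if (tracts$source[tract_ix] == 1) {
        admixed[positions] <- source1[positions]
    } else {
        admixed[positions] <- source2[positions]
    }
}

paste(admixed, collapse = "")
## "AAAAAAACCCCCCCAAAAAA"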

On the one hand, building on tried and true tools seems to be the right thing to do: less wheel-reinventing. It’s great that the phylogenetic simulator Seq-Gen from 1997 can be used in a paper published in 2020. On the other hand, looking at the dependencies for running mixnmatch made me a little pale: seven different bioinformatics or population genetics software packages (not including the dependencies you need to compile them), R, Perl and Python plus Biopython. Computational genetics is an adventure of software installation.

They use the simulator to test the performance of a hidden Markov model for inferring local ancestry (Corbett-Detig & Nielsen 2017) with different population histories and settings, and then apply it to swordtail fish data. In particular, one needs to set thresholds for picking ”ancestry informative” (i.e. sufficiently differentiated) markers between the ancestral populations, and that depends on population history and diversity.

In passing, they use it to estimate the swordtail recombination landscape:

We used the locations of observed ancestry transitions in 139 F2 hybrids that we generated between X. birchmanni and X. malinche … to estimate the recombination rate in 5 Mb windows. … We compared inferred recombination rates in this F2 map to a linkage disequilibrium based recombination map for X. birchmanni that we had previously generated (Schumer et al., 2018). As expected, we observed a strong correlation in estimated recombination rate between the linkage disequilibrium based and crossover maps (R=0.82, Figure 4, Supporting Information 8). Simulations suggest that the observed correlation is consistent with the two recombination maps being indistinguishable, given the low resolution of the F2 map (Supporting Information 8).

Twin lambs with different fathers

I just learned that in sheep, lambs from the same litter quite often have different fathers if the ewe has mated with several males. Berry et al. (2020) looked at sheep flocks in Ireland that used more than one ram, and:

Of the 539 pairs of twins included in the analysis, 160 (i.e. 30%) were sired by two different rams. Of the 137 sets of triplets included in the analysis, 73 (i.e. 53%) were sired by more than one ram. Of the nine sets of quadruplets, eight were sired by two rams with the remaining litter being mono‐paternal. The overall incidence of heteropaternal superfecundation among litters was therefore 35%. Given that the incidence of multiple births in these flocks was 65%, heteropaternal superfecundation is expected to be relatively common in sheep; this is especially true as all but two of the litter‐mates were polyzygotic.

They figured this out by looking at individuals genotyped on SNP chips with tens of thousands of SNPs, with both lambs and the potential parents genotyped, so there can’t be much uncertainty in the assignment. You don’t need that many genotyped markers to get a confident assignment, and they don’t have that many rams to choose from.

Time for some Mendelian inheritance

Let’s simulate a situation like this: We set up a population and a marker panel for genotyping, split them into ewes and rams, and make some lambs.

library(AlphaSimR)

founderpop <- runMacs(nInd = 105,
                      nChr = 10,
                      segSites = 100)

simparam <- SimParam$new(founderpop)
simparam$setGender("no")

simparam$addSnpChip(nSnpPerChr = 100)

parents <- newPop(founderpop,
                  simParam = simparam)

ewes <- parents[1:100]
rams <- parents[101:105]

lambs <- randCross2(females = ewes,
                    males = rams,
                    nCrosses = 100,
                    nProgeny = 2,
                    simParam = simparam)

Now, if we have the genotypes of a lamb and its mother, how do we know the father? In this paper, they use exclusion methods: they compared the genotypes of the offspring with the parents and used inheritance rules to exclude rams that can’t be the father, because if they were, the offspring couldn’t have the genotypes it has. Such breaking of regular inheritance patterns would be a ”Mendelian inconsistency”. This is the simplest kind of parentage assignment; fancier methods will calculate the probabilities of different genotypes, and allow you to reconstruct unknown relationships.

We can do this in two ways:

  • ignore the ewe’s genotypes and look for opposite homozygotes between lamb and ram, which are impossible regardless of the mother’s genotype
  • use both the ewe’s and ram’s genotypes to look what lamb genotypes are possible from a cross between them; this adds a few more cases where we can exclude a ram even if the lamb is heterozygous

To do the first, we count the number of opposite homozygous markers. In this genotype coding, 0 and 2 are the homozygotes, and 1 is the heterozygote.

opposite_homozygotes <- function(ram,
                                 lamb) {
    sum(lamb == 0 & ram == 2) +
        sum(lamb == 2 & ram == 0)
}

When we include the ewe's genotype, there are a few more possible cases. We could enumerate all of them, but here is some R code to generate them. We first get all possible gametes from each parent, combine the gametes in all possible combinations, and that gives us the possible lamb genotypes at that marker. If the lamb does not, in fact, have any of those genotypes, we declare the marker inconsistent. Repeat for all markers.

## Generate the possible gametes from a genotype

possible_gametes <- function(genotype) {

    if (genotype == 0) {
        gametes <- 0
    } else if (genotype == 1) {
        gametes <- c(0, 1)
    } else if (genotype == 2) {
        gametes <- 1
    }

    gametes
}

## Generate the possible genotypes for an offspring from
## parent possible gametes

possible_genotypes <- function(father_gametes,
                               mother_gametes) {

    possible_combinations <- expand.grid(father_gametes, mother_gametes)
    resulting_genotypes <- rowSums(possible_combinations)
    unique(resulting_genotypes)
}

## Check offspring genotypes for consistency with parent genotypes

mendelian_inconsistency <- function(ewe,
                                    ram,
                                    lamb) {

    n_markers <- length(ewe)
    inconsistent <- logical(n_markers)

    for (marker_ix in 1:n_markers) {

        possible_lamb_genotypes <-
          possible_genotypes(possible_gametes(ewe[marker_ix]),
                             possible_gametes(ram[marker_ix]))

        inconsistent[marker_ix] <-
          !lamb[marker_ix] %in% possible_lamb_genotypes
    }

    sum(inconsistent)
}

(These functions assume that we have genotypes in vectors. The full code that extracts this information from the simulated data and repeats it for all markers is on GitHub.)
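
Here is a usage sketch with tiny made-up genotype vectors for a lamb, its mother and two candidate rams:

## Made-up genotypes at five markers (0 and 2 homozygous, 1 heterozygous)
ewe   <- c(0, 1, 2, 1, 0)
lamb  <- c(1, 1, 2, 0, 0)
ram_A <- c(2, 1, 1, 0, 0)
ram_B <- c(2, 0, 0, 2, 2)

opposite_homozygotes(ram_A, lamb)         ## 0, consistent with paternity
opposite_homozygotes(ram_B, lamb)         ## 3, excluded

mendelian_inconsistency(ewe, ram_A, lamb) ## 0
mendelian_inconsistency(ewe, ram_B, lamb) ## 3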

Here is the outcome for a set of random lambs. The red dots point out the true fathers: because we have perfect genotype data simulated without errors, the true father always has 100% consistent markers.

If we compare how many markers are found inconsistent with the two methods, we get a pattern like this graph. Including the ewe’s genotypes lets us discover a lot more inconsistent markers, but in this case, with plentiful and error-free markers, it doesn’t change the assignment.

Thresholds and errors

If I have any complaint with the paper, it’s that the parentage analysis isn’t really described in the methods. This is what it says:

Parentage testing using simple exclusion‐based approaches is determined by the proportion of opposing homozygotes in putative sire–offspring pairs.

/…/

Maternal verification was undertaken using the exclusion method (Double et al. 1997) comparing the genotype of the dam with that of her putative progeny and only validated dam–offspring pairs were retained. Genotypes of the mature rams in the flock were compared with all lambs born in that flock using the exclusion method.

(The reference is related to exclusion methods, but it’s describing how to calculate exclusion probabilities in a certain circumstance. That is, it’s part of a methodological conversation about exclusion methods, but doesn’t actually describe what they did.)

I don’t doubt that they did it well. Still, it would be interesting to know the details, because in the absence of perfect genotype data, they must have had some thresholds for error and some criterion for deciding which ram was right, even if it seemed obvious.

Literature

Berry, D. P., et al. ”Heteropaternal superfecundation frequently occurs in multiple‐bearing mob‐mated sheep.” Animal Genetics (2020).

Journal club of one: ”Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses”

This paper (Wallace 2020) is about improvements to the colocalisation method for genome-wide association studies called coloc. If you have an association to trait 1 in a region, and another association with trait 2, coloc investigates whether they are caused by the same variant or not. I’ve never used coloc, but I’m interested because setting reasonable priors is related to getting reasonable parameters for genetic architecture.

The paper also looks at how coloc is used in the literature (with default settings, unsurprisingly), and extends coloc to relax the assumption of only one causal variant per region. In that way, it’s a solid example of thoughtfully updating a popular method.

(A note about style: This isn’t the clearest paper, for a few reasons. The structure of the introduction is indirect, talking a lot about Mendelian randomisation before concluding that coloc isn’t Mendelian randomisation. The paper also uses numbered hypotheses H1-H4 instead of spelling out what they mean … If you feel a little stupid reading it, it’s not just you.)

coloc is what we old QTL mappers call a pleiotropy versus linkage test. It tries to distinguish five scenarios: no association, trait 1 only, trait 2 only, both traits with linked variants, both traits with the same variant.

This paper deals with the priors: What is the prior probability of a causal association to trait 1 only p_1, trait 2 only p_2, or both traits p_{12}, and are the defaults good?

They reparametrise the priors so that it becomes possible to get some estimates from the literature. They work with the probability that a SNP is causally associated with each trait (which means adding the probabilities of association q_1 = p_1 + p_{12}) … This means that you can look at single-trait association data, and get an idea of the number of marginal associations, possibly dependent on allele frequency. The estimates from a gene expression dataset and a genome-wide association catalog work out to a prior around 10^{-4}, which is the coloc default. So far so good.

How about p_{12}?

If traits were independent, you could just multiply q_1 and q_2. But not all of the genome is functional. If you could straightforwardly define a functional proportion, you could just divide by it.
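
In numbers, that reasoning might look like this (toy values of my own choosing, not estimates from the paper):

## Per-SNP prior probabilities of causal association with each trait
q1 <- 1e-4
q2 <- 1e-4

## If the traits were independent and every SNP could be causal:
q1 * q2                          ## 1e-8

## If only a fraction of the genome is functional, divide by it
## (the functional proportion here is a made-up number)
functional_proportion <- 0.1
q1 * q2 / functional_proportion  ## 1e-7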

You could also look at the genetic correlation between traits. It makes sense that the overall genetic relationship between two traits should inform the prior for overlap at this particular locus. This gives a lower limit for p_{12}. Unfortunately, this still leaves us dependent on what kinds of traits we’re analysing. Perhaps it’s not so surprising that there isn’t one prior that universally works for all kinds of pairs of traits:

Attempts to colocalise disease and eQTL signals have ranged from underwhelming to positive. One key difference between outcomes is the disease-specific relevance of the cell types considered, which is consistent with variable chromatin state enrichment in different GWAS according to cell type. For example, studies considering the overlap of open chromatin and GWAS signals have convincingly shown that tissue relevance varies by up to 10 fold, with pancreatic islets of greatest relevance for traits like insulin sensitivity and immune cells for immune-mediated diseases. This suggests that p_{12} should depend explicitly on the specific pair of traits under consideration, including cell type in the case of eQTL or chromatin mark studies. One avenue for future exploration is whether fold change in enrichment of open chromatin/GWAS signal overlap between cell types could be used to modulate p_{12} and select larger values for more a priori relevant tissues.

Literature

Wallace, Chris. ”Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses.” PLoS Genetics 16.4 (2020): e1008720.

Adrian Bird on genome ecology

I recently read this essay by Adrian Bird on ”The Selfishness of Law-Abiding Genes”. That is a colourful title in itself, but it doesn’t stop there; this is an extremely metaphor-rich piece. In terms of the theoretical content, there is not much new under the sun. Properties of the organism like complexity, redundancy, and all those exquisite networks of developmental gene regulation may be the result of non-adaptive processes, like constructive neutral evolution and intragenomic conflict. As the title suggests, Bird argues that this kind of thinking is generally accepted about things like transposable elements (”selfish DNA”), but that the same logic applies to regular ”law-abiding” genes. They may also be driven by other evolutionary forces than a net fitness gain at the organismal level.

He gives a couple of possible examples: toxin–antitoxin gene pairs, RNA editing, and MeCP2 (probably Bird’s favourite protein; he has done a lot of work on it). He gives this possible description of MeCP2 evolution:

Loss of MeCP2 via mutation in humans leads to serious defects in the brain, which might suggest that MeCP2 is a fundamental regulator of nervous system development. Evolutionary considerations question this view, however, as most animals have nervous systems, but only vertebrates, which account for a small proportion of the animal kingdom, have MeCP2. This protein therefore appears to be a late arrival in evolutionary terms, rather than being a core ancestral component of brain assembly. A conventional view of MeCP2 function is that by exerting global transcriptional restraint it tunes gene expression in neurons to optimize their identity, but it is also possible to devise a scenario based on self-interest. Initially, the argument goes, MeCP2 was present at low levels, as it is in non-neuronal tissues, and therefore played little or no role in creating an optimal nervous system. Because DNA methylation is sparse in the great majority of the genome, sporadic mutations that led to mildly increased MeCP2 expression would have had a minimal dampening effect on transcription that may initially have been selectively neutral. If not eliminated by drift, further chance increases might have followed, with neuronal development incrementally adjusting to each minor hike in MeCP2-mediated repression through compensatory mutations in other genes. Mechanisms that lead to ‘constructive neutral evolution’ of this kind have been proposed. Gradually, brain development would accommodate the encroachment of MeCP2 until it became an essential feature. So, in response to the question ‘why do brains need MeCP2?’, the answer under this speculative scenario would be: ‘they do not; MeCP2 has made itself indispensable by stealth’.

I think this is a great passage, and it can be read both as a metaphorical reinterpretation and as a substantive hypothesis. The empirical question ”Did MeCP2 offer an important innovation to vertebrate brains as it arose?” is a bit hard to answer with data, though. On the other hand, if we just consider the metaphor, can’t you say the same about every functional protein? Sure, it’s nice to think of p53 as the Guardian of the Genome, but can’t it also be viewed as a gangster extracting protection money from the organism? ”Replicate me, or you might get cancer later …”

The piece argues for a gene-centric view that thinks of molecules and the evolutionary pressures they face. This doesn’t seem to be the fashionable view (sorry, extended synthesists!), but Bird argues that it would be healthy for molecular cell biologists to think more about the alternative, non-adaptive, bottom-up perspective. I don’t think the point is to advocate that way of thinking to the exclusion of all others. To me, the piece reads more like an invitation to use a broader set of metaphors and verbal models to aid hypothesis generation.

There are too many good quotes in this essay, so I’ll just quote one more from the end, where we’ve jumped from the idea of selfish law-abiding genes, over ”genome ecology” — not in the sense of using genomics in ecology, but in the sense of thinking of the genome as some kind of population of agents with different niches and interactions, I guess — to ”Genetics Meets Sociology?”

Biologists often invoke parallels between molecular processes of life and computer logic, but a gene-centered approach suggests that economics or social science may be a more appropriate model …

I feel like there is a circle of reinforcing metaphors here. Sometimes when we have to explain how something came to be, for example a document, a piece of computer code, or the way we do things in an organisation, we say ”it grew organically” or ”it evolved”. Sometimes we talk about the genome as a computer program, and sometimes we talk about our messy computer program code as an organism. As if viruses were just like computer viruses, only biological.

Literature

Bird, Adrian. ”The Selfishness of Law-Abiding Genes.” Trends in Genetics 36.1 (2020): 8-13.

Journal club of one: ”Genomic predictions for crossbred dairy cattle”

A lot of dairy cattle are crossbred, but genomic evaluation is often done within breed. What about the crossbred individuals? This paper (VanRaden et al. 2020) describes the US Council on Dairy Cattle Breeding’s crossbred genomic prediction that started in 2019.

In short, the method goes like this: they describe each crossbred individual in terms of its ”genomic breed composition”, get predictions for it based on models from each of the breeds separately, and then combine the results in proportion to the genomic breed composition. The paper describes how they estimate the genomic breed composition, and how they evaluated accuracy by predicting held-out new data from older data.

The genomic breed composition is a delightfully elegant hack: they treat ”how much breed X is this animal” as a series of traits and run a genomic evaluation on them. The training set: individuals from sets of reference breeds with their trait value set to 100% for the breed they belong to and 0% for the other breeds. ”Marker effects for GBC [genomic breed composition] were then estimated using the same software as for all other traits.” Neat. After some adjustment, the predictions can be interpreted as breed percentages, called ”base breed representation”.

As they already run genomic evaluations for each breed, they can take these marker effects and the animal’s genotypes, and get one estimate for each breed. Then they combine them, weighting by the base breed representation.
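
As a sketch of that combination step (breed names and numbers made up for illustration):

## Within-breed genomic predictions for one crossbred animal
breed_prediction <- c(holstein = 120, jersey = 80, brown_swiss = 100)

## Base breed representation, as proportions summing to one
bbr <- c(holstein = 0.50, jersey = 0.25, brown_swiss = 0.25)

## Combined prediction: weighted average of the within-breed predictions
sum(breed_prediction * bbr)  ## 105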

Does it work? Yes, in the sense that it provides genomic estimates for animals that otherwise wouldn’t have any, and that it beats parent average estimates.

Accuracy of GPTA was higher than that of [parent average] for crossbred cows using truncated data from 2012 to predict later phenotypes in 2016 for all traits except productive life. Separate regressions for the 3 BBR categories of crossbreds suggest that the methods perform equally well at 50% BBR, 75% BBR, and 90% BBR.

They mention in passing comparing these estimates to estimates from a common set of marker effects for all breeds, but there is no detail about that model or how it compared in accuracy.

The discussion starts with this sentence:

More breeders now genotype their whole herds and may expect evaluations for all genotyped animals in the future.

That sounds like a reasonable expectation, doesn’t it? Before, all they could do with crossbred genotypes was throw them away. There are lots of other things that might be possible with crossbred evaluation in the future (pulling crossbred data into the evaluation itself, accounting for ancestry in different parts of the genome, estimating breed of origin of alleles, looking at dominance, etc.).

My favourite result in the paper is Table 8, which shows:

Example BBR for animals from different breeding systems are shown in Table 8. The HO cow from a 1964 control line had 1960s genetics from a University of Minnesota experimental selection project and a relatively low relationship to the current HO population because of changes in breed allele frequencies over the past half-century. The Danish JE cow has alleles that differ somewhat from the North American JE population. Other examples in the table show various breed crosses, and the example for an animal from a breed with no reference population shows that genetic contributions from some other breed may be evenly distributed among the included breeds so that BBR percentages sum to 100. These examples illustrate that GBC can be very effective at detecting significant percentages of DNA contributed by another breed.

Literature

VanRaden, P. M., et al. ”Genomic predictions for crossbred dairy cattle.” Journal of Dairy Science 103.2 (2020): 1620-1631.

A partial success

In 2010, Poliseno & co published some results on the regulation of a gene by a transcript from a pseudogene. Now, Kerwin & co have published a replication study, the protocol for which came out in 2015 (Khan et al. 2015). An editor summarises it like this in an accompanying commentary (Calin 2020):

The partial success of a study to reproduce experiments that linked pseudogenes and cancer proves that understanding RNA networks is more complicated than expected.

I guess he means ”partial success” in the sense that they partially succeeded in performing the replication experiments they wanted. These experiments did not reproduce the gene regulation results from 2010.

Seen from the outside — I have no insight into what is going on here or who the people involved are — something is not working here. If it takes five years from paper to replication effort, and then another five years to a replication study accompanied by an editorial commentary that subtly undermines it, we can’t expect replication studies to update the literature, can we?

Communication

What’s the moral of the story, according to Calin?

What are the take-home messages from this Replication Study? One is the importance of fruitful communication between the laboratory that did the initial experiments and the lab trying to repeat them. The lack of such communication – which should extend to the exchange of protocols and reagents – was the reason why the experiments involving microRNAs could not be reproduced. The original paper did not give catalogue numbers for these reagents, so the wrong microRNA reagents were used in the Replication Study. The introduction of reporting standards at many journals means that this is less likely to be an issue for more recent papers.

There is something right and something wrong about this. On the one hand, talking to your colleagues in the field obviously makes life easier. We would like researchers to put all pertinent information in writing, and we would like there to be good communication channels in cases where the information turns out not to be what the reader needed. On the other hand, we don’t want science to be esoteric. We would like experiments to be reproducible without special artifacts or secret sauce. If nothing else, because people’s time and willingness to provide tech support for their old papers might be limited. Of course, this is hard, in a world where the reproducibility of an experiment might depend on the length of digestion (Hines et al. 2014) or that little plastic thingamajig you need for the washing step.

Another take-home message is that it is finally time for the research community to make raw data obtained with quantitative real-time PCR openly available for papers that rely on such data. This would be of great benefit to any group exploring the expression of the same gene/pseudogene/non-coding RNA in the same cell line or tissue type.

This is true. You know how doctored, or just poor, Western blots are a notorious issue in the literature? I don’t think that’s because the Western blot as a technique is exceptionally bad, but because there is a culture of showing the raw data (the gel), so people can notice problems. However, even if I’m all for showing real-time PCR amplification curves (as well as melting curves, standard curves, and the actual batch and plate information from the runs), I doubt that it’s going to be possible to troubleshoot PCR retrospectively from those curves. Maybe sometimes one would be able to spot a PCR that looks iffy, but beyond that, I’m not sure what we would learn. PCR issues are likely to have to do with subtle things like primer design, reaction conditions and handling that can only really be tackled in the lab.

The world is messy, alright

Both the commentary and the replication study (Kerwin et al 2020) are cautious when presenting their results. I think it reads as if the authors themselves either don’t truly believe their failure to replicate or are bending over backwards to acknowledge everything that could have gone wrong.

The original study reported that overexpression of PTEN 3’UTR increased PTENP1 levels in DU145 cells (Figure 4A), whereas the Replication Study reports that it does not. …

However, the original study and the Replication Study both found that overexpression of PTEN 3’UTR led to a statistically significant decrease in the proliferation of DU145 cells compared to controls.

In the original study Poliseno et al. reported that two microRNAs – miR-19b and miR-20a – suppress the transcription of both PTEN and PTENP1 in DU145 prostate cancer cells (Figure 1D), and that the depletion of PTEN or PTENP1 led to a statistically significant reduction in the corresponding pseudogene or gene (Figure 2G). Neither of these effects were seen in the Replication Study. There are many possible explanations for this. For example, although both studies used DU145 prostate cancer cells, they did not come from the same batch, so there could be significant genetic differences between them: see Andor et al. (2020) for more on cell lines acquiring mutations during cell cultures. Furthermore, one of the techniques used in both studies – quantitative real-time PCR – depends strongly on the reagents and operating procedures used in the experiments. Indeed, there are no widely accepted standard operating procedures for this technique, despite over a decade of efforts to establish such procedures (Willems et al., 2008; Schwarzenbach et al., 2015).

That is, both the commentary and the replication study seem to subscribe to a view of the world where biology is so rich and complex that both results might be right, conditional on unobserved moderating variables. This is true, but it throws us into a discussion of generalisability. If a result only holds in some genotypes of DU145 prostate cancer cells, which might very well be the case, does it generalise enough to be useful for cancer research?

Power underwhelming

There is another possible view of the world, though … Indeed, biology is rich and complicated, but in the absence of accurate estimates, we don’t know which of all these potential moderating variables actually do anything. The first order of business, before we start imagining scenarios that might explain the discrepancy, is to get a really good estimate of it. How do we do that? It’s hard, but how about starting with a cell size greater than N = 5?

The registered report contains power calculations, which is commendable. As far as I can see, it does not describe how they arrived at the assumed effect sizes. Power estimates for a study design depend on the assumed effect sizes, and small studies tend to exaggerate effect sizes (because an estimated effect has to be large to reach significance in a small study). This means that taking the original estimates as starting effect sizes might leave you with a design that is still unable to detect a true effect of reasonable size.

I don’t know what effect sizes one should expect in these kinds of experiments, but my intuition would be that even if you think that you can get good power with a handful of samples per cell, can’t you please run a couple more? We are all limited by resources and time, but if you’re running something like a qPCR, the cost per sample must be much smaller than the cost for doing one run of the experiment in the first place. It’s really not as simple as adding one row on a plate, but almost.
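
For a sense of scale, base R can tell us what standardised effect size a two-group comparison with five samples per group can detect (a generic t-test calculation, not the registered report’s own power analysis):

## What effect size gives 80% power with n = 5 per group at alpha = 0.05?
power.t.test(n = 5, power = 0.8, sig.level = 0.05)
## delta comes out at about two standard deviations; only very large
## effects are reliably detectable with samples this small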

Literature

Calin, George A. ”Reproducibility in Cancer Biology: Pseudogenes, RNAs and new reproducibility norms.” eLife 9 (2020): e56397.

Hines, William C., et al. ”Sorting out the FACS: a devil in the details.” Cell Reports 6.5 (2014): 779-781.

Kerwin, John, and Israr Khan. ”Replication Study: A coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” eLife 9 (2020): e51019.

Khan, Israr, et al. ”Registered report: a coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” eLife 4 (2015): e08245.

Poliseno, Laura, et al. ”A coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” Nature 465.7301 (2010): 1033-1038.

Journal club of one: ”Evolutionary stalling and a limit on the power of natural selection to improve a cellular module”

This is a relatively recent preprint on how correlations between genetic variants can limit the response to selection, with experimental evolution in bacteria.

Experimental evolution and selection experiments live on the gradient from modelling to observations in the wild. Experimental evolution researchers can design the environments and the genotypes to pose problems for evolution, and then watch in real time as organisms solve them. With sequencing, they can also watch how the genome responds to selection.

In this case, the problem posed is how to improve a particular cellular function (”module”). The researchers started out with engineered Escherichia coli that had one component of their translation machinery manipulated: only one copy of an elongation factor gene (where E. coli normally has two), which could be either from another species, a reconstructed ancestral form, or the regular E. coli gene as a control.

Then, they sequenced samples from replicate populations over time, and looked for potentially adaptive variants: that is, operationally, variants that had large changes in frequency (>20%) and occurred in genes that had more than one candidate adaptive variant.
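
Operationally, that filtering might look something like this sketch (gene names, column names and frequencies are made up; this is not the authors’ code):

## Toy filter: variants with a large frequency change (> 20%) in genes
## with more than one such candidate variant
library(dplyr)

trajectories <- data.frame(
    gene            = c("tufA", "tufA", "fimD", "rpoB"),
    variant         = c("v1", "v2", "v3", "v4"),
    start_frequency = c(0.01, 0.02, 0.05, 0.01),
    end_frequency   = c(0.95, 0.40, 0.90, 0.10)
)

candidate_adaptive <- trajectories %>%
    mutate(frequency_change = abs(end_frequency - start_frequency)) %>%
    filter(frequency_change > 0.2) %>%
    group_by(gene) %>%
    filter(n() > 1) %>%
    ungroup()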

Finally, they looked at what genes these variants occurred in: were they related to translation (”TM-specific”, as they call it) or not (”generic”)? That gave them trajectories of potentially adaptive variants, as in their figure: the horizontal axis is time and the vertical axis the frequency of the variant. The letters are populations of different origin, and the numbers replicates thereof. The colour shows the classification of variants. (”fimD” and ”trkH” in the figure are genes in the ”generic” category that are of special interest for other reasons. The orange–brown shading signifies structural variation at the elongation factor gene.)

This figure shows their main observations:

  • V, A and P populations had more adaptive variants in translation genes, and also worse fitness at the start of the experiment. This goes together with improving more during the experiment. If a population has poor translation, a variant in a translation gene might help. If it has decent translation efficiency already, there is less scope for improvement, and adaptive variants in other kinds of genes happen more often.

    We found that populations whose TMs were initially mildly perturbed (incurring ≲ 3% fitness cost) adapted by acquiring mutations that did not directly affect the TM. Populations whose TM had a moderately severe defect (incurring ~19% fitness cost) discovered TM-specific mutations, but clonal interference often prevented their fixation. Populations whose TMs were initially severely perturbed (incurring ~35% fitness cost) rapidly discovered and fixed TM-specific beneficial mutations.

  • Adaptive variants in translation genes tended to increase fast and early during the experiment, and often got fixed, suggesting that they have larger effects than the generic variants. Again, if your translation capability is badly broken, a large-effect variant in a translation gene might help.

    Out of the 14 TM-specific mutations that eventually fixed, 12 (86%) did so in the first selective sweep. As a result, an average TM-specific beneficial mutation reached fixation by generation 300 ± 52, and only one (7%) reached fixation after generation 600 … In contrast, the average fixation time of generic mutations in the V, A and P populations was 600 ± 72 generations, and 9 of them (56%) fixed after the first selective sweep

  • After one adaptive variant in a translation gene has fixed, adaptation in translation seems to stop at that.

The question is: when there aren’t further adaptive variants in translation genes, is that because it’s impossible to improve translation any further, or because of interference from other variants? They use the term ”evolutionary stalling”, a kind of asexual linked selection. Because variants occur together, selection acts on the net effect of all the variants in an individual. Adaptation in a certain process (in this case translation) might stall if there are large-effect adaptive variants in other, potentially completely unrelated, processes that swamp the effect on translation.

They point to three kinds of indirect evidence that adaptation in translation has stalled in at least some of the populations:

  1. Some of the replicate populations of V didn’t fix adaptive translation variants.
  2. In some populations, there was a second adaptive translation variant, not yet fixed.
  3. There have been adaptive translation mutations in the Long Term Evolution Experiment, which is based on E. coli with unmanipulated translation machinery.

Stalling depends on large-effect variants, but after they have fixed, adaptation might resume. They use the metaphor of natural selection ”shifting focus”. The two non-translation genes singled out in the above figure might be examples of that:

While we did not observe resumption of adaptive evolution in [translation] during the duration of this experiment, we find evidence for a transition from stalling to adaptation in trkH and fimD genes. Mutations in these two genes appear to be beneficial in all our genetic backgrounds (Figure 4). These mutations are among the earliest to arise and fix in E, S and Y populations where the TM does not adapt … In contrast, mutations in trkH and fimD arise in A and P populations much later, typically following fixations of TM-specific mutations … In other words, natural selection in these populations is initially largely focused on improving the TM, while adaptation in trkH and fimD is stalled. After a TM-specific mutation is fixed, the focus of natural selection shifts away from the TM to other modules, including trkH and fimD.

This is all rather indirect, but interesting. Both ”the focus of natural selection shifting” and ”coupling of modules by the emergent neutrality threshold” are inspiring ways to think about the evolution of genetic architecture, and new to me.

Literature

Venkataram, Sandeep, et al. ”Evolutionary Stalling and a Limit on the Power of Natural Selection to Improve a Cellular Module.” bioRxiv (2019): 850644.