”Dangerous gene myths abound”

Philip Ball, who is a knowledgeable and thoughtful science writer, published a piece in the Guardian a couple of months ago about the misunderstood legacy of the human genome project: ”20 years after the human genome was first sequenced, dangerous gene myths abound”.

The human genome project published the draft reference genome for the human species 20 years ago. Ball argues, in short, that the project was oversold with promises that it couldn’t deliver, and consequently has not delivered. Instead, the genome project was good for other things that had more to do with technology development and scientific infrastructure. The sequencing of the human genome was the platform for modern genome science, but it didn’t, for example, cure cancer or uncover a complete set of instructions for building a human.

He also argues that the rhetoric of human genome hype, which did not end with the promotion of the human genome project (see the ENCODE robot punching cancer in the face, for example), is harmful. It is scientifically harmful because it oversimplifies modern genetics, and it is politically harmful because it aligns well with genetic determinism and scientific racism.

I believe Ball is entirely right about this.

Selling out

The breathless hype around the human genome project was embarrassing. Ball quotes some fragments, but you can go to the current human genome project site and enjoy quotes like ”it’s a transformative textbook of medicine, with insights that will give health care providers immense new powers to treat, prevent and cure disease”. This image has some metonymical truth to it — human genomics is helping medical science in different ways — but even as a metaphor, it is obviously false. You can go look at the human reference genome if you want, and you will discover that the ”text”, such as it is, looks more like this than like a medical textbook:

TTTTTTTTCCTTTTTTTTCTTTTGAGATGGAGTCTCGCTCTGCCGCCCAGGCTGGAGTGC
AGTAGCTCGATCTCTGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCCATTCTCCTGCC
TCAGCCTCCTGAGTAGCTGGGACTACAGGCGCCCACCACCATGCCCAGCTAATTTTTTTT
TTTTTTTTTTTGGTATTTTTAGTAGAGACGGGGTTTCACCGTGTTAGCCAGGATGGTCTC
AATCTCCTGACCTTGTGATCCGCCCGCCTCGGCCTCCCACAGTGCTGGGATTACAGGC

This is a human Alu element from chromosome 17. It’s also in an intron of a gene, flanking a promoter, a few hundred basepairs away from an insulator (see the Ensembl genome browser) … All that is stuff that cannot be read from the sequence alone. You might be able to tell that it’s an Alu if you’re an Alu genius or run sequence-recognition software, but there is no way to read the other contextual genomic information, and there is no way you can tell anything about human health by reading it.
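As a toy illustration of how little the raw string gives you: about the only naive thing one can do is search for short motifs, like AGCT, the recognition site of the AluI restriction enzyme that gave Alu elements their name. This sketch just does string matching on the fragment above; real repeat annotation (RepeatMasker and similar tools) aligns sequences against curated consensus libraries, and none of the contextual information comes out of either.

```python
# Naive motif search in the Alu fragment quoted above. AGCT is the
# recognition site of the AluI restriction enzyme, after which Alu
# elements are named. This is string matching, nothing more -- the
# contextual genomic information cannot be recovered this way.
seq = (
    "TTTTTTTTCCTTTTTTTTCTTTTGAGATGGAGTCTCGCTCTGCCGCCCAGGCTGGAGTGC"
    "AGTAGCTCGATCTCTGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCCATTCTCCTGCC"
    "TCAGCCTCCTGAGTAGCTGGGACTACAGGCGCCCACCACCATGCCCAGCTAATTTTTTTT"
    "TTTTTTTTTTTGGTATTTTTAGTAGAGACGGGGTTTCACCGTGTTAGCCAGGATGGTCTC"
    "AATCTCCTGACCTTGTGATCCGCCCGCCTCGGCCTCCCACAGTGCTGGGATTACAGGC"
)

positions = [i for i in range(len(seq)) if seq.startswith("AGCT", i)]
print(f"AGCT occurs at positions: {positions}")
```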

I think Ball is right that this is part of a simplistic genetics that doesn’t appreciate the complexity of either quantitative or molecular genetics. In short, quantitative genetics, as a framework, says that inheritance of traits between relatives is due to thousands and thousands of genetic differences, each of them with tiny effects. Molecular genetics says that each of those genetic differences may operate through any of a dizzying selection of Rube Goldberg-esque molecular mechanisms, to the point where understanding one of them might be a lifetime of laboratory investigation.

Simple inheritance is essentially a fiction, or put more politely: a simple model that is useful as a step towards building up a better picture of inheritance. This is not new; the knowledge that everything of note is complex has been around since the beginning of genetics. Even rare genetic diseases understood as monogenic are sometimes caused by thousands of different variants that happen in a particular small subset of the genome. Really simple traits, in the sense of one variant–one phenotype, seldom happen in large mixing and migrating populations like humans; they may occur in crosses constructed in the lab, in extremely structured populations like dog breeds, or possibly under balancing selection.
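The quantitative-genetic picture is easy to make concrete with a small simulation (all numbers here, the locus count, allele frequency and effect sizes, are invented for illustration): a trait built from a thousand loci with tiny additive effects comes out as a smooth, approximately normal distribution, with no single locus mattering much.

```python
# Toy polygenic trait: 1,000 loci, each with a tiny additive effect.
# Allele frequencies and effect sizes are made up for illustration.
import math
import random
import statistics

random.seed(1)
n_loci = 1000
p = 0.5                           # allele frequency at every locus
beta = 1 / math.sqrt(n_loci)      # tiny per-locus effect

def trait_value():
    """Sum of small effects over genotypes drawn under Hardy-Weinberg."""
    total = 0.0
    for _ in range(n_loci):
        genotype = sum(random.random() < p for _ in range(2))  # 0, 1 or 2
        total += beta * (genotype - 2 * p)                     # centred
    return total

population = [trait_value() for _ in range(2000)]
print(f"mean {statistics.mean(population):.3f}, "
      f"variance {statistics.variance(population):.3f}")
```

The expected genetic variance here is n_loci × beta² × 2p(1 − p) = 0.5, and the simulated population lands close to it, as the framework predicts.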

Can you market thick sequencing?

Ball is also right about what was most useful about the human genome project: it enabled research at scale into human genetic variation, and it stimulated development of sequencing methods, both generating and using DNA sequence. Lowe (2018) talks about ”thick” sequencing, a notion of sequencing that includes associated activities like assembly, annotation and distribution to a community of researchers — on top of ”thin” sequencing as determination of sequences of base pairs. Thick sequencing better captures how genome sequencing is used and stimulates other research, and aligns with how sequencing is an iterative process, where reference genomes are successively refined and updated in the face of new data, expert knowledge and quality checking.

It is hard to imagine gene editing like CRISPR being applied in any human cell without a good genome sequence to help find out what to cut out and what to put in its place. It is hard to imagine the developments in functional genomics that all use short read sequencing as a read-out without a good genome sequence to anchor the reads on. It is possible to imagine genome-wide association just based on very dense linkage maps, but it is a bit far-fetched. And so on.

Now, this raises a painful but interesting question: Would the genome project ever have gotten funded on reasonable promises and reasonable uncertainties? If not, how do we feel about the genome hype — necessary evil, unforgivable deception, something in-between? Ball seems to think that gene hype was an honest mistake and that scientists were surprised that genomes turned out to be more complicated than anticipated. Unlike him, I do not believe that most researchers honestly believed the hype — they must have known that they were overselling like crazy. They were no fools.

An example of this is the story about how many genes humans have. Ball writes:

All the same, scientists thought genes and traits could be readily matched, like those children’s puzzles in which you trace convoluted links between two sets of items. That misconception explains why most geneticists overestimated the total number of human genes by a factor of several-fold – an error typically presented now with a grinning “Oops!” rather than as a sign of a fundamental error about what genes are and how they work.

This is a complicated history. Gene number estimates are varied, but enjoy this passage from Lewontin in 1977:

The number of genes is not large

While higher organisms have enough DNA to specify from 100,000 to 1,000,000 proteins of average size, it appears that the actual number of cistrons does not exceed a few thousand. Thus, saturation lethal mapping of the fourth chromosome (Hochman, 1971) and the X chromosome (Judd, Shen and Kaufman, 1972) of Drosophila melanogaster make it appear that there is one cistron per salivary chromosome band, of which there are 5,000 in this species. Whether 5,000 is a large or small number of total genes depends, of course, on the degree of interaction of various cistrons in influencing various traits. Nevertheless, it is apparent that either a given trait is strongly influenced by only a small number of genes, or else there is a high order of gene interactions among developmental systems. With 5,000 genes we cannot maintain a view that different parts of the organism are both independent genetically and each influenced by a large number of gene loci.

I don’t know if underestimating by a few fold is worse than overestimating by a few fold (D. melanogaster has 15,000 protein-coding genes or so), but the point is that knowledgeable geneticists did not go around believing that there was a simple 1-to-1 mapping between genes and traits, or even between genes and proteins, at this time. I know Lewontin is a population geneticist, and in the popular mythology population geneticists are nothing but single-minded bean counters who do not appreciate the complexity of molecular biology … but you know, they were no fools.

The selfish cistron

One thing Ball gets wrong is evolutionary genetics, where he mixes genetic concepts that, really, have very little to do with each other despite superficially sounding similar.

Yet plenty remain happy to propagate the misleading idea that we are “gene machines” and our DNA is our “blueprint”. It is no wonder that public understanding of genetics is so blighted by notions of genetic determinism – not to mention the now ludicrous (but lucrative) idea that DNA genealogy tells you which percentage of you is “Scots”, “sub-Saharan African” or “Neanderthal”.

This passage smushes two very different parts of genetics together, that don’t belong together and have nothing to do with the preceding questions about how many genes there are or if the DNA is a blueprint: The gene-centric view of adaptation, a way of thinking of natural selection where you imagine genetic variants (not organisms, not genomes, not populations or species) as competing for reproduction; and genetic genealogy and ancestry models, where you describe how individuals are related based on the genetic variation they carry. The gene-centric view is about adaptation, while genetic genealogy works because of effectively neutral genetics that just floats around, giving us a unique individual barcode due to the sheer combinatorics.
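The ”barcode” point really is just combinatorics. Under a deliberately crude back-of-the-envelope model (independent loci, Hardy-Weinberg genotype frequencies, allele frequency 0.5 everywhere; all assumptions invented for illustration), the probability that two unrelated individuals share genotypes across even a modest number of variants collapses towards zero:

```python
# Probability that two unrelated individuals carry identical genotypes
# at k independent biallelic loci, assuming Hardy-Weinberg proportions
# and allele frequency 0.5 at every locus (a deliberately crude model).
genotype_freqs = [0.25, 0.5, 0.25]   # aa, Aa, AA at p = 0.5

def match_probability(k):
    per_locus = sum(f * f for f in genotype_freqs)   # = 0.375 per locus
    return per_locus ** k

for k in (1, 10, 100):
    print(f"{k:>3} loci: {match_probability(k):.3g}")
```

A hundred loci is nothing by genotyping-chip standards, and the match probability is already astronomically small; no selection is needed for genotypes to individuate people.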

He doesn’t elaborate on the gene machines, but it links to a paper (Ridley 1984) on Williams’ and Dawkins’ ”selfish gene” or ”gene-centric perspective”. I’ve been on about this before, but when evolutionary geneticists say ”selfish gene”, they don’t mean ”the selfish protein-coding DNA element”; they mean something closer to ”the selfish allele”. They are not committed to any view that the genome is a blueprint, or that only protein-coding genes matter to adaptation, or that there is a 1-to-1 correspondence between genetic variants and traits.

This is the problem with correcting misconceptions in genetics: it’s easy to chide others for being confused about the parts you know well, and then make a hash of some other parts that you don’t know very well yourself. Maybe when researchers say ”gene” in a context that doesn’t sound right to you, they have a different use of the word in mind … or they’re conceptually confused fools, who knows.

Literature

Lewontin, R. C. (1977). The relevance of molecular biology to plant and animal breeding. In International Conference on Quantitative Genetics. Ames, Iowa (USA). 16-21 Aug 1976.

Lowe, J. W. (2018). Sequencing through thick and thin: Historiographical and philosophical implications. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 72, 10-27.

The Fulsome Principle

”If it be aught to the old tune, my lord,
It is as fat and fulsome to mine ear
As howling after music.”
(Shakespeare, Twelfth Night)

There are problematic words that can mean opposite things, I presume either because two once different expressions meandered in the space of meaning until they were nigh indistinguishable, or because something that already had a literal meaning went and became ironic. We can think of our favourites, like a particular Swedish expression that either means that you will get paid or not, or ”fulsome”. Is ”fulsome praise” a good thing (Merriam Webster Usage Note)?

Better to avoid the ambiguity. I think this makes a fitting name for an observation about language use.

The Fulsome Principle: smart people will gladly ridicule others for breaking supposed rules that are in fact poorly justified.

That is, if we take any strong position about what ”fulsome” means that doesn’t acknowledge the ambiguity, we are following a rule that is poorly justified. If we make fun of anyone for getting the rule wrong, condemning them as misusing and degrading the English language, we are embarrassingly wrong. We are also in the company of many other smart people who snicker before checking the dictionary. It could also be called the Strunk & White principle.

This is related to:

The Them Principle: If you think something sounds like a novel misuse and degradation of language, chances are it’s in Shakespeare.

This has everything to do with language use in science. How many times have you heard geneticists or evolutionary biologists haranguing some outsider to their field, a science writer or a student, for misusing ”gene”, ”fitness”, ”adaptation” or similar? I would suspect: Many. How many times was the usage, in fact, in line with how the word is used in an adjacent sub-subfield? I would suspect: Also, many.

In ”A stylistic note” at the beginning of his book The Limits of Kindness (Hare 2013), philosopher Caspar Hare writes:

I suspect that quite often, when professional philosophers use specialized terms, they have subtly different senses of those terms in mind.

One example, involving not-so-subtly-different senses of a not-very-specialized term: We talk a great deal about biting the bullet. For example, ”I confronted David Lewis with my damning objection to modal realism, and he bit the bullet.” I have asked a number of philosophers about what, precisely, this means, and received a startling range of replies.

Around 70% say that the metaphor has to do with surgery. … So, in philosophy, for you to bite the bullet is for you to grimly accept seemingly absurd consequences of the theory you endorse. This is, I think, the most widely understood sense of the term.
/…/

Some others say that the metaphor has to do with injury … So in philosophy, for you to acknowledge that you are biting the bullet is for you to acknowledge that an objection has gravely wounded your theory.

/…/

One philosopher said to me that the metaphor has to do with magic. To bite a bullet is to catch a bullet, Houdini-style, in your teeth. So, in philosophy, for you to bite the bullet is for you to elegantly intercept a seemingly lethal objection and render it benign.

/…/

I conclude from my highly unscientific survey that, more than 30 percent of the time, when a philosopher claims to be ”biting the bullet,” a small gap opens up between what he or she means and what his or her reader or listener understands him or her to mean.

And I guess these small gaps in understanding are more common than we normally think.

Please, don’t go back to my blog archive and look for cases of me railing against someone’s improper use of scientific language, because I’m sure I’ve done it too many times. Mea maxima culpa.

The word ”genome”

The sources I’ve seen attribute the coinage of ”genome” to botanist Hans Winkler (1920, p. 166).

The pertinent passage goes:

Ich schlage vor, für den haploiden Chromosomensatz, der im Verein mit dem zugehörigen Protoplasma die materielle Grundlage der systematischen Einheit darstellt den Ausdruck: das Genom zu verwenden … I suggest to use the expression ”the genome” for the haploid set of chromosomes, which together with the protoplasm it belongs with make up the material basis of the systematic unit …

That’s good, but why did Winkler need this term in the first place? In this chapter, he is dealing with the relationship between chromosome number and mode of reproduction. Of course, he’s going to talk about hybridization and ploidy, and he needs some terms to bring order to the mess. He goes on to coin a couple of other concepts that I had never heard of:

… und Kerne, Zellen und Organismen, in denen ein gleichartiges Genom mehr als einmal in jedem Kern vorhanden ist, homogenomatisch zu nennen, solche dagegen, die verschiedenartige Genome im Kern führen, heterogenomatisch.

So, a homogenomic organism has more than one copy of the same genome in its nuclei, while a heterogenomic organism has multiple genomes. He also suggests you could count the genomes, di-, tri- up to polygenomic organisms. He says that this is a different thing than polyploidy, which is when an organism has multiples of a haploid chromosome set. Winkler’s example: A hybrid between a diploid species with 10 chromosomes and another diploid species with 16 chromosomes might have 13 chromosomes and be polygenomic but not polyploid.

These terms don’t seem to have stuck as much, but I found them used here and there, for example in papers on bananas (Arvanitoyannis et al. 2008) and cotton (Brown & Menzel 1952); cooking bananas are heterogenomic.

This only really makes sense in cases of recent hybridisation, where you can trace different chromosomes to origins in different species. You need to be able to trace parts of the hybrid genome of the banana to genomes of other species. Otherwise, the genome of the banana is just the genome of the banana.

Analogously, we also find polygenomes in this cancer paper (Navin et al. 2010):

We applied our methods to 20 primary ductal breast carcinomas, which enable us to classify them according to whether they appear as either monogenomic (nine tumors) or polygenomic (11 tumors). We define ”monogenomic” tumors to be those consisting of an apparently homogeneous population of tumor cells with highly similar genome profiles throughout the tumor mass. We define ”polygenomic” tumors as those containing multiple tumor subpopulations that can be distinguished and grouped by similar genome structure.

This makes sense; if a tumour has clones of cells in it with sufficiently rearranged genomes, maybe it is fair to describe it as a tumour with different genomes. It raises the question of what is ”sufficiently” different for something to count as a different genome.

How much difference can there be between sequences that are supposed to count as the same genome? In everything above, we have taken a kind of typological view: there is a genome of an individual, or a clone of cells, that can be thought of as one entity, despite the fact that every copy of it, in every different cell, is likely to have subtle differences. Philosopher John Dupré (2010), in ”The Polygenomic Organism”, questions what we mean by ”the genome” of an organism. How can we talk about an organism having one genome or another, when in fact, every cell in the body goes through mutation (actually, Dupré spends surprisingly little time on somatic mutation and more on epigenetics, but makes a similar point), sometimes chimerism, sometimes programmed genome rearrangements?

The genome is related to types of organism by attempts to find within it the essence of a species or other biological kind. This is a natural, if perhaps naïve, interpretation of the idea of the species ‘barcode’, the use of particular bits of DNA sequence to define or identify species membership. But in this paper I am interested rather in the relation sometimes thought to hold between genomes of a certain type and an individual organism. This need not be an explicitly essentialist thesis, merely the simple factual belief that the cells that make up an organism all, as a matter of fact, have in common the inclusion of a genome, and the genomes in these cells are, barring the odd collision with a cosmic ray or other unusual accident, identical.

Dupré’s answer is that there probably isn’t a universally correct way to divide living things into individuals, and what concept of individuality one should use really depends on what one wants to do with it. I take this to mean that it is perfectly fine to gloss over real biological details, but that we need to keep in mind that they might unexpectedly start to matter. For example, when tracing X chromosomes through pedigrees, it might be fine to ignore that X-inactivation makes female mammals functionally mosaic, until you start looking at the expression of X-linked traits.

Photo of calico cat in Amsterdam by SpanishSnake (CC0 1.0). See, I found a reason to put in a cat picture!

Finally, the genome exists not just in the organism, but also in the computer, as sequences, maps and obscure bioinformatics file formats. Arguably, keeping the discussion above in mind, the genome only exists in the computer, as a scientific model of a much messier biology. Szymanski, Vermeulen & Wong (2019) investigate what the genome is by looking at how researchers talk about it. ”The genome” turns out to be many things to researchers. Here they are writing about what happened when the yeast genetics community created a reference genome.

If the digital genome is not assumed to be solely a representation of a physical genome, we might instead see ”the genome” as a discursive entity moving from the cell to the database but without ever removing ”the genome” from the cell, aggregating rather than excluding. This move and its inherent multiplying has consequences for the shape of the community that continues to participate in constructing the genome as a digital text. It also has consequences for the work the genome can perform. As Chadarevian (2004) observes for the C. elegans genome sequence, moving the genome from cell to database enables it to become a new kind of mapping tool …

/…/

Consequently, the informational genome can be used to manufacture coherence across knowledge generated by disparate labs by making it possible to line up textual results – often quite literally, in the case of genome sequences as alphabetic texts — and read across them.

/…/

Prior to the availability of the reference genome, such coherence across the yeast community was generated by strain sharing practices and standard protocols and notation for documenting variation from the reference strain, S288C, authoritatively embodied in living cells housed at Mortimer’s stock center. After the sequencing project, part of that work was transferred to the informational, textual yeast genome, making the practice of lining up and making the same available to those who worked with the digital text as well as those who worked with the physical cell.

And that brings us back to Winkler: What do all these uses of ”the genome” have in common? That the genome makes up the basis for the systematic unit, that it belongs to organisms that we recognize as closely related enough to form a systematic unit.

Literature

Winkler H. (1920) Verbreitung und Ursache der Parthenogenesis im Pflanzen- und Tierreiche.

Arvanitoyannis, I. S., et al. (2008). Banana: cultivars, biotechnological approaches and genetic transformation. International Journal of Food Science & Technology, 43(10), 1871-1879.

Navin, N., et al. (2010). Inferring tumor progression from genomic heterogeneity. Genome Research, 20(1), 68-80.

Brown, M. S., & Menzel, M. Y. (1952). Polygenomic hybrids in Gossypium. I. Cytology of hexaploids, pentaploids and hexaploid combinations. Genetics, 37(3), 242.

Dupré, J. (2010). The polygenomic organism. The Sociological Review, 58(1_suppl), 19-31.

Szymanski, E., Vermeulen, N., & Wong, M. (2019). Yeast: one cell, one reference sequence, many genomes? New Genetics and Society, 38(4), 430-450.

Things that really don’t matter: megabase or megabasepair

Should we talk about physical distance in genetics as number of base pairs (kbp, Mbp, and so on) or bases (kb, Mb)?

I got into a discussion about this recently, and I said I’d continue the struggle on my blog. Here it is. Let me first say that I don’t think this matters at all, and if you make a big deal out of this (or whether ”data” can be singular, or any of those inconsequential matters of taste we argue about for amusement), you shouldn’t. See this blog post as an exorcism, helping me not to trouble my colleagues with my issues.

What I’m objecting to is mostly the inconsistency of talking about long stretches of nucleotides as ”kilobase” and ”megabase” but talking about short stretches as ”base pairs”. I don’t think it’s very common to call a 100 nucleotide stretch ”a 100 b sequence”; I would expect ”100 bp”. For example, if we look at Ensembl, they might describe a large region as 1 Mb, but if you zoom in a lot, they give length in bp. My impression is that this is a common practice. However, if you consistently use ”bases” and ”megabase”, more power to you.

Unless you’re writing a very specific kind of bioinformatics paper, the risk of confusion with the computer storage unit isn’t a problem. But there are some biological arguments.

A biological argument for ”base” might be that we care about the identity of the base, not the base pairing. We write down only one base per position when we write a nucleic acid sequence. The base pair is a different thing: a base bound to its partner on the other strand, or, if the DNA or RNA is single-stranded, the base is not paired at all.

Conversely, a biochemical argument for ”base pair” might be that in a double-stranded molecule, the base pair is the relevant informational unit. We may write only one base of each pair in our nucleotide sequence for convenience, but because of the rules of base pairing, we know its complement. In this case, maybe we should reserve ”base” for single-stranded molecules.

If we consult two more or less trustworthy sources, the Encyclopedia of Life Sciences and Wiktionary, they both seem to take the latter view.

eLS says:

A megabase pair, abbreviated Mbp, is a unit of length of nucleic acids, equal to one million base pairs. The term ‘megabase’ (or Mb) is commonly used interchangeably, although strictly this would refer to a single-stranded nucleic acid.

Wiktionary says:

A length of nucleic acid containing one million nucleotides (bases if single-stranded, base pairs if double-stranded)
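The convention in those definitions is mechanical enough to capture in a tiny helper (a hypothetical sketch, just to make the rule concrete): ”bp” for double-stranded molecules, ”b” for single-stranded, with the usual metric prefixes.

```python
# Format a sequence length following the convention quoted above:
# base pairs (bp) for double-stranded, bases (b) for single-stranded.
def format_length(n, double_stranded=True):
    unit = "bp" if double_stranded else "b"
    if n < 1_000:
        return f"{n} {unit}"
    for scale, prefix in [(1_000_000_000, "G"), (1_000_000, "M"), (1_000, "k")]:
        if n >= scale:
            return f"{n / scale:.2f} {prefix}{unit}"

print(format_length(100))                            # a short stretch
print(format_length(1_000_000))                      # a long region
print(format_length(1_500_000, double_stranded=False))
```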

Please return next week for the correct pronunciation of ”loci”.

Literature

Dear, P.H. (2006). Megabase Pair (Mbp). eLS.

If research is learning, how should researchers learn?

I’m taking a course on university pedagogy to, hopefully, become a better teacher. While reading about students’ learning and what teachers ought to do to facilitate it, I couldn’t help thinking about researchers’ learning, and what we ought to do to give ourselves a good learning environment.

Research is, largely, learning. First, a large part of any research work is learning what is already known, just not yet by us in particular; it’s a direct continuation of the learning that takes place in courses. While doing any research project, we learn the concepts other researchers use in this specific sub-subfield, and the relations between them. First to the extent that we can orient ourselves, and eventually to be able to make a contribution that is intelligible to others who work there. We also learn their priorities, attitudes and platitudes. (Seriously, I suspect you learn a lot about a sub-subfield by trying to make jokes about it.) We also learn to do something new: perform a laboratory procedure, a calculation, or something like that.

But more importantly, research is learning about things no-one knows yet. The idea of constructivist learning theory seems apt: We are constructing new knowledge, building on pre-existing structures. We don’t go out and read the book of nature; we take the concepts and relations of our sub-subfield of choice, and graft, modify and rearrange them into our new model of the subject.

If there is something to this, it means that old clichéd phrases like ”institution of higher learning”, scientists as ”students of X”, and so on, name a deeper analogy than it might seem. It also suggests that innovations in student learning might also be good building blocks for research group management. Should we be concept mapping with our colleagues to figure out where we disagree about the definition of ”developmental pleiotropy”? It also makes one wonder why meetings and departmental seminars often take the form of sage on the stage lectures.

X-related genes

It is hard to interpret gene lists. But before we even get into the statistical properties of annotation term enrichment, or whether network models are appropriate, or anything like that, we have the simpler problem of how to talk, colloquially, about genes connected with a biological process. In particular, there is a weak way of describing gene function that one ought to avoid.

What is, for example, an immune-related gene? Why, it’s a gene that is important to immune function, of course! Is beta-catenin an immune-related gene? Wnt signalling is certainly important to immune cell differentiation (Chae & Bothwell 2018), and beta-catenin is certainly important to Wnt signalling function.

Similarly, Paris is a city in France. Therefore, all cities in France are Paris-related.

The thing is, any indirect mechanism can be a mechanism of genuine genetic causation, and this one isn’t even very roundabout. I couldn’t find a known Mendelian disorder with a mechanism that fit the above story, but I don’t think it’s out of the question. At the same time, labeling everything Wnt ”immune-related” would be a little silly, because those genes also do all sorts of other things. If the omnigenic hypothesis of near-universal pleiotropy is correct, we should expect a lot of genetic causation to be like that: indirect, based on common pathways that do many kinds of different work in different parts of the organism.
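If ”X-related” is allowed to propagate through any connection, the label spreads transitively until it covers almost everything. A toy graph makes the point; every gene, process and edge here is made up for illustration:

```python
# Hypothetical annotation graph: an edge means "is connected to".
# If "immune-related" spreads along any chain of edges, genes with
# only indirect links end up labelled too. All names are illustrative.
edges = {
    "beta-catenin": ["Wnt signalling", "cell adhesion"],
    "Wnt signalling": ["immune cell differentiation", "bone development"],
    "cell adhesion": [],
    "immune cell differentiation": [],
    "bone development": [],
}

def reaches(start, target):
    """Depth-first search: is target reachable from start?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return False

immune_related = sorted(n for n in edges
                        if reaches(n, "immune cell differentiation"))
print(immune_related)
```

Here beta-catenin comes out ”immune-related” purely through the chain via Wnt signalling, exactly the Paris-is-in-France move; and the same logic would make it ”bone-related” too.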

That leaves X-related genes a vague notion that can contract or expand at will. From now on, I will think twice before using it.

‘Approaches to genetics for livestock research’ at IASH, University of Edinburgh

A couple of weeks ago, I was at a symposium on the history of genetics in animal breeding at the Institute of Advanced Studies in the Humanities, organized by Cheryl Lancaster. There were talks by two geneticists and two historians, and ample time for discussion.

First geneticists:

Gregor Gorjanc presented the very essence of quantitative genetics: the pedigree-based model. He illustrated this with graphs (in the sense of edges and vertices) and by predicting his own breeding value for height from trait values, and from his personal genomics results.

Then, yours truly gave this talk: ‘Genomics in animal breeding from the perspectives of matrices and molecules’. Here are the slides (only slightly mangled by Slideshare). This is the talk I was preparing for when I collected the quotes I posted a couple of weeks ago.

I talked about how there are two perspectives on genomics: you can think of genomes either as large matrices of ancestry indicators (statistical perspective) or as long strings of bases (sequence perspective). Both are useful, and give animal breeders and breeding researchers different tools (genomic selection, reference genomes). I also talked about potential future breeding strategies that use causative variants, and how they’re not about stopping breeding and designing the perfect animal in a lab, but about supplementing genomic selection in different ways.
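The ”matrix” perspective can be made concrete with a minimal SNP-BLUP-style sketch (all dimensions, variances and the ridge penalty are invented for illustration, and this is ridge regression standing in for the full mixed-model machinery): treat genotypes as a big numeric matrix, simulate a trait from many small effects, and get estimated breeding values by shrinkage regression.

```python
# Minimal sketch of the statistical (matrix) view of genomics:
# phenotypes regressed on a genotype matrix with ridge shrinkage,
# in the spirit of SNP-BLUP. All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_individuals, n_snps = 100, 500

# Genotype matrix: allele counts 0/1/2, centred per SNP
X = rng.binomial(2, 0.5, size=(n_individuals, n_snps)).astype(float)
X -= X.mean(axis=0)

true_effects = rng.normal(0, 1 / np.sqrt(n_snps), size=n_snps)
genetic_values = X @ true_effects
y = genetic_values + rng.normal(0, 0.5, size=n_individuals)  # add noise

lam = 10.0  # ridge penalty, playing the role of BLUP shrinkage
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(n_snps), X.T @ y)
estimated_breeding_values = X @ beta_hat

r = np.corrcoef(estimated_breeding_values, genetic_values)[0, 1]
print(f"correlation with simulated genetic values: {r:.2f}")
```

Nothing in this sketch needs to know what any SNP does molecularly, which is the point of the matrix perspective; the sequence perspective is what you reach for when you want to know.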

Then, historians:

Cheryl Lancaster told the story of how ABGRO, the Animal Breeding and Genetics Research Organisation in Edinburgh, lost its G. The organisation was split up in the 1950s, separating fundamental genetics research and animal breeding. She said that she had expected this split to be due to scientific, methodological or conceptual differences, but instead found, when going through the archives, that it was all due to personal conflicts. She also got into how the ABGRO researchers justified their work, framing it as ”fundamental research”, and aspired to long-term research projects.

Jim Lowe talked about the pig genome sequencing and mapping efforts, how it was different from the human genome project in organisation, and how it used comparisons to the human genome a lot. Here he’s showing a photo of Alan Archibald using the gEVAL genome browser to quality-check the pig genome. He also argued that the infrastructural outcomes of a project like the human genome project, such as making it possible for pig genome scientists to use the human genome for comparisons, are more important and less predictable than usually assumed.

The discussion included comments by some of the people who were there (Chris Haley, Bill Hill), discussion about the breed concept, and what scientists can learn from history.

What is a breed? Is it a genetical thing, defined by grouping individuals based on their relatedness, a historical thing, based on what people think a certain kind of animal is supposed to look like, or a marketing tool, naming animals that come from a certain system? It is probably a bit of everything. (I talked with Jim Lowe during lunch; he had noticed how I referred to Griffith & Stotz for gene concepts, but omitted the ”post-genomic” gene concept they actually favour. This is because I didn’t find it useful for understanding how animal breeding researchers think. It is striking how comfortable biologists are with using fuzzy concepts that can’t be defined in a way that covers all corner cases, because biology doesn’t work that way. If the nominal gene concept is broken by trans-splicing, practicing genomicists will probably think of that more as a practical issue with designing gene databases than as something that invalidates talking about genes in principle.)

What would researchers like to learn from history? Probably how to succeed with large research endeavors and how to get funding for them. Can one learn that from history? Maybe not, but there might be lessons about thinking of research as ”basic”, ”fundamental”, ”applied” etc., and about what the long term effects of research might be.

Greek in biology

This is a fun essay about biological terms borrowed from or inspired by Greek, written by a group of (I presume) Greek speakers: Iliopoulos & al (2019), Hypothesis, analysis and synthesis, it’s all Greek to me.

We hope that this contribution will encourage scientists to think about the terminology used in modern science, technology and medicine (Wulff, 2004), and to be more careful when seeking to introduce new words and phrases into our vocabulary.

First, I like how they celebrate the value of knowing more than one language. I feel like bi- and multilingualism in science is most often discussed as a problem: Either we non-native speakers have problems catching up with the native speakers, or we’re burdening them with our poor writing. Here, the authors seem to argue that knowing another language (Greek) helps both your understanding of scientific language, and the style and grace with which you use it.

I think this is the central argument:

Non-Greek speakers will, we are sure, be surprised by the richness and structure of the Greek language, despite its often inept naturalization in English or other languages, and as a result be better able to understand their own areas of science (Snell, 1960; Montgomery, 2004). Our favorite example is the word ‘analysis’: everyone uses it, but few fully understand it. ‘Lysis’ means ‘breaking up’, while ‘ana-‘ means ‘from bottom to top’ but also ‘again/repetitively’: the subtle yet ingenious latter meaning of the term implies that if you break up something once, you might not know how it works; however, if you break up something twice, you must have reconstructed it, so you must understand the inner workings of the system.

I’m sure it is true that some of the use of Greek-inspired terms in scientific English is inept, and would benefit from checking by someone who knows Greek. However, this passage invites two objections.

First, why would anyone think that the Greek language has less richness and structure than English? Then again, if I learned Greek, I might well find it even richer than I expected.

Second, does knowing Greek mean that you have a deeper appreciation for the nuances of a concept like analysis? Maybe ‘analysis’ as understood without those double meanings of the ‘ana-‘ prefix is less exciting, but if it is true that most people don’t know about this subtlety, this can’t be what they mean by ‘analysis’. So, if that etymological understanding isn’t part of how most people use the word, do we really understand it better by learning that story? It sounds like they think that the word is supposed to have a true meaning separate from how it is used, and I’m not sure that is helpful.

So what are some less inept uses of Greek? They like the term ‘epigenomics’, writing that it is being ‘introduced in a thoughtful and meaningful way’. To me, this seems like an unfortunate example, because I can think of few terms in genomics that cause more confusion. ‘Epigenomics’ is the upgraded version of ‘epigenetics’, a word which was, unfortunately, coined at least twice with different meanings. And now, epigenetics is this two-headed beast that feeds on geneticists’ energy as they try to understand what on earth other geneticists are saying.

First, Conrad Waddington glued ‘epigenesis’ and ‘genetics’ together to define epigenetics as ‘the branch of biology that studies the causal interactions between genes and their products which bring the phenotype into being’ (Waddington 1942, quoted in Deans & Maggert 2015). That is, it is what we today might call developmental genetics. Later, David Nanney connected it to gene regulatory mechanisms that are stable through cell division, and we get the modern view of epigenetics as a layer of regulatory mechanisms on top of the DNA sequence. I would be interested to know which of these two intertwined meanings it is that the authors like.

Judging by the affiliations of the authors, the classification of the paper (by the way, how is this ‘computational and systems biology, genetics and genomics’, eLife?), and the citations (16 of 27 to medicine and science journals, a lot of which seems to be similar opinion pieces), this feels like a missed opportunity to connect with language scholarship. I’m no better myself–I’m not a scholar of language, and I haven’t tried to invite one to co-write this blog post with me … But there must be scholarship and expertise outside biomedicine relevant to this topic, and language sources richer than an etymological online dictionary?

Finally, the table of new Greek-inspired terms that ‘might be useful’ is a fun thought exercise, and if it serves as inspiration for someone to have a eureka moment about a concept they need to investigate, great (‘… but what is a katagenome, really? Oh, maybe …’). But I think that telling scientists to coin new words is inviting catastrophe. I’d much rather take the lesson that we need fewer tortured terms borrowed from Greek, not more of them. It’s as if I, driven by the nuance and richness I recognise in my own first language, set out to coin övergenome, undergenome and pågenome.

‘We have reached peak gene, and passed it’

Ken Richardson recently published an opinion piece about genetics titled ‘It’s the end of the gene as we know it‘. And I feel annoyed.

The overarching point of the piece is that there have been ‘radical revisions of the gene concept’ and that they ‘need to reach the general public soon—before past social policy mistakes are repeated’. He argues, among other things, that:

  • headlines like ‘being rich and successful is in your DNA’ are silly;
  • polygenic scores for complex traits have limited predictive power and problems with population structure;
  • the classical concept of the ‘gene’ has been undermined by molecular biology, which means that genetic mapping and genomic prediction are conceptually flawed.

You may be able to guess which of these arguments make me cheer and which make me annoyed.

There is a risk, when you write a long list of arguments, that if you make some good points and some weak points, no-one will remember anything but the weak points. Let us look at what I think are some good points, and the main weak one.

Gene-as-variant versus gene-as-sequence

I think Richardson is right that there is a difference in how classical genetics, including quantitative genetics, conceives of a ‘gene’, and what a gene is to molecular biology. This is the same distinction as Griffith & Stotz (2013), Portin & Wilkins (2017), and I’m sure many others have written about. (Personally, I used to call it ‘gene(1)’ and ‘gene(2)’, but that is useless; even I can’t keep track of which is supposed to be one and two. Thankfully, that terminology didn’t make it to the blog.)

In classical terms, the ‘gene’ is a unit of inheritance. It’s something that causes inherited differences between individuals, and it’s only observed indirectly, through crosses and differences between relatives. In molecular terms, a ‘gene’ is a piece of DNA that has a name and, optionally, some function. These two things are not the same. The classical gene fulfills a different function in genetics than the molecular gene. Classical genes are explained by molecular mechanisms, but they are not reducible to molecular genes.

That is, you can’t just take statements in classical genetics and substitute ‘piece of DNA’ for ‘gene’ and expect to get anything meaningful. Unfortunately, this seems to be what Richardson wants to do, and this inability to appreciate classical genes for what they are is why the piece goes astray. But we’ll return to that in a minute.

A gene for hardwiring in your DNA

I also agree that a lot of the language that we use around genetics, casually and in the media, is inappropriate. Sometimes it’s silly (when reacting positively to animals, believing in God, or whatever is supposed to be ‘hard-wired in our DNA’) and sometimes it’s scary (like when a genetic variant was dubbed ‘The Warrior Gene’ on flimsy grounds and tied to speculations about Maori genetics). Even serious geneticists who should know better will put out press releases where this or that is ‘in your DNA’, and the literature is full of ‘genes for’ complex traits that have at best small effects. This is an area where both researchers and communicators should shape up.

Genomic prediction is hard

Polygenic scores are one form of genomic prediction, that is: one way to predict individuals’ trait values from their DNA. It goes something like this: you collect trait values and perform DNA tests on some reference population, then fit a statistical model that tells you which genetic variants differ between individuals with high and low trait values. Then you take that model and apply it to some other individuals, whose values you want to predict. There are a lot of different ways to do this, but they all amount to estimating how much each variant contributes to the trait, and somehow adding that up.
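The recipe above can be sketched in a few lines of simulation. This is a deliberately naive toy version (all numbers are made up, and the per-variant regression stands in for the far more careful models actually used):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 1,000 reference individuals typed at 200 biallelic
# variants (genotypes coded 0/1/2), and a trait built from many small
# additive genetic effects plus environmental noise.
n_ref, n_variants = 1000, 200
freqs = rng.uniform(0.1, 0.9, n_variants)
genotypes = rng.binomial(2, freqs, size=(n_ref, n_variants))
true_effects = rng.normal(0, 0.1, n_variants)
trait = genotypes @ true_effects + rng.normal(0, 1, n_ref)

# 'Fit a statistical model': here a naive per-variant regression of trait
# on genotype; real methods shrink and fit all variants jointly.
centered = genotypes - genotypes.mean(axis=0)
est_effects = (centered * (trait - trait.mean())[:, None]).sum(axis=0) \
    / (centered ** 2).sum(axis=0)

# Apply the model to new individuals: the polygenic score is a weighted
# sum of genotypes, one estimated weight per variant.
new_geno = rng.binomial(2, freqs, size=(100, n_variants))
scores = new_geno @ est_effects
```

In real data, linkage disequilibrium, population structure and shared environment all conspire to make the estimated weights much less trustworthy than in this idealised simulation, which is part of why the methods are an active research area.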

If you have had any exposure to animal breeding, you will recognise this as genomic selection, a technology that has been a boon to animal breeding in dairy cattle, pig, chicken, and to a lesser extent other industries in the last ten years or so (see review by Georges, Charlier & Hayes 2018). It’s only natural that human medical geneticists want to use the same idea to improve prediction of diseases. Unfortunately, it’s a bit harder to get genomic prediction to be useful for humans, for several reasons.

The piece touches on two important problems with genomic prediction in humans: First, DNA isn’t everything, so the polygenic scores will likely have to be combined with other risk factors in a joint model. It still seems to be an open question how useful genomic prediction will be for what diseases and in what contexts. Second, there are problems with population structure. Ken Richardson explains with an IQ example, but the broader point is that it is hard for the statistical models geneticists use to identify the causal effects in the flurry of spurious associations that are bound to exist in real data.

[A]ll modern societies have resulted from waves of migration by people whose genetic backgrounds are different in ways that are functionally irrelevant. Different waves have tended to enter the class structure at randomly different levels, creating what is called genetic population stratification. But different social classes also experience differences in learning opportunities, and much about the design of IQ tests, education, and so on, reflects those differences, irrespective of differences in learning ability as such. So some spurious correlations are, again, inevitable.

So, it may be really hard to get good genomic predictors that predict accurately. This is especially pressing for studies of adaptation, where researchers might use polygenic scores estimated in European populations to compare other populations, for example. Methods to get good estimates in the face of population structure are a big research topic in human, animal, and plant genetics alike. I wouldn’t be surprised if good genomic prediction in humans required both new method development and big genome-wide association studies that cover people from all over the world.

These problems are empirical research problems. Polygenic scores may be useful or not. They will probably need huge studies with lots of participants and new methods with smart statistical tricks. However, they are not threatened by conceptual problems with the word ‘gene’.

Richardson’s criticism is timely. We’d all like to think that anyone who uses polygenic scores would be responsible, pay attention to the literature about sensitivity to population structure, and not try to over-interpret average polygenic scores as some way to detect genetic differences between populations. But just the other week, an evolutionary psychology journal published a paper that did just that. There are ill-intentioned researchers around, and they enjoy wielding the credibility of fancy-sounding modern methods like polygenic scores.

Genetic variants can be causal, though

Now on to where I think the piece goes astray. Here is a description of genetic causation and how that is more complicated than it first seems:

Of course, it’s easy to see how the impression of direct genetic instructions arose. Parents “pass on” their physical characteristics up to a point: hair and eye color, height, facial features, and so on; things that ”run in the family.” And there are hundreds of diseases statistically associated with mutations to single genes. Known for decades, these surely reflect inherited codes pre-determining development and individual differences?

But it’s not so simple. Consider Mendel’s sweet peas. Some flowers were either purple or white, and patterns of inheritance seemed to reflect variation in a single ”hereditary unit,” as mentioned above. It is not dependent on a single gene, however. The statistical relation obscures several streams of chemical synthesis of the dye (anthocyanin), controlled and regulated by the cell as a whole, including the products of many genes. A tiny alteration in one component (a ”transcription factor”) disrupts this orchestration. In its absence the flower is white.

So far so good. This is one of the central ideas of quantitative genetics: most traits that we care about are complex, in that an individual’s trait value is affected by lots of genes of individually small effect, and to a large extent by environmental factors (that are presumably also many and subtle in their individual effects). Even relatively simple traits tend to be more complicated when you look closely. For example, almost none of the popular textbook examples of single gene traits in humans are truly influenced by variants at only one gene (Myths of human genetics). Most of the time they’re either unstudied or more complicated than that. And even Mendelian rare genetic diseases are often collections of different mutations in different genes that have similar effects.

This is what quantitative geneticists have been saying since the early 1900s (setting aside the details about the transcription factors, which is interesting in its own right, but not a crucial part of the quantitative genetic account). This is why genome-wide association studies and polygenic scores are useful, and why single-gene studies of ‘candidate genes’ picked based on their a priori plausible function is a thing of the past. But let’s continue:

This is a good illustration of what Noble calls ”passive causation.” A similar perspective applies to many ”genetic diseases,” as well as what runs in families. But more evolved functions—and associated diseases—depend upon the vast regulatory networks mentioned above, and thousands of genes. Far from acting as single-minded executives, genes are typically flanked, on the DNA sequence, by a dozen or more ”regulatory” sequences used by wider cell signals and their dynamics to control genetic transcription.

This is where it happens. We get a straw biochemist’s view of the molecular gene, where everything is due only to protein-coding genes that encode one single protein with one single function, and then he enumerates a lot of different exceptions to this view that are supposed to make us reject the gene concept: regulatory DNA (as in the quote above), dynamic gene regulation during development, alternative splicing that allows the same gene to make multiple protein isoforms, noncoding RNA genes that act without being turned into protein, somatic rearrangements in DNA, and even that similar genes may perform different functions in different species … However, the classical concept of a gene used in quantitative genetics is not the same as the molecular gene. Just because molecular biology and classical genetics both use the word ‘gene’, users of genome-wide association studies are not forced to commit to any particular view about alternative splicing.

It is true that there are ‘vast regulatory networks’ and interplay at the level of ‘the cell as a whole’, but that does not prevent some (or many) of the genes involved in the network from being affected by genetic variants that cause differences between individuals. That builds up to form genetic effects on traits, through pathways that are genuinely causal, ‘passive’ or not. There are many genetic variants and complicated indirect mechanisms involved. The causal variants are notoriously hard to find. They are still genuine causes. You can become a bit taller because you had great nutrition as a child rather than poor nutrition. You can become a bit taller because you carry certain genetic variants rather than others.