”Dangerous gene myths abound”

Philip Ball, who is a knowledgeable and thoughtful science writer, published a piece in the Guardian a couple of months ago about the misunderstood legacy of the human genome project: ”20 years after the human genome was first sequenced, dangerous gene myths abound”.

The human genome project published the draft reference genome for the human species 20 years ago. Ball argues, in short, that the project was oversold with promises that it couldn’t deliver, and consequently has not delivered. Instead, the genome project was good for other things that had more to do with technology development and scientific infrastructure. The sequencing of the human genome was the platform for modern genome science, but it didn’t, for example, cure cancer or uncover a complete set of instructions for building a human.

He also argues that the rhetoric of human genome hype, which did not end with the promotion of the human genome project (see the ENCODE robot punching cancer in the face, for example), is harmful. It is scientifically harmful because it oversimplifies modern genetics, and it is politically harmful because it aligns well with genetic determinism and scientific racism.

I believe Ball is entirely right about this.

Selling out

The breathless hype around the human genome project was embarrassing. Ball quotes some fragments, but you can go to the current human genome project site and enjoy quotes like ”it’s a transformative textbook of medicine, with insights that will give health care providers immense new powers to treat, prevent and cure disease”. This image has some metonymical truth to it — human genomics is helping medical science in different ways — but even as a metaphor, it is obviously false. You can go look at the human reference genome if you want, and you will discover that the ”text”, such as it is, looks more like this than a medical textbook:


This is a human Alu element from chromosome 17. It’s also in an intron of a gene, flanking a promoter, a few hundred basepairs away from an insulator (see the Ensembl genome browser) … All that is stuff that cannot be read from the sequence alone. You might be able to tell that it’s an Alu if you’re an Alu genius or run sequence recognition software, but there is no way to read the other contextual genomic information, and there is no way you can tell anything about human health by reading it.

I think Ball is right that this is part of a simplistic genetics that doesn’t appreciate the complexity of either quantitative or molecular genetics. In short, quantitative genetics, as a framework, says that inheritance of traits between relatives is due to thousands and thousands of genetic differences, each of them with tiny effects. Molecular genetics says that each of those genetic differences may operate through any of a dizzying selection of Rube Goldberg-esque molecular mechanisms, to the point where understanding one of them might be a lifetime of laboratory investigation.

Simple inheritance is essentially a fiction, or put more politely: a simple model that is useful as a step towards building up a better picture of inheritance. This is not new; the knowledge that everything of note is complex has been around since the beginning of genetics. Even rare genetic diseases understood as monogenic are caused by sometimes thousands of different variants that happen in a particular small subset of the genome. Really simple traits, in the sense of one variant–one phenotype, seldom happen in large mixing and migrating populations like humans; they may occur in crosses constructed in the lab, or in extremely structured populations like dog breeds, or possibly with balancing selection.
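The polygenic picture described above lends itself to a toy simulation (a sketch with made-up allele frequencies and effect sizes, not any real dataset): thousands of loci, each with a tiny additive effect, add up to smooth, continuous variation in a trait.

```python
import random

random.seed(1)

n_loci = 2_000         # thousands of variants, each with a tiny effect
n_individuals = 500

# Made-up allele frequencies and small additive effects for each locus.
freqs = [random.uniform(0.05, 0.95) for _ in range(n_loci)]
effects = [random.gauss(0, 0.01) for _ in range(n_loci)]

def trait_value():
    # An individual carries 0, 1 or 2 copies of the allele at each locus;
    # its trait value is the sum of the tiny per-locus contributions.
    return sum(effect * sum(random.random() < freq for _ in range(2))
               for freq, effect in zip(freqs, effects))

trait = [trait_value() for _ in range(n_individuals)]
```

No single locus is visible in the outcome, but the sum varies continuously between individuals, which is the quantitative-genetic point.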

Can you market thick sequencing?

Ball is also right about what was most useful about the human genome project: it enabled research at scale into human genetic variation, and it stimulated development of sequencing methods, both generating and using DNA sequence. Lowe (2018) talks about ”thick” sequencing, a notion of sequencing that includes associated activities like assembly, annotation and distribution to a community of researchers — on top of ”thin” sequencing as determination of sequences of base pairs. Thick sequencing better captures how genome sequencing is used and stimulates other research, and aligns with how sequencing is an iterative process, where reference genomes are successively refined and updated in the face of new data, expert knowledge and quality checking.

It is hard to imagine gene editing like CRISPR being applied in any human cell without a good genome sequence to help find out what to cut out and what to put in instead. It is hard to imagine the developments in functional genomics that all use short read sequencing as a read-out without having a good genome sequence to anchor the reads on. It is possible to imagine genome-wide association just based on very dense linkage maps, but it is a bit far-fetched. And so on.

Now, this raises a painful but interesting question: Would the genome project ever have gotten funded on reasonable promises and reasonable uncertainties? If not, how do we feel about the genome hype — necessary evil, unforgivable deception, something in-between? Ball seems to think that gene hype was an honest mistake and that scientists were surprised that genomes turned out to be more complicated than anticipated. Unlike him, I do not believe that most researchers honestly believed the hype — they must have known that they were overselling like crazy. They were no fools.

An example of this is the story about how many genes humans have. Ball writes:

All the same, scientists thought genes and traits could be readily matched, like those children’s puzzles in which you trace convoluted links between two sets of items. That misconception explains why most geneticists overestimated the total number of human genes by a factor of several-fold – an error typically presented now with a grinning “Oops!” rather than as a sign of a fundamental error about what genes are and how they work.

This is a complicated history. Gene number estimates have varied, but enjoy this passage from Lewontin in 1977:

The number of genes is not large

While higher organisms have enough DNA to specify from 100,000 to 1,000,000 proteins of average size, it appears that the actual number of cistrons does not exceed a few thousand. Thus, saturation lethal mapping of the fourth chromosome (Hochman, 1971) and the X chromosome (Judd, Shen and Kaufman, 1972) of Drosophila melanogaster make it appear that there is one cistron per salivary chromosome band, of which there are 5,000 in this species. Whether 5,000 is a large or small number of total genes depends, of course, on the degree of interaction of various cistrons in influencing various traits. Nevertheless, it is apparent that either a given trait is strongly influenced by only a small number of genes, or else there is a high order of gene interactions among developmental systems. With 5,000 genes we cannot maintain a view that different parts of the organism are both independent genetically and each influenced by a large number of gene loci.

I don’t know if underestimating by a few fold is worse than overestimating by a few fold (D. melanogaster has 15,000 protein-coding genes or so), but the point is that knowledgeable geneticists did not go around believing that there was a simple 1-to-1 mapping between genes and traits, or even between genes and proteins, at this time. I know Lewontin is a population geneticist, and in the popular mythology population geneticists are nothing but single-minded bean counters who do not appreciate the complexity of molecular biology … but you know, they were no fools.

The selfish cistron

One thing Ball gets wrong is evolutionary genetics, where he mixes genetic concepts that, really, have very little to do with each other despite superficially sounding similar.

Yet plenty remain happy to propagate the misleading idea that we are “gene machines” and our DNA is our “blueprint”. It is no wonder that public understanding of genetics is so blighted by notions of genetic determinism – not to mention the now ludicrous (but lucrative) idea that DNA genealogy tells you which percentage of you is “Scots”, “sub-Saharan African” or “Neanderthal”.

This passage smushes two very different parts of genetics together, that don’t belong together and have nothing to do with the preceding questions about how many genes there are or whether the DNA is a blueprint: The gene-centric view of adaptation, a way of thinking of natural selection where you imagine genetic variants (not organisms, not genomes, not populations or species) as competing for reproduction; and genetic genealogy and ancestry models, where you describe how individuals are related based on the genetic variation they carry. The gene-centric view is about adaptation, while genetic genealogy works because of effectively neutral genetic variation that just floats around, giving us a unique individual barcode due to the sheer combinatorics.
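The ”sheer combinatorics” is easy to make concrete. With n independent biallelic loci there are 3^n possible genotypes (0, 1 or 2 copies of the alternative allele at each locus), so even a small panel of common variants outstrips the number of humans who have ever lived. The numbers below are illustrative:

```python
# With n independent biallelic loci, each individual carries one of
# 3**n possible genotype combinations.
n_loci = 100                      # a tiny fraction of real human variation
combinations = 3 ** n_loci        # roughly 5e47

world_population = 8_000_000_000  # order of magnitude
print(combinations > world_population ** 2)  # True, by a vast margin
```

This is why effectively neutral variation works as an individual barcode: the genotype space is astronomically larger than any human population, past or present.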

He doesn’t elaborate on the gene machines, but it links to a paper (Ridley 1984) on Williams’ and Dawkins’ ”selfish gene” or ”gene-centric perspective”. I’ve been on about this before, but when evolutionary geneticists say ”selfish gene”, they don’t mean ”the selfish protein-coding DNA element”; they mean something closer to ”the selfish allele”. They are not committed to any view that the genome is a blueprint, or that only protein-coding genes matter to adaptation, or that there is a 1-to-1 correspondence between genetic variants and traits.

This is the problem with correcting misconceptions in genetics: it’s easy to chide others for being confused about the parts you know well, and then make a hash of some other parts that you don’t know very well yourself. Maybe when researchers say ”gene” in a context that doesn’t sound right to you, they have a different use of the word in mind … or they’re conceptually confused fools, who knows.


Lewontin, R. C. (1977). The relevance of molecular biology to plant and animal breeding. In International Conference on Quantitative Genetics. Ames, Iowa (USA). 16-21 Aug 1976.

Lowe, J. W. (2018). Sequencing through thick and thin: Historiographical and philosophical implications. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 72, 10-27.

Sequencing-based methods called Dart

Some years ago James Hadfield at Enseqlopedia made a spreadsheet of acronyms for sequencing-based methods with some 50 rows. I can only imagine how long it would be today.

The overloading of acronyms is becoming a bit ridiculous. I recently saw a paper about DART-seq, a method for detecting N6-methyladenosine in RNA (Meyer 2019), and thought, ”wait a minute, isn’t DART-seq a reduced representation genotyping method?” It is, only stylised as DArTseq (seriously). Apparently, it’s also a droplet RNA-sequencing method (Saikia et al. 2018).

What are these methods doing?

  • DArT, diversity array technology, is a way to enrich for a part of a genome. It was originally developed with array technology in mind (Jaccoud et al. 2001). They take some DNA, cut it with restriction enzymes, add adapters and amplify regions close to the cut. Then they clone the resulting DNA and attach it to a slide, and that gives a custom microarray of anonymous fragments from the genome. For the DArTseq version, it seems they make a sequencing library instead of going on to cloning (Ren et al. 2015). It falls in the same family as GBS and RAD-seq methods.
  • DART-seq, droplet-assisted RNA targeting, builds on Drop-seq, where they put single cells and beads that carry primers into the same oil droplet. As cells lyse, the RNA sticks to the primers. The beads also have a barcode so they can be identified in sequencing. Then they break the emulsion, reverse transcribe the RNA attached to the beads, amplify and sequence. That is cool. However, because they capture the RNA with oligo-dT primers, they sequence from the 3′ end of the RNA. The DART method adds gene-specific primers to the beads, so they can target particular RNAs and capture more of them. It’s the super-high-tech version of gene-specific primers for reverse transcription.
  • DART-seq, deamination adjacent to RNA modification targets, uses a synthetic fusion protein that combines APOBEC1, which deaminates cytidines, with a protein domain from YTHDF2 which binds N6-methyladenosine. If an RNA has N6-methyladenosine, cytidines that are close to it, as is usually the case with this base modification, will be deaminated to uracil. After RNA-sequencing, this will look like Cs next to As turning into Ts. Neat! It’s a little bit like bisulfite sequencing of methylated DNA, but with RNA.
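The read-out logic of the m6A-detecting DART-seq can be sketched as a toy comparison between a reference sequence and an edited read: a C that reads as T next to an A is a candidate deamination event. Everything here (function name, example sequences) is made up for illustration; the real method is of course quantitative and works on millions of reads.

```python
def candidate_m6a_sites(reference, edited):
    """Report positions where a C next to an A reads as T after editing."""
    sites = []
    for i, (ref_base, ed_base) in enumerate(zip(reference, edited)):
        if ref_base == "C" and ed_base == "T":
            # A C-to-T change flanking an A suggests deamination next to
            # a methylated adenosine.
            has_a_neighbour = (i > 0 and reference[i - 1] == "A") or \
                              (i + 1 < len(reference) and reference[i + 1] == "A")
            if has_a_neighbour:
                sites.append(i)
    return sites

# Toy reads in DNA alphabet; "A next door" stands in for the motif context.
print(candidate_m6a_sites("GGACTCAGG", "GGATTCAGG"))  # → [3]
```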

On the one hand: Don’t people search the internet before they name their methods, or do they not care? On the other hand, realistically, the genotyping method Dart and the single cell RNA-seq method Dart are unlikely to show up in the same work. If you can call your groups ”treatment” and ”control” for the purpose of a paper, maybe you can call your method ”Dart”, and no-one gets too confused.

Paper: ‘Sequence variation, evolutionary constraint, and selection at the CD163 gene in pigs’

This paper is sort of a preview of what is going to be a large series of empirical papers on pig genomics from a lot of people in our group.

The humble CD163 gene has become quite important, because the PRRS virus exploits it to enter macrophages when it infects a pig. It turns out that if you inactivate it — and there are several ways to go about that; a new one was even published right around the time of this paper (Chen et al. 2019) — you get a PRRSV-resistant pig. For obvious reasons, PRRSV-resistant pigs would be great for pig farmers.

In this paper, we wanted to figure out 1) if there were any natural knockout variants in CD163, and 2) if there was anything special about CD163 if you compare it to the rest of the genes in the pig genome. In short, we found no convincing knockout variants, and that CD163 seemed moderately variant intolerant, under positive selection in the lineage leading up to the pig, with no evidence of a selective sweep at CD163.

You can read the whole thing in GSE.

Figure 1, showing sequence variants detected in the CD163 gene.

If you are so inclined, this might lead on to the interesting but not very well defined open question of how we combine these different perspectives on selection in the genome, and how they go together with other genome features like mutation rate and recombination rate variation. There are some disparate threads to bring together there.

Johnsson, Martin, et al. Sequence variation, evolutionary constraint, and selection at the CD163 gene in pigs. Genetics Selection Evolution 50.1 (2018): 69.

Paper: ”Mixed ancestry and admixture in Kauai’s feral chickens: invasion of domestic genes into ancient Red Junglefowl reservoirs”

We have a new paper almost out (now in early view) in Molecular Ecology about the chickens on the Pacific island Kauai. These chickens are pretty famous for being everywhere on the island. Where do they come from? If you use your favourite search engine you’ll find an explanation with two possible origins: ancient wild birds brought over by the Polynesians and escaped domestic chickens. This post on Kauaiblog is great:

Hawaii’s official State bird is the Hawaiian Goose, or Nene, but on Kauai, everyone jokes that the “official” birds of the Garden Island are feral chickens, especially the wild roosters.

Wikepedia says the “mua” or red jungle fowl were brought to Kauai by the Polynesians as a source of food, thriving on an island where they have no real predators. /…/
Most locals agree that wild chickens proliferated after Hurricane Iniki ripped across Kauai in 1992, destroying chicken coops and releasing domesticated hens, as well as roosters being bred for cockfighting. Now these brilliantly feathered fowl inhabit every part of this tropical paradise, crowing at all hours of the day and night to the delight or dismay of tourists and locals alike.

In this paper, we look at phenotypes and genetics and find that this dual origin explanation is probably true.


(Chickens on Kauai. This is not from our paper, but by Jeff Trimble (cc:by-sa-nc) published on Flickr. There are so many pretty chicken pictures there!)

Dom, Eben, and Pamela went to Kauai to photograph, record, and collect DNA from the chickens. (I stayed at home and did sequence bioinformatics.) The Kauai chickens look and sound like a mixture of wild and domestic chickens. Some of them have the typical Junglefowl plumage, and others have flecks of white. Their crows vary in the length of the characteristic fourth syllable. Also, some of them have yellow legs, a trait that domestic chickens seem to have gotten not from the Red but from the Grey Junglefowl.

We looked at DNA sequences by massively parallel (SOLiD) sequencing of 23 individuals. We find mitochondrial sequences that fall in two haplogroups: E and D. The presence of the D haplogroup, which is the dominant one in ancient DNA sequences from the Pacific, means that there is a Pacific component to their ancestry. The E group, on the other hand, occurs in domestic chickens. It also shows up in some ancient DNA samples from the Pacific, but not from Kauai (and there is a scientific debate about these sequences). The nuclear genome analysis is pretty inconclusive. I think what we would need is some samples of possible domestic source populations (Where did the escaped chickens come from? Are there other traditional domestic sources?) and a better sampling of Red Junglefowl to make better sense of it.

When we take the plumage, vocalisation and mitochondrial DNA together, it looks like this is a feral admixed population of either Red Junglefowl or traditional Pacific chickens mixed with domestics. A very interesting population indeed.

Kenneth Chang wrote about the paper in the New York Times; the article includes quotes from Eben and Dom.

E Gering, M Johnsson, P Willis, T Getty, D Wright (2015) Mixed ancestry and admixture in Kauai’s feral chickens: invasion of domestic genes into ancient Red Junglefowl reservoirs. Molecular ecology

Morning coffee: cost per genome

I recently heard this thing referred to as ”the most overused slide in genomics” (David Klevebring). It might be: what it shows is some estimate of the cost of sequencing a human genome over time, and how it plummets around 2008. Before that, the curve is Sanger sequencing, and then the costs show second generation sequencing (454, Illumina and SOLiD).


The source is the US National Human Genome Research Institute, and they’ve put some thought into how to estimate costs so that machines, reagents, analysis and the people doing the work are included, and so that the different platforms are somewhat comparable. One must first point out that the downstream analysis needed to make any sense of the data (assembly and variant calling) isn’t included. But the most important thing this graph hides, even if the cost estimates were perfect, is that to ”sequence a genome” means something completely different in 2001 and in 2015. (Well, with third-generation sequencers that give long reads coming up, the old meaning might come back.)

For data since January 2008 (representing data generated using ‘second-generation’ sequencing platforms), the ”Cost per Genome” graph reflects projects involving the ‘re-sequencing’ of the human genome, where an available reference human genome sequence is available to serve as a backbone for downstream data analyses.

The human genome project was of course about sequencing and assembling the genome into high-quality sequences. Very few of the millions of human genomes resequenced since are anywhere close. As people in the sequencing loop know, resequencing with short reads doesn’t give you a genome sequence (and neither does trying to assemble a messy eukaryote genome with short reads only). It gives you a list of variants compared to the reference sequence. The usual short-read business has no way of detecting anything but single nucleotide variants and small indels. (And the latter depends … Also, you can detect copy number variants, but large-scale structural variants are mostly off the table.) Of course, you can use these edits to reconstruct a consensus sequence from the reference, but it would be a total lie.
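The point about variant lists versus genome sequences can be illustrated with a sketch (made-up reference and calls): applying SNV calls to the reference string produces a ”consensus”, but everything the variant caller could not see (structural variants, novel insertions) silently stays reference.

```python
reference = "ACGTACGTACGT"
snvs = {3: "A", 7: "C"}   # variant calls: position -> alternative base

# Substitute called positions; every other base is copied straight from
# the reference, whether or not the individual actually matches it there.
consensus = "".join(snvs.get(i, base) for i, base in enumerate(reference))
print(consensus)  # → ACGAACGCACGT
```

The consensus is only as honest as the variant calls, which is why calling it ”the individual’s genome sequence” would be a lie.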

Again, none of this is news for people who deal with sequencing, and I’m not knocking second-generation sequencing. It’s very useful and has made a lot of new things possible. It’s just something I think about every time I see that slide.

agcgaaaaagtggaaaacagcgaacgcattaacggc (This is how it’s done, part 4)

It is now about eight years since the human genome project was finished. Every now and then there is talk of the brave new world where everyone will be able to sequence their own genome. (What on earth we are supposed to do with our individual genome sequences is still an open question. I suggest printing them out on a roll of toilet paper about three thousand kilometres long to drag around.) Meanwhile, new organisms keep streaming in — most recently cacao and strawberry, I believe — that have been given a reference sequence. Newspapers usually call it ”mapping” or ”cracking the genetic code”, but they almost always mean sequencing.

Producing a reference sequence is still hard work, but it is when it is finished that the real sequencing can start. Then it becomes easy for those who are interested in some particular gene or candidate region to sequence it and find interesting genetic variants. And modern sequencing, which quickly reads DNA from many randomly chosen parts of the genome and then patches them together into a reasonably complete sequence, needs a good reference sequence.

Such massively parallel sequencing methods (with names like Illumina Genome Analyzer, SOLiD and Roche 454) are gaining ground fast at the moment. But Sanger sequencing is probably still the most important method, alongside the alternative, pyrosequencing. (There are other, earlier methods, and the by now more or less abandoned Maxam–Gilbert sequencing, but they are more laborious.) So — Sanger’s enzymatic chain-termination sequencing!

We start with a primer — just as in PCR, except only one is needed — a short DNA sequence that matches where we want to start reading. Then the DNA polymerase comes along, starts from the primer and builds a new strand that copies the old one. The new strand is built from nucleotides — one of the bases A, T, G, C attached to a sugar molecule. So far so good. Except that we have added a proportion of terminator nucleotides — nucleotides that are chemically slightly different and cannot be extended.

When the polymerase happens to insert a terminator nucleotide, that strand is done. The terminator nucleotide is also labelled with a fluorescent molecule in one of four colours, one for each base. The result is a mixture of strands of different lengths that fluoresce (give off light when illuminated) in different colours depending on which base they end with.

So, if we can only sort the DNA fragments by size, we can read off the sequence as a series of fluorescent signals. How do we sort by size? With gel electrophoresis! Sequencing machines usually use a thin glass capillary filled with gel and a fixed detector that the DNA fragments pass through. As they go by, the detector shines a laser on them and sees which colour shines back.
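The chain-termination procedure can be caricatured in a few lines of Python (a toy simulation, not a model of the real chemistry): extend strands along a template, terminate at random, and then read the terminal base of each fragment length in size order, just as the detector does.

```python
import random

random.seed(42)

template = "GATTACAGATTACA"   # the sequence we pretend to sequence
stop_fraction = 0.2           # proportion of terminator nucleotides

def one_strand():
    """Extend along the template until a terminator happens to go in."""
    strand = ""
    for base in template:
        strand += base
        if random.random() < stop_fraction:
            return strand      # chain terminated here
    return strand              # ran off the end of the template

# Run many reactions, then record the last base seen for each fragment
# length: that is what the detector reads as fragments pass by in size order.
fragments = {len(s): s[-1] for s in (one_strand() for _ in range(10_000))}
read = "".join(fragments[length] for length in sorted(fragments))
print(read == template)  # the size-ordered read-out reconstructs the template
```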

Then it draws a diagram with one curve for each colour, where the sequence emerges as a mountain range of peaks in different colours. (Or rather, a computer program on the computer connected to the machine does — if anyone knows of a nice free program for Sanger sequencing, tell me!)

The drawback of this method is that it can only sequence a fairly short stretch at a time, up to perhaps a thousand base pairs. The beginning of the sequence also tends to be hard to read. Constructing a reference sequence therefore takes a staggering number of sequencing reactions; the human genome is about three billion bases. On the other hand, a reaction is quick to run, and nowadays you can just put your sample (for example a purified PCR product) in an envelope and send it to one of the many companies that do sequencing for something like 100 kronor per reaction.

(The title is an example of the genetic code. Proteins are built from amino acids, each corresponding to a triplet of base pairs in a gene. The triplet is called a codon — and which codons correspond to which amino acid is what we call the genetic code. ”GCT” stands for the amino acid alanine, for example. In another code, a human one rather than a genetic one, each amino acid is abbreviated with a letter of the alphabet. The title sequence was created with the web program Reverse Translate — no, I can’t do it in my head. It can be translated back, for example, with Transeq.)
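Decoding the title takes only a few lines; the codon table below is trimmed to just the codons that actually occur in it (the standard genetic code has 64).

```python
# Minimal codon table covering just the codons in the title sequence.
codon_table = {
    "AGC": "S", "GAA": "E", "AAA": "K", "GTG": "V",
    "AAC": "N", "CGC": "R", "ATT": "I", "GGC": "G",
}

title = "agcgaaaaagtggaaaacagcgaacgcattaacggc".upper()
protein = "".join(codon_table[title[i:i + 3]]
                  for i in range(0, len(title), 3))
print(protein)  # → SEKVENSERING ("sequencing" in Swedish)
```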