Things that really don’t matter: megabase or megabasepair

Should we talk about physical distance in genetics as number of base pairs (kbp, Mbp, and so on) or bases (kb, Mb)?

I got into a discussion about this recently, and I said I’d continue the struggle on my blog. Here it is. Let me first say that I don’t think this matters at all, and if you make a big deal out of this (or whether ”data” can be singular, or any of those inconsequential matters of taste we argue about for amusement), you shouldn’t. See this blog post as an exorcism, helping me not to trouble my colleagues with my issues.

What I’m objecting to is mostly the inconsistency of talking about long stretches of nucleotides as ”kilobase” and ”megabase” but talking about short stretches as ”base pairs”. I don’t think it’s very common to call a 100 nucleotide stretch ”a 100 b sequence”; I would expect ”100 bp”. For example, if we look at Ensembl, they might describe a large region as 1 Mb, but if you zoom in a lot, they give length in bp. My impression is that this is a common practice. However, if you consistently use ”bases” and ”megabase”, more power to you.

Unless you’re writing a very specific kind of bioinformatics paper, the risk of confusion with the computer storage unit isn’t a problem. But there are some biological arguments.

A biological argument for ”base” might be that we care about the identity of the base, not the base pairing. We write down only one nucleotide when we write a nucleic acid sequence. The base pair is a different thing: that base bound to the one it’s paired with on the other strand. And if the DNA or RNA is single-stranded, the base isn’t paired at all.

Conversely, a biochemical argument for ”base pair” might be that in a double-stranded molecule, the base pair is the relevant informational unit. We may only write one base in our nucleotide sequence for convenience, but because of the rules of base pairing, we know the complementing pair. In this case, maybe we should use ”base” for single-stranded molecules.
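
To make the base pairing point concrete, here is a minimal Python sketch, assuming plain DNA with only A, C, G and T. The names `PAIRS`, `complement_strand` and `length_label` are made up for this illustration. It shows that one written strand is enough to recover the other, and that the choice of ”b” or ”bp” is only a label on the same count.

```python
# Watson-Crick pairing rules for DNA: given one strand, we know the other.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(seq: str) -> str:
    """Return the complementary strand, read 5' to 3' (the reverse complement)."""
    return "".join(PAIRS[base] for base in reversed(seq.upper()))


def length_label(seq: str, double_stranded: bool = True) -> str:
    """Label a length with 'bp' if double-stranded, 'b' if single-stranded."""
    unit = "bp" if double_stranded else "b"
    return f"{len(seq)} {unit}"


top = "ATGCGT"
print(top, complement_strand(top))                 # the written strand and its partner
print(length_label(top), "or", length_label(top, double_stranded=False))
```

Either way, the number in front of the unit is the same, which is rather the point of this post.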

If we consult two more or less trustworthy sources, The Encyclopedia of Life Sciences and Wiktionary, they both seem to take this view.

eLS says:

A megabase pair, abbreviated Mbp, is a unit of length of nucleic acids, equal to one million base pairs. The term ‘megabase’ (or Mb) is commonly used inter-changeably, although strictly this would refer to a single-stranded nucleic acid.

Wiktionary says:

A length of nucleic acid containing one million nucleotides (bases if single-stranded, base pairs if double-stranded)

Please return next week for the correct pronunciation of ”loci”.

Literature

Dear, P.H. (2006). Megabase Pair (Mbp). eLS.

If research is learning, how should researchers learn?

I’m taking a course on university pedagogy to, hopefully, become a better teacher. While reading about students’ learning and what teachers ought to do to facilitate it, I couldn’t help thinking about researchers’ learning, and what we ought to do to give ourselves a good learning environment.

Research is, largely, learning. First, a large part of any research work is learning what is already known, just not by me in particular; it’s a direct continuation of the learning that takes place in courses. While doing any research project, we learn the concepts other researchers use in this specific sub-subfield, and the relations between them. First to the extent that we can orient ourselves, and eventually to be able to make a contribution that is intelligible to others who work there. We also learn their priorities, attitudes and platitudes. (Seriously, I suspect you learn a lot about a sub-subfield by trying to make jokes about it.) We also learn to do something new: perform a laboratory procedure, a calculation, or something like that.

But more importantly, research is learning about things no-one knows yet. The idea of constructivist learning theory seems apt: We are constructing new knowledge, building on pre-existing structures. We don’t go out and read the book of nature; we take the concepts and relations of our sub-subfield of choice, and graft, modify and rearrange them into our new model of the subject.

If there is something to this, it means that old clichéd phrases like ”institution of higher learning”, scientists as ”students of X”, and so on, name a deeper analogy than it might seem. It also suggests that innovations in student learning might also be good building blocks for research group management. Should we be concept mapping with our colleagues to figure out where we disagree about the definition of ”developmental pleiotropy”? It also makes one wonder why meetings and departmental seminars often take the form of sage on the stage lectures.

Two distinguishing traits of science are that there are errors all the time and that almost no-one can reproduce anything

I got annoyed and tweeted:

”If you can’t reproduce a result, it isn’t science” … so we’re at that stage now, when we write things that sound righteous but are nonsense.

Hashtag subtweet, I guess. But it doesn’t matter who first wrote the sentence I was complaining about; they won’t care what I think, and I’m not out to debate them. I only think the quoted sentence makes sense if you take ”science” to mean ”the truth”. The relationship between science and reproducibility is messier than that.

The first clause could mean a few different things:

You have previously produced a result, but now, you can’t reproduce it when you try …

Then you might have done something wrong the first time, or the second time. This is an everyday occurrence in any type of research, one that probably happens to every postdoc every week. Not even purely theoretical results are safe. If the simulation is stochastic, one might have been interpreting noise. If there is an analytical result, one might have made an odd number of sign errors. In fact, it is a distinguishing trait of science that when we try to learn new things, there are errors all the time.

If that previous result is something that has been published, circulated to peers, and interpreted as if it was a useful finding, then that is unfortunate. The hypothetical you should probably make some effort to figure out why, and communicate that to peers. But it seems like a bad idea to suggest that because there was an error, you’re not doing science.

You personally can’t reproduce a result because you don’t have the expertise or resources …

Science takes a lot of skill and a lot of specialised technical stuff. I probably can’t reproduce even a simple organic chemistry experiment. In fact, it is a distinguishing trait of science that almost no-one can reproduce any of it, because it takes both expertise and special equipment.

No-one can ever reproduce a certain result even in principle …

It might still be science. The 1918 influenza epidemic will by the nature of time never happen again. Still, there is science about it.

You can’t reproduce someone else’s results when you try with a reasonably similar setup …

Of course, this is what the original authors of the sentence meant. When this turns out to be a common occurrence, as people systematically try to reproduce findings, there is clearly something wrong with the research methods scientists use: The original report may be the outcome of a meandering process of researcher degrees of freedom that produced a striking result that is unlikely to happen when the procedure is repeated, even with high fidelity. However, I would say that we’re dealing with bad science, rather than non-science. Reproducibility is not a demarcation criterion.

(Note: Some people reserve ”reproducibility” for the computational reproducibility of re-running someone’s analysis code and getting the same results. This was not the case with the sentence quoted above.)

Interpreting genome scans, with wisdom

Eric Fauman is a scientist at Pfizer who also tweets out interpretations of genome-wide association scans.

Background: There is a GWASbot twitter account which posts Manhattan plots with links for various traits from the UK Biobank. The bot was made by the Genetic Epidemiology lab at the Finnish Institute for Molecular Medicine and Harvard. The source of the results is these genome scans (probably; it’s a little bit opaque); the bot also links to heritability and genetic correlation databases. There is also an EnrichrBot that replies with enrichment of chromatin marks (Chen et al. 2013). Fauman comments on some of the genome scans on his Twitter account.

Here are a couple of recent ones:

And here is his list of these threads as a Google Document.

This makes me think of three things, two good, and one bad.

1. The ephemeral nature of genome scans

Isn’t it great that we’re now at a stage where a genome scan can be something to be tweeted or put en masse in a database, instead of published one paper per scan with lots of boilerplate? The researchers behind the genome scans say as much in their 2017 blog post on the first release:

To further enhance the value of this resource, we have performed a basic association test on ~337,000 unrelated individuals of British ancestry for over 2,000 of the available phenotypes. We’re making these results available for browsing through several portals, including the Global Biobank Engine where they will appear soon. They are also available for download here.

We have decided not to write a scientific article for publication based on these analyses. Rather, we have described the data processing in a detailed blog post linked to the underlying code repositories. The decision to eschew scientific publication for the basic association analysis is rooted in our view that we will continue to work on and analyze these data and, as a result, writing a paper would not reflect the current state of the scientific work we are performing. Our goal here is to make these results available as quickly as possible, for any geneticist, biologist or curious citizen to explore. This is not to suggest that we will not write any papers on these data, but rather only write papers for those activities that involve novel method development or more complex analytic approaches. A univariate genome-wide association analysis is now a relatively well-established activity, and while the scale of this is a bit grander than before, that in and of itself is a relatively perfunctory activity. [emphasis mine] Simply put, let the data be free.

That being said, when starting to write this post, I did at first miss having a paper. It was pretty frustrating to find a detailed description of the methods: after circling back and forth between the different pages that link to each other, I landed on the original methods post, which is informative, and written in a light conversational style. On the internet, one would fear that these links may rot and die eventually, and a paper would probably (but not necessarily …) be longer-lasting.

2. Everything is a genome scan, if you’re brave enough

Another thing that the GWAS bot drives home is that you can map anything that you can measure. The results are not always straightforward. On the other hand, even if the trait in question seems a bit silly, the results are not necessarily nonsense either.

There is a risk, for geneticists and non-geneticists alike, to reify traits based on their genetic parameters. If we can measure the heritability coefficient of something, and localise it in the genome with a genome-wide association study, it better be a real and important thing, right? No. The truth is that geneticists choose traits to measure the same way all researchers choose things to measure. Sometimes for great reasons with serious validation and considerations about usefulness. Sometimes just because. The GWAS bot also helpfully links to the UK Biobank website that describes the traits.

Look at that bread intake genome scan above. Here, ”bread intake” is the self-reported number of slices of bread eaten per week, as entered by participants on a touch screen questionnaire at a UK Biobank assessment centre. I think we can be sure that this number doesn’t reveal any particularly deep truth about bread and its significance to humanity. It’s a limited, noisy, context-bound number measured, I bet, because once you ask a battery of lifestyle questions, you’ll ask about bread too. Still, the strongest association is at a region that contains olfactory receptor genes and also shows up in two other scans about food (fruit and ice cream). The bread intake scan hits upon a nugget of genetic knowledge about human food preference. A small, local truth, but still.

Now replace bread intake with some more socially relevant trait, also imperfectly measured.

3. Lost, like tweets in rain

Genome scan interpretation is just that: interpretation. It means pulling together quantitative data, a knowledge of biology, previous literature, and writing an unstructured text, such as a Discussion section or a Twitter thread. This makes interpretations harder to organise, store and build on than the genome scans themselves. Sure, Fauman’s Twitter threads are linked from the above Google Document, and our Discussion sections are available from the library. But they’re spread out in different places, they mix (as they should) evidence with evaluation and speculation, and it’s not like we have a structured vocabulary for describing genetic mechanisms of quantitative trait loci, and the levels of evidence for them. Maybe we could, with genome-wide association study ontologies and wikis.

You’re not funny, but even if you were

Here is a kind of humour that is all too common in scientific communication; I’ll just show you the caricature, and I think you’ll recognize the shape of it:

Some slogan about how a married man is a slave or a prisoner kneeling and holding a credit card. Some joke where the denouement relies on: the perception that blondes are dumb, male preference for breast size, perceived associations between promiscuity and nationality, or anything involving genital size. Pretty much any one-panel cartoon taken from the Internet.

Should you find any of this in your own talk, here is a message to you: That may be funny to you; that isn’t the problem. To a fair number of the people who are listening, it’s likely to be trite, sad and annoying.

Humour totally has a place in academic speech and writing—probably more than one place. There is the laughter that is there to relieve tension. That is okay sometimes. There are jokes that are obviously put-downs. Those are probably only a good idea in private company, or in public forums where the object of derision is powerful enough that you’re not punching down, but powerless enough to not punch you back. Say, the ever-revered and long dead founder of your field—they may deserve a potshot at their bad manners and despicable views on eugenics.

Then there is that elusive ‘sudden perception of the incongruity between a concept and the real objects which have been thought through it in some relation’ (Schopenhauer, quoted in Stanford Encyclopedia of Philosophy). When humour is used right, a serious lecturer talking about serious issues has all kinds of opportunities to amuse the listener with incongruities between our expectations of things and what they really are like. So please don’t reveal yourself to be predictably trite.

Sequencing-based methods called Dart

Some years ago James Hadfield at Enseqlopedia made a spreadsheet of acronyms for sequencing-based methods with some 50 rows. I can only imagine how long it would be today.

The overloading of acronyms is becoming a bit ridiculous. I recently saw a paper about DART-seq, a method for detecting N6-methyladenosine in RNA (Meyer 2019), and thought, ”wait a minute, isn’t DART-seq a reduced representation genotyping method?” It is, only stylised as DArTseq (seriously). Apparently, it’s also a droplet RNA-sequencing method (Saikia et al. 2018).

What are these methods doing?

  • DArT, diversity array technology, is a way to enrich for a part of a genome. It was originally developed with array technology in mind (Jaccoud et al. 2001). They take some DNA, cut it with restriction enzymes, add adapters and amplify regions close to the cut. Then they clone the resulting DNA and attach it to a slide, which gives a custom microarray of anonymous fragments from the genome. For the DArTseq version, it seems they make a sequencing library instead of going on to cloning (Ren et al. 2015). It falls in the same family as GBS and RAD-seq methods.
  • DART-seq, droplet-assisted RNA targeting, builds on Drop-seq, where they put single cells and beads that carry primers into the same oil droplet. As cells lyse, the RNA sticks to the primer. The beads also have a barcode so they can be identified in sequencing. Then they break the emulsion, reverse transcribe the RNA attached to beads, amplify and sequence. That is cool. However, because they capture the RNA with oligo-dT primers, they sequence from the 3′ end of the RNA. The Dart method adds primers to the beads, so they can target some specific RNAs and amplify more of them. It’s the super-high-tech version of gene-specific primers for reverse transcription.
  • DART-seq, deamination adjacent to RNA modification targets, uses a synthetic fusion protein that combines APOBEC1, which deaminates cytidines, with a protein domain from YTHDF2, which binds N6-methyladenosine. If an RNA has N6-methyladenosine, any cytidines close to it (and there usually are some next to this base modification) will be deaminated to uracil. After RNA-sequencing, this will look like Cs next to As turning into Ts. Neat! It’s a little bit like bisulfite sequencing of methylated DNA, but with RNA.
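
To illustrate what ”Cs next to As turning into Ts” means for the last of these methods, here is a toy Python sketch. It is not the published analysis pipeline: it assumes a read that is already aligned to the reference without gaps, and the function name `c_to_t_sites` and the `window` parameter are invented for this example.

```python
def c_to_t_sites(reference: str, read: str, window: int = 1) -> list[int]:
    """Return 0-based positions where the reference has C, the read has T,
    and the reference has an A within `window` bases of that position."""
    if len(reference) != len(read):
        raise ValueError("toy example assumes an equal-length, gap-free alignment")
    sites = []
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base == "C" and read_base == "T":
            lo = max(0, i - window)
            hi = min(len(reference), i + window + 1)
            if "A" in reference[lo:hi]:  # is there an A nearby?
                sites.append(i)
    return sites


reference = "GGACTGCCG"
read = "GGATTGCTG"  # two C-to-T differences, only one of them next to an A
print(c_to_t_sites(reference, read))
```

A real analysis would also need replicates, controls and some statistics to separate editing from sequencing errors and ordinary variants; the sketch only shows the pattern one would look for.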

On the one hand: Don’t people search the internet before they name their methods, or do they not care? On the other hand, realistically, the genotyping method Dart and the single cell RNA-seq method Dart are unlikely to show up in the same work. If you can call your groups ”treatment” and ”control” for the purpose of a paper, maybe you can call your method ”Dart”, and no-one gets too confused.

‘Approaches to genetics for livestock research’ at IASH, University of Edinburgh

A couple of weeks ago, I was at a symposium on the history of genetics in animal breeding at the Institute for Advanced Studies in the Humanities, organized by Cheryl Lancaster. There were talks by two geneticists and two historians, and ample time for discussion.

First geneticists:

Gregor Gorjanc presented the very essence of quantitative genetics: the pedigree-based model. He illustrated this with graphs (in the sense of edges and vertices) and by predicting his own breeding value for height from trait values, and from his personal genomics results.
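
For readers who haven’t seen the pedigree-based model, here is a small Python sketch of the piece that turns a pedigree graph into numbers: the additive relationship matrix, built with the textbook tabular method. This is my illustration, not code from the talk; the function name `relationship_matrix` and the toy pedigree are made up, and it assumes that parents are listed before their offspring.

```python
def relationship_matrix(pedigree):
    """Additive (numerator) relationship matrix by the tabular method.

    `pedigree` is a list of (individual, sire, dam) tuples, with parents
    listed before their offspring and None for unknown parents.
    """
    index = {ind: i for i, (ind, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (_, sire, dam) in enumerate(pedigree):
        s = index.get(sire)  # None when the parent is unknown
        d = index.get(dam)
        # Relationship with each earlier individual j: the average of
        # j's relationships with the two parents of i.
        for j in range(i):
            a = 0.0
            if s is not None:
                a += 0.5 * A[j][s]
            if d is not None:
                a += 0.5 * A[j][d]
            A[i][j] = A[j][i] = a
        # Diagonal: one plus the inbreeding coefficient, which is half
        # the relationship between the parents.
        inbreeding = 0.5 * A[s][d] if s is not None and d is not None else 0.0
        A[i][i] = 1.0 + inbreeding
    return A


pedigree = [
    ("sire", None, None),
    ("dam", None, None),
    ("kid1", "sire", "dam"),
    ("kid2", "sire", "dam"),
    ("inbred", "kid1", "kid2"),
]
for row in relationship_matrix(pedigree):
    print(row)
```

In the pedigree-based model, this matrix describes how breeding values are expected to covary between relatives, which is what allows predicting one individual’s breeding value from the trait values of others.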

Then, yours truly gave this talk: ‘Genomics in animal breeding from the perspectives of matrices and molecules’. Here are the slides (only slightly mangled by Slideshare). This is the talk I was preparing for when I collected the quotes I posted a couple of weeks ago.

I talked about how there are two perspectives on genomics: you can think of genomes either as large matrices of ancestry indicators (statistical perspective) or as long strings of bases (sequence perspective). Both are useful, and give animal breeders and breeding researchers different tools (genomic selection, reference genomes). I also talked about potential future breeding strategies that use causative variants, and how they’re not about stopping breeding and designing the perfect animal in a lab, but about supplementing genomic selection in different ways.
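
As a toy illustration of the two perspectives, here is a Python sketch that takes the same made-up data from strings to a matrix. It is a simplification: the matrix holds 0/1 indicators of differing from a reference at variable sites, rather than ancestry indicators in the strict sense, and the function name `sequences_to_matrix` and the sequences are invented for the example.

```python
def sequences_to_matrix(reference, haplotypes):
    """Turn aligned sequences into a 0/1 matrix over the variable sites.

    Returns the variable positions (0-based) and a matrix with one row per
    haplotype and one column per variable site, where 1 means the haplotype
    differs from the reference at that site.
    """
    variable = [
        pos
        for pos in range(len(reference))
        if any(hap[pos] != reference[pos] for hap in haplotypes)
    ]
    matrix = [
        [int(hap[pos] != reference[pos]) for pos in variable]
        for hap in haplotypes
    ]
    return variable, matrix


reference = "ACGTAC"                          # sequence perspective: a string of bases
haplotypes = ["ACGTAC", "ATGTAC", "ATGTAG"]   # three individuals' sequences
positions, matrix = sequences_to_matrix(reference, haplotypes)
print(positions)                              # which sites vary
for row in matrix:                            # statistical perspective: rows of indicators
    print(row)
```

The matrix throws away most of the sequence and keeps only what differs between individuals, which is exactly what makes it convenient for statistics and useless for, say, designing a primer.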

Then, historians:

Cheryl Lancaster told the story of how ABGRO, the Animal Breeding and Genetics Research Organisation in Edinburgh, lost its G. The organisation was split up in the 1950s, separating fundamental genetics research and animal breeding. She said that she had expected this split to be due to scientific, methodological or conceptual differences, but instead found, when going through the archives, that it was all due to personal conflicts. She also got into how the ABGRO researchers justified their work, framing it as ”fundamental research”, and aspired to do long term research projects.

Jim Lowe talked about the pig genome sequencing and mapping efforts, how it was different from the human genome project in organisation, and how it used comparisons to the human genome a lot. Here he’s showing a photo of Alan Archibald using the gEVAL genome browser to quality-check the pig genome. He also argued that the infrastructural outcomes of a project like the human genome project, such as making it possible for pig genome scientists to use the human genome for comparisons, are more important and less predictable than usually assumed.

The discussion included comments by some of the people who were there (Chris Haley, Bill Hill), discussion about the breed concept, and what scientists can learn from history.

What is a breed? Is it a genetical thing, defined by grouping individuals based on their relatedness, a historical thing, based on what people think a certain kind of animal is supposed to look like, or a marketing tool, naming animals that come from a certain system? It is probably a bit of everything. (I talked with Jim Lowe during lunch; he had noticed how I referred to Griffith & Stotz for gene concepts, but omitted the ”post-genomic” gene concept they actually favour. This is because I didn’t find it useful for understanding how animal breeding researchers think. It is striking how comfortable biologists are with using fuzzy concepts that can’t be defined in a way that covers all corner cases, because biology doesn’t work that way. If the nominal gene concept is broken by trans-splicing, practicing genomicists will probably think of that more as a practical issue with designing gene databases than as something that invalidates talking about genes in principle.)

What would researchers like to learn from history? Probably how to succeed with large research endeavors and how to get funding for them. Can one learn that from history? Maybe not, but there might be lessons about thinking of research as ”basic”, ”fundamental”, ”applied” etc, and about what the long term effects of research might be.