Various positions II

Again, what good is a blog if you can’t post your arbitrary idiosyncratic opinions as if you were an authority?

Don’t make a conference app

I get it, you can’t print a full-blown paper program book: it is too much, no one reads it, and it feels wasteful. But please, please, for the love of everything holy, don’t make an app. Put the text, straight up, on a website in plaintext. It loads quickly, it’s searchable, it can be automatically generated. The conference app will be clunky, take up space on the phone, eat bandwidth on some strained mobile contract, and invariably freeze.

Posters, still bad in 2020

Don’t believe the lies: a once-folded canvas poster will never look good again. You haven’t had fun at a conference until you’ve tried ironing a poster on a hostel floor with an iron that belongs in a museum.

Poster sessions are bad by necessity. To give them the space and time to be anything other than a crowded mess, the conference would have to accept substantially fewer posters. That means fewer participants, probably especially early-career participants, and the value of having them outweighs the value of a somewhat better poster session.

Gene accession numbers

PLOS Genetics has a great policy in their submission guidelines that doesn’t seem to get followed very much in papers they actually publish. This should be the norm in every genetics paper. I feel bad that it’s not the case in all my papers.

As much as possible, please provide accession numbers or identifiers for all entities such as genes, proteins, mutants, diseases, etc., for which there is an entry in a public database, for example:

Ensembl
Entrez Gene
FlyBase
InterPro
Mouse Genome Database (MGD)
Online Mendelian Inheritance in Man (OMIM)
PubChem

Identifiers should be provided in parentheses after the entity on first use.

In the future, with the right ontologies and repositories in place, I hope this will be the case with traits, methods and so on as well.

UK Biobank and dbGaP are not open data

And that is fine.

Stop it with the work-life balance tweets

No-one should tweet about work-life balance; whether you write about how much you work or how diligent you are about your hours, it comes off as bragging.

Tenses

Write your papers in the past or present tense, whichever you prefer. In the context of a scientific paper, the difference between past and present communicates nothing. I suppose you’re not supposed to mix tenses, but that doesn’t matter either. Most readers probably won’t notice. If you ask me about my stylistic opinion: present tense for everything. But again, it doesn’t matter.

A partial success

In 2010, Poliseno & co published some results on the regulation of a gene by a transcript from a pseudogene. Now, Kerwin & co have published a replication study, the protocol for which came out in 2015 (Khan et al). An editor summarises it like this in an accompanying commentary (Calin 2020):

The partial success of a study to reproduce experiments that linked pseudogenes and cancer proves that understanding RNA networks is more complicated than expected.

I guess he means ”partial success” in the sense that they partially succeeded in performing the replication experiments they wanted. These experiments did not reproduce the gene regulation results from 2010.

Seen from the outside (I have no insight into what is going on here or who the people involved are), something is not working here. If it takes five years from paper to replication effort, and then another five years to replication study accompanied by an editorial commentary that subtly undermines it, we can’t expect replication studies to update the literature, can we?

Communication

What’s the moral of the story, according to Calin?

What are the take-home messages from this Replication Study? One is the importance of fruitful communication between the laboratory that did the initial experiments and the lab trying to repeat them. The lack of such communication – which should extend to the exchange of protocols and reagents – was the reason why the experiments involving microRNAs could not be reproduced. The original paper did not give catalogue numbers for these reagents, so the wrong microRNA reagents were used in the Replication Study. The introduction of reporting standards at many journals means that this is less likely to be an issue for more recent papers.

There is something right and something wrong about this. On the one hand, talking to your colleagues in the field obviously makes life easier. We would like researchers to put all pertinent information in writing, and we would like there to be good communication channels in cases where the information turns out not to be what the reader needed. On the other hand, we don’t want science to be esoteric. We would like experiments to be reproducible without the special artifact or secret sauce. If nothing else, because people’s time and willingness to provide tech support for their old papers might be limited. Of course, this is hard, in a world where the reproducibility of an experiment might depend on the length of digestion (Hines et al 2014) or that little plastic thingamajig you need for the washing step.

Another take-home message is that it is finally time for the research community to make raw data obtained with quantitative real-time PCR openly available for papers that rely on such data. This would be of great benefit to any group exploring the expression of the same gene/pseudogene/non-coding RNA in the same cell line or tissue type.

This is true. You know how doctored, or just poor, Western blots are a notorious issue in the literature? I don’t think that’s because Western blot as a technique is exceptionally bad, but because there is a culture of showing the raw data (the gel), so people can notice problems. However, even if I’m all for showing real-time PCR amplification curves (as well as melting curves, standard curves, and the actual batch and plate information from the runs), I doubt that it’s going to be possible to trouble-shoot PCR retrospectively from those curves. Maybe sometimes one would be able to spot a PCR that looks iffy, but beyond that, I’m not sure what we would learn. PCR issues are likely to have to do with subtle things like primer design, reaction conditions and handling that can only really be tackled in the lab.

The world is messy, alright

Both the commentary and the replication study (Kerwin et al 2020) are cautious when presenting their results. I think it reads as if the authors themselves either don’t truly believe their failure to replicate or are bending over backwards to acknowledge everything that could have gone wrong.

The original study reported that overexpression of PTEN 3’UTR increased PTENP1 levels in DU145 cells (Figure 4A), whereas the Replication Study reports that it does not. …

However, the original study and the Replication Study both found that overexpression of PTEN 3’UTR led to a statistically significant decrease in the proliferation of DU145 cells compared to controls.

In the original study Poliseno et al. reported that two microRNAs – miR-19b and miR-20a – suppress the transcription of both PTEN and PTENP1 in DU145 prostate cancer cells (Figure 1D), and that the depletion of PTEN or PTENP1 led to a statistically significant reduction in the corresponding pseudogene or gene (Figure 2G). Neither of these effects were seen in the Replication Study. There are many possible explanations for this. For example, although both studies used DU145 prostate cancer cells, they did not come from the same batch, so there could be significant genetic differences between them: see Andor et al. (2020) for more on cell lines acquiring mutations during cell cultures. Furthermore, one of the techniques used in both studies – quantitative real-time PCR – depends strongly on the reagents and operating procedures used in the experiments. Indeed, there are no widely accepted standard operating procedures for this technique, despite over a decade of efforts to establish such procedures (Willems et al., 2008; Schwarzenbach et al., 2015).

That is, both the commentary and the replication study seem to subscribe to a view of the world where biology is so rich and complex that both results might be right, conditional on unobserved moderating variables. This is true, but it throws us into a discussion of generalisability. If a result only holds in some genotypes of DU145 prostate cancer cells, which might very well be the case, does it generalise enough to be useful for cancer research?

Power underwhelming

There is another possible view of the world, though … Indeed, biology is rich and complicated, but in the absence of accurate estimates, we don’t know which of all these potential moderating variables actually do anything. The first order of business, before we start imagining scenarios that might explain the discrepancy, is to get a really good estimate of it. How do we do that? It’s hard, but how about starting with a cell size greater than N = 5?

The registered report contains power calculations, which is commendable. As far as I can see, it does not describe how they arrived at the assumed effect sizes. Power estimates for a study design depend on the assumed effect sizes. Small studies tend to exaggerate effect sizes (because if an estimate is small, the difference can’t be significant). This means that taking the published estimates as starting effect sizes might leave you with a design that is still unable to detect a true effect of reasonable size.
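The exaggeration is easy to see in a small simulation. This is only an illustrative sketch, not anything taken from the registered report: the true effect of 0.5 standard deviations, the unit variance in both groups, and the function name are all assumptions, and 2.306 is the two-sided 5% critical value of a t distribution with 8 degrees of freedom, which matches five samples per group.

```python
import random
import statistics

T_CRIT_DF8 = 2.306  # two-sided 5% critical value of Student's t with 8 df


def simulate_estimates(true_effect=0.5, n_per_group=5, n_sims=20000,
                       t_crit=T_CRIT_DF8, seed=1):
    """Simulate two-group experiments and return (all, significant) estimates.

    Assumed setup (not from the replication study): normally distributed
    measurements with standard deviation 1 in both groups.
    t_crit must match the degrees of freedom, 2 * n_per_group - 2.
    """
    rng = random.Random(seed)
    all_estimates, significant = [], []
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        diff = statistics.mean(treated) - statistics.mean(control)
        # Pooled-variance standard error for equal group sizes
        pooled_var = (statistics.variance(control)
                      + statistics.variance(treated)) / 2
        se = (2 * pooled_var / n_per_group) ** 0.5
        all_estimates.append(diff)
        if abs(diff / se) > t_crit:
            significant.append(diff)
    return all_estimates, significant


all_estimates, significant = simulate_estimates()
power = len(significant) / len(all_estimates)
print(f"share significant: {power:.2f}")
print(f"mean of all estimates: {statistics.mean(all_estimates):.2f}")
print(f"mean of significant estimates: {statistics.mean(significant):.2f}")
```

Under these assumptions, only a minority of the simulated experiments reach significance, and the ones that do overestimate the effect on average. Plugging such an estimate into a power calculation makes the planned design look better than it is.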

I don’t know what effect sizes one should expect in these kinds of experiments, but my intuition would be that even if you think that you can get good power with a handful of samples per cell, can’t you please run a couple more? We are all limited by resources and time, but if you’re running something like a qPCR, the cost per sample must be much smaller than the cost for doing one run of the experiment in the first place. It’s really not as simple as adding one row on a plate, but almost.
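To get a feel for what a couple more samples buy, here is a rough sketch of power as a function of samples per cell. It uses a normal approximation, which is optimistic for groups this small, and the standardised effect size of 1.0 is an arbitrary assumption for illustration, not a value from either study.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses a normal approximation, which is optimistic for very small
    groups; effect_size is the difference in means in units of the
    within-group standard deviation (an assumption of this sketch).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the assumed effect
    shift = effect_size * (n_per_group / 2) ** 0.5
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


for n in (5, 8, 12, 20):
    print(n, round(approx_power(1.0, n), 2))
```

The absolute numbers depend entirely on the assumed effect size, but the shape of the curve is the point: at a handful of samples per cell, each extra sample changes the power noticeably.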

Literature

Calin, George A. ”Reproducibility in Cancer Biology: Pseudogenes, RNAs and new reproducibility norms.” eLife 9 (2020): e56397.

Hines, William C., et al. ”Sorting out the FACS: a devil in the details.” Cell Reports 6.5 (2014): 779-781.

Kerwin, John, and Israr Khan. ”Replication Study: A coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” eLife 9 (2020): e51019.

Khan, Israr, et al. ”Registered report: a coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” eLife 4 (2015): e08245.

Poliseno, Laura, et al. ”A coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” Nature 465.7301 (2010): 1033-1038.

Better posters are nice, but we need better poster session experiences

Fear and loathing in the conference centre lobby

Let me start from a negative place, because my attitude to poster sessions is negative. Poster sessions are neither good ways to communicate science, nor to network at conferences. Moreover, they are unpleasant.

The experience of going to a poster session, as an attendee or a presenter, goes something like this: You have to stand in a crowded room that is too loud and try to either read technical language or hold a conversation about a difficult topic. Even without anxiety, mobility, or hearing difficulties, a poster session is unlikely to be enjoyable or efficient.

Poster sessions are bad because of necessities of conference organisation. We want to invite many people, but we can’t fit in many talks; we get crowded poster sessions.

They are made worse by efforts to make them better, such as mandating presenters to stand by their posters, in some cases on pain of some sanction by the organisers, or to have the poster presenters act as dispensers of alcohol. If you need to threaten or drug people to participate in an activity, that might be a sign.

They are made not worse but a bit silly, by assertions that poster sessions are of utmost importance for conferencing. Merely stating that the poster session is vibrant and inspiring, or that you want to emphasise the poster as an important form of communication, sadly, does not make it so, if the poster sessions are still business as usual.

Mike Morrison’s ”Better Scientific Poster” design

As you can see above, my diagnosis of the poster session problem is partly that you’re forced to read walls of text or listen to mini-lectures, and partly that it happens in an overcrowded space. The walls of text and mini-lectures might be improved by poster design.

Enter the Better Scientific Poster. I suggest clicking on that link and looking at the poster templates directly. I waited too long to look at the actual template files, because I expected a bunch of confusing designer stuff. It’s not. They contain their own documentation and examples.

There is also a video on YouTube expanding on the thinking behind the design, but I think this conversation on the Everything Hertz podcast is the best introduction, if you need an introduction beyond the template. The YouTube video doesn’t go into enough detail, and is also a bit outdated. The poster template has gone through improvements since.

If you want to hear the criticisms of the design, here’s a blog post summarising some of it. In short, it is unscientific and intellectually arrogant to put a take home message in too large a font, and it would be boring if all posters used the same template. Okay.

The caveats

I am not a designer, which should be abundantly clear to everyone. I don’t really know what good graphic design principles for a poster are.

There is also no way to satisfy everyone. Some people will think you’ve put too little on the poster unless it ”tells the full story” and has a self-contained description of the methods with all caveats. Some people, like me, will think you’ve put way too much on it long before that.

What I like, however, is that Morrison’s design is based on an analysis of the poster session experience that aligns with mine, and that it is based on a goal for the poster that makes sense. The features of the design flow from that goal. If you listen to the video or the Hertz episode: Morrison has thought about the purpose of the poster.

He’s not just expressing some wisdom his PhD supervisor told him in a stern voice, or what his gut feeling tells him, which I suspect are the two sources that scientists’ advice on communication is usually based on. We all think that poster sessions are bad, because we’ve been to poster sessions. We usually don’t have thought-through ideas about what to do better.

Back to a place of negativity

For those reasons, I think the better poster is likely to be an improvement. I was surprised that I didn’t see it sweep through poster sessions at the conferences I went to last summer, but there were a few. I was going to try it for TAGC 2020 (here is my poster about the genetics of recombination rate in the pig), but that moved online, which made poster presentations a little different.

However, changing up poster layout can only get you so far. Unless someone has a stroke of genius to improve the poster viewing experience or change the economics of poster attendance, there is no bright future for the poster session. Individually, the rational course of action isn’t to fiddle with the design and spend time to squeeze marginal improvements out of our posters. It is to spend as little time as possible on posters, ignoring our colleagues’ helpful advice on how to make them prettier and more scientific, and lowering our expectations.

Virtual animal breeding journal club: ”Structural equation models to disentangle the biological relationship between microbiota and complex traits …”

The other day was the first Virtual breeding and genetics journal club organised by John Cole. This was the first online journal club I’ve attended (shocking, given how many video calls I’ve been on for other sciencey reasons), so I thought I’d write a little about it: both the format and the paper. You can look at the slide deck from the journal club here (pptx file).

The medium

We used Zoom, and that seemed to work, as I’m sure anything else would, if everyone just mutes their microphone when they aren’t speaking. As John said, the key feature of Zoom seems to be the ability for the host to mute everyone else. During the call, I think we were at most 29 or so people, but only a handful spoke. It will probably get more intense with the turn taking if more people want to speak.

The format

John started the journal club with a code of conduct, which I expect helped to set what I felt was a good atmosphere. In most journal clubs I’ve been in, I feel like the atmosphere has been pretty good, but I think we’ve all heard stories about hyper-critical and hostile journal clubs, and that doesn’t sound particularly fun or useful. On that note, one of the authors, Oscar González-Recio, was on the call and answered some questions.

The paper

Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

The authors measured methane emissions (by analysing breath with an infrared gas monitor) and abundance of different microbes in the rumen (with Nanopore sequencing) from dairy cows. They genotyped the animals for relatedness.

They analysed the genetic relationship between breath methane and abundance of each taxon of microbe, individually, with either:

  • a bivariate animal model;
  • a structural equations model that allows for a causal effect of abundance on methane, capturing the assumption that the abundance of a taxon can affect the methane emission, but not the other way around.

They used them to estimate heritabilities of abundances and genetic correlations between methane and abundances, and in the case of the structural model: conditional on the assumed causal model, the effect of that taxon’s abundance on methane.
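To make the difference between the two models concrete, here is a generic sketch. The notation is an assumption for illustration, not the paper’s, and it leaves out fixed effects: y_1 is the abundance of a taxon, y_2 is breath methane, a and e are additive genetic and residual effects, and lambda is the assumed causal effect of abundance on methane.

```latex
\begin{aligned}
\text{Bivariate animal model:}\quad
  y_1 &= \mu_1 + a_1 + e_1, &
  y_2 &= \mu_2 + a_2 + e_2 \\
\text{Recursive model:}\quad
  y_1 &= \mu_1 + a_1 + e_1, &
  y_2 &= \mu_2 + \lambda\, y_1 + a_2 + e_2
\end{aligned}
```

In both cases the genetic effects a_1 and a_2 are allowed to covary. The recursive model adds the parameter lambda, which is what gets interpreted as the effect of abundance on methane, conditional on the assumed causal direction.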

My thoughts

It’s cool how there’s a literature building up on genetic influences on the microbiome, with some consistency across studies. These intense high-tech studies on relatively few cattle might build up to finding new traits and proxies that can go into larger scale phenotyping for breeding.

As the title suggests, the paper advocates for using the structural equations model: ”Genetic correlation estimates revealed differences according to the usage of non‐recursive and recursive models, with a more biologically supported result for the recursive model estimation.” (Conclusions)

While I agree that a priori, it makes sense to assume a structural equations model with a causal structure, I don’t think the results provide much evidence that it’s better. The estimates of heritabilities and genetic correlations from the two models are near indistinguishable. Here is the key figure 4, comparing genetic correlation estimates:

[Figure 4 of Saborío‐Montero et al. (2020): genetic correlation estimates from the two models]

As you can see, there are a couple of examples of genetic correlations where the point estimate switches sign, and one of them (Succinivibrio sp.) where the credible intervals don’t overlap. ”Recursive” is the structural equations model. The error bars are 95% credible intervals. This is not strong evidence of anything; the authors are responsible about it and don’t go into interpreting this difference. But let us speculate! They write:

All genera in this case, excepting Succinivibrio sp. from the Proteobacteria phylum, resulted in overlapped genetic correlations between the non‐recursive bivariate model and the recursive model. However, high differences were observed. Succinivibrio sp. showed the largest disagreement changing from positively correlated (0.08) in the non‐recursive bivariate model to negatively correlated (−0.20) in the recursive model.

Succinivibrio are also the taxon with the estimated largest inhibitory effect on methane (from the structural equations model).

While some taxa, such as ciliate protozoa or Methanobrevibacter sp., increased the CH4 emissions …, others such as Succinivibrio sp. from Proteobacteria phylum decreased it

Looking at the paper that first described these bacteria (Bryant & Small 1956), Succinivibrio were originally isolated from the cattle rumen, and they got their name because ”they ferment glucose with the production of a large amount of succinic acid”. Bryant & Small ran a fermentation experiment to see what came out, and it seems that the bacteria don’t produce methane:

[Table 2 of Bryant & Small (1956): fermentation products]

This is also in line with an rRNA sequencing study of high and low methane emitting cows (Wallace & al 2015) that found lower Succinivibrio abundance in high methane emitters.

We may speculate that Succinivibrio species could be involved in diverting energy from methanogens, and thus reducing methane emissions. If that is true, then the structural equations model estimate (a larger negative genetic correlation between Succinivibrio abundance and methane) might be better than the one from the animal model.

Finally, while I’m on board with the a priori argument for using a structural equations model, as with other applications of causal modelling (gene networks, Mendelian randomisation etc), it might be dangerous to consider only parts of the system independently, where the microbes are likely to have causal effects on each other.

Literature

Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

Wallace, R. John, et al. ”The rumen microbial metagenome associated with high methane production in cattle.” BMC genomics 16.1 (2015): 839.

Bryant, Marvin P., and Nola Small. ”Characteristics of two new genera of anaerobic curved rods isolated from the rumen of cattle.” Journal of bacteriology 72.1 (1956): 22.

Things that really don’t matter: megabase or megabasepair

Should we talk about physical distance in genetics as number of base pairs (kbp, Mbp, and so on) or bases (kb, Mb)?

I got into a discussion about this recently, and I said I’d continue the struggle on my blog. Here it is. Let me first say that I don’t think this matters at all, and if you make a big deal out of this (or whether ”data” can be singular, or any of those inconsequential matters of taste we argue about for amusement), you shouldn’t. See this blog post as an exorcism, helping me not to trouble my colleagues with my issues.

What I’m objecting to is mostly the inconsistency of talking about long stretches of nucleotides as ”kilobase” and ”megabase” but talking about short stretches as ”base pairs”. I don’t think it’s very common to call a 100 nucleotide stretch ”a 100 b sequence”; I would expect ”100 bp”. For example, if we look at Ensembl, they might describe a large region as 1 Mb, but if you zoom in a lot, they give length in bp. My impression is that this is a common practice. However, if you consistently use ”bases” and ”megabase”, more power to you.

Unless you’re writing a very specific kind of bioinformatics paper, the risk of confusion with the computer storage unit isn’t a problem. But there are some biological arguments.

A biological argument for ”base” might be that we care about the identity of the base, not the base pairing. We note only one nucleotide down when we write a nucleic acid sequence. The base pair is a different thing: that base together with the one it’s paired with on the other strand. And if the DNA or RNA is single-stranded, the base isn’t paired at all.

Conversely, a biochemical argument for ”base pair” might be that in a double-stranded molecule, the base pair is the relevant informational unit. We may only write one base in our nucleotide sequence for convenience, but because of the rules of base pairing, we know the complementing pair. In this case, maybe we should use ”base” for single-stranded molecules.

If we consult two more or less trustworthy sources, The Encyclopedia of Life Sciences and Wiktionary, they both seem to take this view.

eLS says:

A megabase pair, abbreviated Mbp, is a unit of length of nucleic acids, equal to one million base pairs. The term ‘megabase’ (or Mb) is commonly used interchangeably, although strictly this would refer to a single-stranded nucleic acid.

Wiktionary says:

A length of nucleic acid containing one million nucleotides (bases if single-stranded, base pairs if double-stranded)

Please return next week for the correct pronunciation of ”loci”.

Literature

Dear, P.H. (2006). Megabase Pair (Mbp). eLS.

If research is learning, how should researchers learn?

I’m taking a course on university pedagogy to, hopefully, become a better teacher. While reading about students’ learning and what teachers ought to do to facilitate it, I couldn’t help thinking about researchers’ learning, and what we ought to do to give ourselves a good learning environment.

Research is, largely, learning. First, a large part of any research work is learning what is already known, not just by me in particular; it’s a direct continuation of learning that takes place in courses. While doing any research project, we learn the concepts other researchers use in this specific sub-subfield, and the relations between them. First to the extent that we can orient ourselves, and eventually to be able to make a contribution that is intelligible to others who work there. We also learn their priorities, attitudes and platitudes. (Seriously, I suspect you learn a lot about a sub-subfield by trying to make jokes about it.) We also learn to do something new: perform a laboratory procedure, a calculation, or something like that.

But more importantly, research is learning about things no-one knows yet. The idea of constructivist learning theory seems apt: We are constructing new knowledge, building on pre-existing structures. We don’t go out and read the book of nature; we take the concepts and relations of our sub-subfield of choice, and graft, modify and rearrange them into our new model of the subject.

If there is something to this, it means that old clichéd phrases like ”institution of higher learning”, scientists as ”students of X”, and so on, name a deeper analogy than it might seem. It also suggests that innovations in student learning might also be good building blocks for research group management. Should we be concept mapping with our colleagues to figure out where we disagree about the definition of ”developmental pleiotropy”? It also makes one wonder why meetings and departmental seminars often take the form of sage on the stage lectures.

Two distinguishing traits of science are that there are errors all the time and that almost no-one can reproduce anything

I got annoyed and tweeted:

”If you can’t reproduce a result, it isn’t science” … so we’re at that stage now, when we write things that sound righteous but are nonsense.

Hashtag subtweet, I guess. But it doesn’t matter who first wrote the sentence I was complaining about; they won’t care what I think, and I’m not out to debate them. I only think the quoted sentence makes sense if you take ”science” to mean ”the truth”. The relationship between science and reproducibility is messier than that.

The first clause could mean a few different things:

You have previously produced a result, but now, you can’t reproduce it when you try …

Then you might have done something wrong the first time, or the second time. This is an everyday occurrence of any type of research, that probably happens to every postdoc every week. Not even purely theoretical results are safe. If the simulation is stochastic, one might have been interpreting noise. If there is an analytical result, one might have made an odd number of sign errors. In fact, it is a distinguishing trait of science that when we try to learn new things, there are errors all the time.

If that previous result is something that has been published, circulated to peers, and interpreted as if it was a useful finding, then that is unfortunate. The hypothetical you should probably make some effort to figure out why, and communicate that to peers. But it seems like a bad idea to suggest that because there was an error, you’re not doing science.

You personally can’t reproduce a result because you don’t have the expertise or resources …

Science takes a lot of skill and a lot of specialised technical stuff. I probably can’t reproduce even a simple organic chemistry experiment. In fact, it is a distinguishing trait of science that almost no-one can reproduce any of it, because it takes both expertise and special equipment.

No-one can ever reproduce a certain result even in principle …

It might still be science. The 1918 influenza epidemic will by the nature of time never happen again. Still, there is science about it.

You can’t reproduce someone else’s results when you try with a reasonably similar setup …

Of course, this is what the original authors of the sentence meant. When this turns out to be a common occurrence, as people systematically try to reproduce findings, there is clearly something wrong with the research methods scientists use: The original report may be the outcome of a meandering process of researcher degrees of freedom that produced a striking result that is unlikely to happen when the procedure is repeated, even with high fidelity. However, I would say that we’re dealing with bad science, rather than non-science. Reproducibility is not a demarcation criterion.

(Note: Some people reserve ”reproducibility” for the computational reproducibility of re-running someone’s analysis code and getting the same results. This was not the case with the sentence quoted above.)

Interpreting genome scans, with wisdom

Eric Fauman is a scientist at Pfizer who also tweets out interpretations of genome-wide association scans.

Background: There is a GWASbot twitter account which posts Manhattan plots with links for various traits from the UK Biobank. The bot was made by the Genetic Epidemiology lab at the Finnish Institute for Molecular Medicine and Harvard. The source of the results is these genome scans (probably; it’s a little bit opaque); the bot also links to heritability and genetic correlation databases. There is also an EnrichrBot that replies with enrichment of chromatin marks (Chen et al. 2013). Fauman comments on some of the genome scans on his Twitter account.

Here are a couple of recent ones:

And here is his list of these threads as a Google Document.

This makes me think of three things, two good, and one bad.

1. The ephemeral nature of genome scans

Isn’t it great that we’re now at a stage where a genome scan can be something to be tweeted or put en masse in a database, instead of published one paper per scan with lots of boilerplate? The researchers behind the genome scans say as much in their 2017 blog post on the first release:

To further enhance the value of this resource, we have performed a basic association test on ~337,000 unrelated individuals of British ancestry for over 2,000 of the available phenotypes. We’re making these results available for browsing through several portals, including the Global Biobank Engine where they will appear soon. They are also available for download here.

We have decided not to write a scientific article for publication based on these analyses. Rather, we have described the data processing in a detailed blog post linked to the underlying code repositories. The decision to eschew scientific publication for the basic association analysis is rooted in our view that we will continue to work on and analyze these data and, as a result, writing a paper would not reflect the current state of the scientific work we are performing. Our goal here is to make these results available as quickly as possible, for any geneticist, biologist or curious citizen to explore. This is not to suggest that we will not write any papers on these data, but rather only write papers for those activities that involve novel method development or more complex analytic approaches. A univariate genome-wide association analysis is now a relatively well-established activity, and while the scale of this is a bit grander than before, that in and of itself is a relatively perfunctory activity. [emphasis mine] Simply put, let the data be free.

That being said, when I started writing this post, I at first missed having a paper. It was pretty frustrating trying to find a detailed description of the methods: after circling back and forth between the different pages that link to each other, I landed on the original methods post, which is informative and written in a light conversational style. On the internet, one would fear that these links may rot and die eventually, and a paper would probably (but not necessarily …) be longer-lasting.

2. Everything is a genome scan, if you’re brave enough

Another thing that the GWASbot drives home is that you can map anything that you can measure. The results are not always straightforward. On the other hand, even if the trait in question seems a bit silly, the results are not necessarily nonsense either.

There is a risk, for geneticists and non-geneticists alike, to reify traits based on their genetic parameters. If we can measure the heritability coefficient of something, and localise it in the genome with a genome-wide association study, it better be a real and important thing, right? No. The truth is that geneticists choose traits to measure the same way all researchers choose things to measure. Sometimes for great reasons with serious validation and considerations about usefulness. Sometimes just because. The GWASbot also helpfully links to the UK Biobank website that describes the traits.

Look at that bread intake genome scan above. Here, ”bread intake” is the self-reported number of slices of bread eaten per week, as entered by participants on a touch screen questionnaire at a UK Biobank assessment centre. I think we can be sure that this number doesn’t reveal any particularly deep truth about bread and its significance to humanity. It’s a limited, noisy, context-bound number measured, I bet, because once you ask a battery of lifestyle questions, you’ll ask about bread too. Still, the strongest association is at a region that contains olfactory receptor genes and also shows up in two other scans about food (fruit and ice cream). The bread intake scan hits upon a nugget of genetic knowledge about human food preference. A small, local truth, but still.

Now replace bread intake with some more socially relevant trait, also imperfectly measured.

3. Lost, like tweets in rain

Genome scan interpretation is just that: interpretation. It means pulling together quantitative data, a knowledge of biology, previous literature, and writing an unstructured text, such as a Discussion section or a Twitter thread. This makes interpretations harder to organise, store and build on than the genome scans themselves. Sure, Fauman’s Twitter threads are linked from the above Google Document, and our Discussion sections are available from the library. But they’re spread out in different places, they mix (as they should) evidence with evaluation and speculation, and it’s not like we have a structured vocabulary for describing genetic mechanisms of quantitative trait loci, and the levels of evidence for them. Maybe we could, with genome-wide association study ontologies and wikis.

You’re not funny, but even if you were

Here is a kind of humour that is all too common in scientific communication; I’ll just show you the caricature, and I think you’ll recognize the shape of it:

Some slogan about how a married man is a slave or a prisoner kneeling and holding a credit card. Some joke where the denouement relies on: the perception that blondes are dumb, male preference for breast size, perceived associations between promiscuity and nationality, or anything involving genital size. Pretty much any one-panel cartoon taken from the Internet.

Should you find any of this in your own talk, here is a message to you: That may be funny to you; that isn’t the problem. To a fair number of the people who are listening, it’s likely to be trite, sad and annoying.

Humour totally has a place in academic speech and writing—probably more than one place. There is the laughter that is there to relieve tension. That is okay sometimes. There are jokes that are obviously put-downs. Those are probably only a good idea in private company, or in public forums where the object of derision is powerful enough that you’re not punching down, but powerless enough to not punch you back. Say, the ever-revered and long dead founder of your field—they may deserve a potshot at their bad manners and despicable views on eugenics.

Then there is that elusive ‘sudden perception of the incongruity between a concept and the real objects which have been thought through it in some relation’ (Schopenhauer, quoted in Stanford Encyclopedia of Philosophy). When humour is used right, a serious lecturer talking about serious issues has all kinds of opportunities to amuse the listener with incongruities between what we expect of things and what they really are like. So please don’t reveal yourself to be predictably trite.

Sequencing-based methods called Dart

Some years ago James Hadfield at Enseqlopedia made a spreadsheet of acronyms for sequencing-based methods with some 50 rows. I can only imagine how long it would be today.

The overloading of acronyms is becoming a bit ridiculous. I recently saw a paper about DART-seq, a method for detecting N6-methyladenosine in RNA (Meyer 2019), and thought, ”wait a minute, isn’t DART-seq a reduced representation genotyping method?” It is, only stylised as DArTseq (seriously). Apparently, it’s also a droplet RNA-sequencing method (Saikia et al. 2018).

What are these methods doing?

  • DArT, diversity array technology, is a way to enrich for a part of a genome. It was originally developed with array technology in mind (Jaccoud et al. 2001). They take some DNA, cut it with restriction enzymes, add adapters and amplify regions close to the cut. Then they clone the resulting DNA, and then attach it to a slide, and that gives a custom microarray of anonymous fragments from the genome. For the DArTseq version, it seems they make a sequencing library instead of going on to cloning (Ren et al. 2015). It falls in the same family as GBS and RAD-seq methods.
  • DART-seq, droplet-assisted RNA targeting, builds on Drop-seq, where they put single cells and beads that carry primers into the same oil droplet. As cells lyse, the RNA sticks to the primer. The beads also have a barcode so they can be identified in sequencing. Then they break the emulsion, reverse transcribe the RNA attached to beads, amplify and sequence. That is cool. However, because they capture the RNA with oligo-dT primers, they sequence from the 3′ end of the RNA. The Dart method adds primers to the beads, so they can target some specific RNAs and amplify more of them. It’s the super-high-tech version of gene-specific primers for reverse transcription.
  • DART-seq, deamination adjacent to RNA modification targets, uses a synthetic fusion protein that combines APOBEC1, which deaminates cytidines, with a protein domain from YTHDF2 which binds N6-methyladenosine. If an RNA has N6-methyladenosine, cytidines that are close to it, as is usually the case with this base modification, will be deaminated to uracil. After RNA-sequencing, this will look like Cs next to As turning into Ts. Neat! It’s a little bit like bisulfite sequencing of methylated DNA, but with RNA.
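To make that last read-out concrete, here is a minimal Python sketch of what ”Cs next to As turning into Ts” could look like in code. It is a toy under loud assumptions, not the pipeline from the paper: the function name c_to_t_sites, the window parameter and the example sequences are all made up for illustration, and the read is assumed to be already aligned base-for-base to the reference, with no indels.

```python
def c_to_t_sites(reference, read, window=1):
    """List reference Cs that are read as T, and flag those with an A nearby.

    Hypothetical helper for illustration only. Assumes `read` is already
    aligned to `reference` position by position (same length, no indels).
    Returns a list of (position, has_nearby_A) tuples, 0-based positions.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have the same length")
    sites = []
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base == "C" and read_base == "T":
            # Reference bases within `window` positions on either side of the C
            start = max(0, i - window)
            neighbourhood = reference[start:i] + reference[i + 1:i + 1 + window]
            sites.append((i, "A" in neighbourhood))
    return sites


# Made-up example: the C at position 3 sits right after an A and is read as T
reference = "GGACTTCGA"
read = "GGATTTCGA"
print(c_to_t_sites(reference, read))
```

A real analysis would start from aligned reads, need many reads per site, and have to tell editing apart from SNPs and sequencing errors; the sketch only shows the shape of the signal.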

On the one hand: Don’t people search the internet before they name their methods, or do they not care? On the other hand, realistically, the genotyping method Dart and the single cell RNA-seq method Dart are unlikely to show up in the same work. If you can call your groups ”treatment” and ”control” for the purpose of a paper, maybe you can call your method ”Dart”, and no-one gets too confused.