The next notebook of work

Dear diary,

The last post was about my attempt to use the Getting Things Done method to bring some more order to research, work, and everything. This post will contain some more details about my system, at a little less than a year into the process, on the off chance that anyone wants to know. This post will use some Getting Things Done jargon without explaining it. There are many useful guides online, plus of course the book itself.

Medium

Most of my system lives in paper notebooks. The main notebook contains my action list, projects list, waiting-for list and agendas, plus a section for notes. I quickly learned that the someday/maybe lists won’t fit, so I now have a separate (bigger) notebook for those. My calendar is digital. I also use a note-taking app for project support material, and as an extra inbox for notes I jot down on my phone. Thus, I guess it’s a paper/digital hybrid.

Contexts

I have five contexts: email/messaging, work computer, writing, office and home. There were more in the beginning, but I gradually took out the ones I didn’t use. They need to be few enough and map cleanly to situations, so that I remember to look at them. I added the writing context because I tend to treat, and schedule, writing tasks separately from other work tasks. The writing context also includes writing-adjacent support tasks such as updating figures, going through reviewer comments or searching for references.

Inboxes

I have a total of nine inboxes, if you include all the email accounts and messenger services where people might contact me about things I need to do. That sounds excessive, but only three of those are where I put things for myself (physical inbox, notes section of notebook, and notes app), and so far they’re all getting checked regularly.

Capture

I do most of my capture in the notes app on my phone (when not at a desk) or on a piece of paper (when at my desk). When I get back to having in-person meetings, I assume more notes will end up in the physical notebook, because it’s nicer to take meeting notes on paper than on a phone.

Agendas

The biggest thing I changed in the new notebook was to dedicate much more space to agendas, but it’s already almost full! It turns out there are lots of things I should talk to X about the next time we’re speaking, rather than emailing X about them immediately. Who knew?

Waiting for

This is probably my favourite. It is useful to have a list of who has said they will get back to me, when, and about what. That little date next to their name helps me not feel like a nag when I ask again after a reasonable time, and makes me appreciate them more when they respond quickly.

Weekly review

I already had the habit of scheduling an appointment with myself on Fridays (or otherwise towards the end of the week) to go over some recurring items. I’ve expanded this appointment to do a weekly review of the notebook, calendar, someday/maybe list, and some other bespoke checklist items. I bribe myself with sweets to support this habit.

Things I’d like to improve

Here are some of the things I want to improve:

  • The project list. A project sensu Getting Things Done can be anything from purchasing new shoes to taking over the world. The project list is supposed to keep track of everything you’ve undertaken to do, and make sure you come up with actions that move each project forward. My project list isn’t very complete, and doesn’t spark new actions very often.
  • Project backlogs. On the other hand, I have some things on the project list that are projects in a greater sense, and will involve literally thousands of actions, from me and others. These obviously need planning beyond the next thing to do. I haven’t yet figured out the best way to keep a backlog of future things to do in a project, potentially with dependencies, and feed them into my action list as they become current.
  • Notes. I have a strong note taking habit, but a weak note reading habit. Essentially, many of my notes are write-only; this feels like a waste. I’ve started my attempts to improve the situation with meeting notes: trying to take five minutes right after a meeting (if possible) to go over the notes, extract any calendar items, actions and waiting-fors, and decide whether I need to save the note or if I can throw it away. What to do about research notes from reading and from seminars is another matter.

One notebook’s worth of work

Image: an Aviagen-sponsored notebook from the 100 Years of Genetics meeting in Edinburgh, with Post-it notes sticking out, next to a blue Ballograf pen

Dear diary,

”If I could just spend more time doing stuff instead of worrying about it …” (Me, at several points over the years.)

I started this notebook in spring last year and recently filled it up. It contains my first implementation of the system called ”Getting Things Done” (see David Allen’s book of the same name). Let me tell you a little about how it’s going.

The way I organised my work, with to-do lists, a calendar, a work journal, and routines for dealing with email, had pretty much grown organically up until the beginning of this year. I’d gotten some advice, and I’d read the odd blog post and column about email and calendar blocking, but beyond some courses in project management (which are a topic for another day), I’d had very little instruction on how to do any of this. How does one actually keep a good to-do list? Are there principles and best practices? I was aware that Getting Things Done was a thing, and last spring, a mention in passing on the Teaching in Higher Ed podcast prompted me to give it a try.

I read up a little. The book was right there in the university library, unsurprisingly. I also used a blog post by Alberto Taiuti about doing Getting Things Done in a notebook, and read some other writing by researchers about how they use the method (Robert Talbert and Veronika Cheplygina).

There is enough out there about this already that I won’t make my own attempt to explain the method in full, but here are some of the interesting particulars:

You are supposed to be careful about how you organise your to-do lists. You’re supposed to make sure everything on the list is a clear, unambiguous next action that you can start doing when you see it. Everything else that needs thinking, deciding, mulling over, reflecting etc., goes somewhere else, not on your list of things to do. This means that you can easily pick something off your list and start work on it.

You are supposed to be careful about your calendar. You’re supposed to only put things in there that have a fixed date and time attached, not random reminders or aspirational scheduling of things you would like to do. This means that you can easily look at your calendar and know what your day, week and month look like.

You are supposed to be careful to record everything you think about that matters. You’re supposed to take a note as soon as you have a potentially important thought and put it in a dedicated place that you will check and go through regularly. This means that you don’t have to keep things in your head.

This sounds pretty straightforward, doesn’t it? Well, despite having had to-do lists, calendars and a note-taking habit for years, I’ve not been very disciplined about any of this before. My to-do list items have often been vague, too-big tasks that are hard to get started on. My calendar has often contained aspirational planning entries that didn’t survive contact with the realities of the workday. I have often deluded myself that I’ll remember an idea or a decision, only to have it quietly slip out of my mind.

Have I become more productive, or less stressed? The honest answer is that I don’t know. I don’t have a reliable way to track either productivity or stress levels, and even if I did: the last year has not really been comparable to the year before, for several reasons. However, I feel like thinking more about how I organise my work makes a difference, and I’ve felt a certain joy working on the process, as well as a certain dread when looking at it all organised in one place. Let’s keep going and see where this takes us.

Against question and answer time

Here is a semi-serious suggestion: Let’s do away with questions and answers after talks.

I’ll preface with two examples:

First, a scientist I respect highly had just given a talk. As we were chatting away afterwards, I referred to someone who had asked a question during the talk. The answer: ”I didn’t pay attention. I don’t listen when people talk at me like that.”

Second, Swedish author Göran Hägg had this little joke about question and answer time. I paraphrase from memory: Question time is useless because no reasonable person who has a useful contribution will be socially uninhibited enough to ask a question in a public forum (at least not in Sweden). To phrase it more nicely: Having a useful contribution and feeling comfortable to speak up might not be that well correlated.

I have two intuitions about this. On the one hand, there’s the idea that science thrives on vigorous criticism. I have been at talks where people bounce questions at the speaker, even during the talk and even with pretty serious criticisms, and it works just fine. I presume that has to do with respect, with skill at asking and answering, and with the power and knowledge differentials between the interlocutors.

On the other hand, we would prefer to have good conversations and productive arguments, and I’m sure everyone has been in seminar rooms where that wasn’t the case. It’s not a good conversation if, say, questions and answers turn into old established guys (sic) shouting down students. In some cases, it seems the asker is not after a productive argument, nor indeed any honest attempt to answer the question. (You might be able to tell by them barking a new question before the respondent has finished.)

Personally, I’ve turned to asking fewer questions. If it’s something I’ve misunderstood, it’s unlikely that I will get the explanation I need without conversation and interaction. If I have a criticism, it’s unlikely that I will get the best possible answer from the speaker on the spot. If I didn’t like the seminar, am upset with the speaker’s advisor, hate it when people mangle the definition of ”epigenetics” or when someone shows a cartoon of left-handed DNA, it’s my problem and not something I need to share with the audience.

I think question and answer time is one thing that actually has benefited from the move to remote digital seminars, where questions are often written in chat. This might be because of a difference in tone between writing a question down and asking it out loud, or thanks to the filtering capabilities of moderators.

Various positions II

Again, what good is a blog if you can’t post your arbitrary idiosyncratic opinions as if you were an authority?

Don’t make a conference app

I get it, you can’t print a full-blown paper program book: it is too much, no one reads it, and it feels wasteful. But please, please, for the love of everything holy, don’t make an app. Put the text, straight up, on a website in plain text. It loads quickly, it’s searchable, and it can be generated automatically. The conference app will be clunky, take up space on the phone, eat bandwidth on some strained mobile contract, and invariably freeze.

Posters, still bad in 2020

Don’t believe the lies: a once-folded canvas poster will never look good again. You haven’t had fun at a conference until you’ve tried ironing a poster on a hostel floor with an iron that belongs in a museum.

Poster sessions are bad by necessity. If they had the space and time to be anything other than a crowded mess, the conference would have to accept substantially fewer posters. That would mean fewer participants, probably especially earlier-career participants, and the value of having them there outweighs the value of a somewhat better poster session.

Gene accession numbers

PLOS Genetics has a great policy in their submission guidelines that doesn’t seem to get followed very much in papers they actually publish. This should be the norm in every genetics paper. I feel bad that it’s not the case in all my papers.

As much as possible, please provide accession numbers or identifiers for all entities such as genes, proteins, mutants, diseases, etc., for which there is an entry in a public database, for example:

  • Ensembl
  • Entrez Gene
  • FlyBase
  • InterPro
  • Mouse Genome Database (MGD)
  • Online Mendelian Inheritance in Man (OMIM)
  • PubChem

Identifiers should be provided in parentheses after the entity on first use.
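
To make the policy concrete (my example, not theirs): on first mention of a gene, you would write something like ”the tumour suppressor PTEN (Ensembl: ENSG00000171862)”.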

In the future, with the right ontologies and repositories in place, I hope this will be the case with traits, methods and so on as well.

UK Biobank and dbGaP are not open data

And that is fine.

Stop it with the work-life balance tweets

No-one should tweet about work-life balance; whether you write about how much you work or how diligent you are about your hours, it comes off as bragging.

Tenses

Write your papers in the past or present tense, whichever you prefer. In the context of a scientific paper, the difference between past and present communicates nothing. I suppose you’re not supposed to mix tenses, but that doesn’t matter either. Most readers probably won’t notice. If you ask me about my stylistic opinion: present tense for everything. But again, it doesn’t matter.

A partial success

In 2010, Poliseno & co published some results on the regulation of a gene by a transcript from a pseudogene. Now, Kerwin & co have published a replication study, the protocol for which came out in 2015 (Khan et al). An editor summarises it like this in an accompanying commentary (Calin 2020):

The partial success of a study to reproduce experiments that linked pseudogenes and cancer proves that understanding RNA networks is more complicated than expected.

I guess he means ”partial success” in the sense that they partially succeeded in performing the replication experiments they wanted. These experiments did not reproduce the gene regulation results from 2010.

Seen from the outside — I have no insight into what is going on here or who the people involved are — something is not working. If it takes five years to go from paper to replication effort, and then another five years to a published replication study accompanied by an editorial commentary that subtly undermines it, we can’t expect replication studies to update the literature, can we?

Communication

What’s the moral of the story, according to Calin?

What are the take-home messages from this Replication Study? One is the importance of fruitful communication between the laboratory that did the initial experiments and the lab trying to repeat them. The lack of such communication – which should extend to the exchange of protocols and reagents – was the reason why the experiments involving microRNAs could not be reproduced. The original paper did not give catalogue numbers for these reagents, so the wrong microRNA reagents were used in the Replication Study. The introduction of reporting standards at many journals means that this is less likely to be an issue for more recent papers.

There is something right and something wrong about this. On the one hand, talking to your colleagues in the field obviously makes life easier. We would like researchers to put all pertinent information in writing, and we would like there to be good communication channels for when the information turns out not to be what the reader needed. On the other hand, we don’t want science to be esoteric. We would like experiments to be reproducible without special artifacts or secret sauce. If nothing else, because people’s time and willingness to provide tech support for their old papers might be limited. Of course, this is hard, in a world where the reproducibility of an experiment might depend on the length of digestion (Hines et al. 2014) or that little plastic thingamajig you need for the washing step.

Another take-home message is that it is finally time for the research community to make raw data obtained with quantitative real-time PCR openly available for papers that rely on such data. This would be of great benefit to any group exploring the expression of the same gene/pseudogene/non-coding RNA in the same cell line or tissue type.

This is true. You know how doctored, or just poor, Western blots are a notorious issue in the literature? I don’t think that’s because the Western blot is an exceptionally bad technique, but because there is a culture of showing the raw data (the gel), so people can notice problems. However, even if I’m all for showing real-time PCR amplification curves (as well as melting curves, standard curves, and the actual batch and plate information from the runs), I doubt that it will be possible to troubleshoot PCR retrospectively from those curves. Maybe sometimes one would be able to spot a PCR that looks iffy, but beyond that, I’m not sure what we would learn. PCR issues are likely to have to do with subtle things like primer design, reaction conditions and handling that can only really be tackled in the lab.

The world is messy, alright

Both the commentary and the replication study (Kerwin et al 2020) are cautious when presenting their results. I think it reads as if the authors themselves either don’t truly believe their failure to replicate or are bending over backwards to acknowledge everything that could have gone wrong.

The original study reported that overexpression of PTEN 3’UTR increased PTENP1 levels in DU145 cells (Figure 4A), whereas the Replication Study reports that it does not. …

However, the original study and the Replication Study both found that overexpression of PTEN 3’UTR led to a statistically significant decrease in the proliferation of DU145 cells compared to controls.

In the original study Poliseno et al. reported that two microRNAs – miR-19b and miR-20a – suppress the transcription of both PTEN and PTENP1 in DU145 prostate cancer cells (Figure 1D), and that the depletion of PTEN or PTENP1 led to a statistically significant reduction in the corresponding pseudogene or gene (Figure 2G). Neither of these effects were seen in the Replication Study. There are many possible explanations for this. For example, although both studies used DU145 prostate cancer cells, they did not come from the same batch, so there could be significant genetic differences between them: see Andor et al. (2020) for more on cell lines acquiring mutations during cell cultures. Furthermore, one of the techniques used in both studies – quantitative real-time PCR – depends strongly on the reagents and operating procedures used in the experiments. Indeed, there are no widely accepted standard operating procedures for this technique, despite over a decade of efforts to establish such procedures (Willems et al., 2008; Schwarzenbach et al., 2015).

That is, both the commentary and the replication study seem to subscribe to a view of the world where biology is so rich and complex that both results might be right, conditional on unobserved moderating variables. This is true, but it throws us into a discussion of generalisability. If a result only holds in some genotypes of DU145 prostate cancer cells, which might very well be the case, does it generalise enough to be useful for cancer research?

Power underwhelming

There is another possible view of the world, though … Indeed, biology is rich and complicated, but in the absence of accurate estimates, we don’t know which of all these potential moderating variables actually do anything. The first order of business, before we start imagining scenarios that might explain the discrepancy, is to get a really good estimate of it. How do we do that? It’s hard, but how about starting with a sample size per cell greater than N = 5?

The registered report contains power calculations, which is commendable. As far as I can see, it does not describe how they arrived at the assumed effect sizes. Power estimates for a study design depend on the assumed effect sizes, and small studies tend to exaggerate effect sizes (because, in a small sample, only large estimates reach significance). This means that taking the published estimates as starting effect sizes might leave you with a design that is still unable to detect a true effect of reasonable size.
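
To illustrate, here is a minimal simulation with made-up numbers (not taken from the registered report), assuming a true effect of half a standard deviation and five samples per group:

    # Simulate many small two-group experiments and keep only the
    # effect estimates that happen to reach significance at n = 5.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_effect, n = 0.5, 5

    significant = []
    for _ in range(10_000):
        control = rng.normal(0, 1, n)
        treated = rng.normal(true_effect, 1, n)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            significant.append(treated.mean() - control.mean())

    # The significant estimates average roughly three times the true
    # effect; a power calculation that takes them at face value will
    # recommend far too few samples.
    print(np.mean(significant))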

I don’t know what effect sizes one should expect in these kinds of experiments, but even if you think you can get good power with a handful of samples per cell, can’t you please run a couple more? We are all limited by resources and time, but if you’re running something like a qPCR, the cost per sample must be much smaller than the cost of doing one run of the experiment in the first place. It’s really not as simple as adding one row on a plate, but almost.

Literature

Calin, George A. ”Reproducibility in Cancer Biology: Pseudogenes, RNAs and new reproducibility norms.” eLife 9 (2020): e56397.

Hines, William C., et al. ”Sorting out the FACS: a devil in the details.” Cell Reports 6.5 (2014): 779-781.

Kerwin, John, and Israr Khan. ”Replication Study: A coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” eLife 9 (2020): e51019.

Khan, Israr, et al. ”Registered report: a coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” eLife 4 (2015): e08245.

Poliseno, Laura, et al. ”A coding-independent function of gene and pseudogene mRNAs regulates tumour biology.” Nature 465.7301 (2010): 1033-1038.

Better posters are nice, but we need better poster session experiences

Fear and loathing in the conference centre lobby

Let me start from a negative place, because my attitude to poster sessions is negative. Poster sessions are neither good ways to communicate science, nor to network at conferences. Moreover, they are unpleasant.

The experience of going to a poster session, as an attendee or a presenter, goes something like this: you have to stand in a crowded room that is too loud and try to either read technical language or hold a conversation about a difficult topic. Even without anxiety, mobility or hearing difficulties, a poster session is unlikely to be enjoyable or efficient.

Poster sessions are bad because of necessities of conference organisation. We want to invite many people, but we can’t fit in many talks; we get crowded poster sessions.

They are made worse by efforts to make them better, such as mandating that presenters stand by their posters, in some cases on pain of sanctions from the organisers, or having the poster presenters act as dispensers of alcohol. If you need to threaten or drug people into participating in an activity, that might be a sign.

They are made not worse, but a bit silly, by assertions that poster sessions are of the utmost importance to conferencing. Merely stating that the poster session is vibrant and inspiring, or that you want to emphasise the poster as an important form of communication, sadly does not make it so, if the poster sessions are still business as usual.

Mike Morrison’s ”Better Scientific Poster” design

As you can see above, my diagnosis of the poster session problem is partly that you’re forced to read walls of text or listen to mini-lectures, and partly that it happens in an overcrowded space. The walls of text and the mini-lectures might be improved by poster design.

Enter the Better Scientific Poster. I suggest clicking on that link and looking at the poster templates directly. I waited too long to look at the actual template files, because I expected a bunch of confusing designer stuff. It’s not like that: the templates contain their own documentation and examples.

There is also a video on YouTube expanding on the thinking behind the design, but I think this conversation on the Everything Hertz podcast is the best introduction, if you need an introduction beyond the template itself. The YouTube video doesn’t go into enough detail, and it is also a bit outdated; the poster template has gone through improvements since.

If you want to hear the criticisms of the design, here’s a blog post summarising some of them. In short: it is unscientific and intellectually arrogant to put a take-home message in too large a font, and it would be boring if all posters used the same template. Okay.

The caveats

I am not a designer, which should be abundantly clear to everyone. I don’t really know what good graphic design principles for a poster are.

There is also no way to satisfy everyone. Some people will think you’ve put too little on the poster unless it ”tells the full story” and has a self-contained description of the methods with all caveats. Some people, like me, will think you’ve put way too much on it long before that.

What I like, however, is that Morrison’s design is based on an analysis of the poster session experience that aligns with mine, and that it is based on a goal for the poster that makes sense. The features of the design flow from that goal. If you listen to the video or the Hertz episode: Morrison has thought about the purpose of the poster.

He’s not just expressing some wisdom his PhD supervisor told him in a stern voice, or what his gut feeling tells him, which I suspect are the two sources that scientists’ advice on communication is usually based on. We all think that poster sessions are bad, because we’ve been to poster sessions. We usually don’t have thought-through ideas about how to do better.

Back to a place of negativity

For those reasons, I think the better poster is likely to be an improvement. I was surprised that I didn’t see it sweep through the poster sessions at the conferences I went to last summer, but there were a few. I was going to try it for TAGC 2020 (here is my poster about the genetics of recombination rate in the pig), but that meeting moved online, which made poster presentations a little different.

However, changing the poster layout can only get you so far. Unless someone has a stroke of genius that improves the poster viewing experience or changes the economics of poster attendance, there is no bright future for the poster session. Individually, the rational course of action isn’t to fiddle with the design and spend time squeezing marginal improvements out of our posters. It is to spend as little time as possible on posters, ignoring our colleagues’ helpful advice on how to make them prettier and more scientific, and lowering our expectations.

Virtual animal breeding journal club: ”Structural equation models to disentangle the biological relationship between microbiota and complex traits …”

The other day was the first Virtual breeding and genetics journal club organised by John Cole. This was the first online journal club I’ve attended (shocking, given how many video calls I’ve been on for other sciencey reasons), so I thought I’d write a little about it: both the format and the paper. You can find the slide deck from the journal club here (pptx file).

The medium

We used Zoom, and that seemed to work, as I’m sure anything else would, as long as everyone mutes their microphone when they aren’t speaking. As John said, the key feature of Zoom seems to be the ability for the host to mute everyone else. During the call, I think we were at most 29 or so people, but only a handful spoke. The turn-taking will probably get more intense if more people want to speak.

The format

John started the journal club with a code of conduct, which I expect helped to set what I felt was a good atmosphere. In most journal clubs I’ve been in, I feel like the atmosphere has been pretty good, but I think we’ve all heard stories about hyper-critical and hostile journal clubs, and that doesn’t sound particularly fun or useful. On that note, one of the authors, Oscar González-Recio, was on the call and answered some questions.

The paper

Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

The authors measured methane emissions (by analysing breath with an infrared gas monitor) and the abundance of different microbes in the rumen (with Nanopore sequencing) in dairy cows. They also genotyped the animals for relatedness.

They analysed the genetic relationship between breath methane and abundance of each taxon of microbe, individually, with either:

  • a bivariate animal model;
  • a structural equations model that allows for a causal effect of abundance on methane, capturing the assumption that the abundance of a taxon can affect the methane emission, but not the other way around.

They used these models to estimate the heritabilities of the abundances and the genetic correlations between methane and the abundances, and, in the case of the structural model, the effect of the taxon’s abundance on methane, conditional on the assumed causal structure.
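
As a rough sketch of the difference between the models (my notation, not the paper’s), the recursive model adds a structural coefficient λ for the causal effect of abundance on methane:

    y_methane   = λ · y_abundance + fixed effects + u_1 + e_1
    y_abundance = fixed effects + u_2 + e_2

where the additive genetic effects (u_1, u_2) are multivariate normal with covariance G_0 ⊗ A, that is, the genetic covariance matrix combined with the relationship matrix from the genotypes. The bivariate animal model is the same system without the λ term, leaving the genetic and residual covariances to describe the association.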

My thoughts

It’s cool how there’s a literature building up on genetic influences on the microbiome, with some consistency across studies. These intense high-tech studies on relatively few cattle might build up to finding new traits and proxies that can go into larger scale phenotyping for breeding.

As the title suggests, the paper advocates for using the structural equations model: ”Genetic correlation estimates revealed differences according to the usage of non‐recursive and recursive models, with a more biologically supported result for the recursive model estimation.” (Conclusions)

While I agree that, a priori, it makes sense to assume a structural equations model with a causal structure, I don’t think the results provide much evidence that it’s better. The estimates of heritabilities and genetic correlations from the two models are nearly indistinguishable. Here is the key figure (Figure 4), comparing the genetic correlation estimates:

Image: Figure 4 from Saborío‐Montero et al. (2020), comparing genetic correlation estimates from the bivariate and recursive models

As you can see, there are a couple of examples of genetic correlations where the point estimate switches sign, and one of them (Succinivibrio sp.) where the credible intervals don’t overlap. ”Recursive” is the structural equations model. The error bars are 95% credible intervals. This is not strong evidence of anything; the authors are responsible about it and don’t go into interpreting this difference. But let us speculate! They write:

All genera in this case, excepting Succinivibrio sp. from the Proteobacteria phylum, resulted in overlapped genetic correlations between the non‐recursive bivariate model and the recursive model. However, high differences were observed. Succinivibrio sp. showed the largest disagreement changing from positively correlated (0.08) in the non‐recursive bivariate model to negatively correlated (−0.20) in the recursive model.

Succinivibrio is also the taxon with the largest estimated inhibitory effect on methane (from the structural equations model).

While some taxa, such as ciliate protozoa or Methanobrevibacter sp., increased the CH4 emissions …, others such as Succinivibrio sp. from Proteobacteria phylum decreased it

Looking at the paper that first described these bacteria (Bryant & Small 1956), Succinivibrio were originally isolated from the cattle rumen, and they are named for the fact that ”they ferment glucose with the production of a large amount of succinic acid”. Bryant & Small ran a fermentation experiment to see what came out, and it seems that the bacteria don’t produce methane:

Image: Table 2 from Bryant & Small (1956), listing fermentation products

This is also in line with an rRNA sequencing study of high and low methane emitting cows (Wallace et al. 2015), which found lower Succinivibrio abundance in high methane emitters.

We may speculate that Succinivibrio species could be involved in diverting energy from methanogens, and thus reducing methane emissions. If that is true, then the structural equations model estimate (a larger negative genetic correlation between Succinivibrio abundance and methane) might be better than the one from the animal model.

Finally, while I’m on board with the a priori argument for using a structural equations model, as with other applications of causal modelling (gene networks, Mendelian randomisation etc.), it might be dangerous to consider parts of the system in isolation when the microbes are likely to have causal effects on each other.

Literature

Saborío‐Montero, Alejandro, et al. ”Structural equation models to disentangle the biological relationship between microbiota and complex traits: Methane production in dairy cattle as a case of study.” Journal of Animal Breeding and Genetics 137.1 (2020): 36-48.

Wallace, R. John, et al. ”The rumen microbial metagenome associated with high methane production in cattle.” BMC Genomics 16.1 (2015): 839.

Bryant, Marvin P., and Nola Small. ”Characteristics of two new genera of anaerobic curved rods isolated from the rumen of cattle.” Journal of Bacteriology 72.1 (1956): 22.

Things that really don’t matter: megabase or megabasepair

Should we talk about physical distance in genetics as number of base pairs (kbp, Mbp, and so on) or bases (kb, Mb)?

I got into a discussion about this recently, and I said I’d continue the struggle on my blog. Here it is. Let me first say that I don’t think this matters at all, and if you make a big deal out of this (or whether ”data” can be singular, or any of those inconsequential matters of taste we argue about for amusement), you shouldn’t. See this blog post as an exorcism, helping me not to trouble my colleagues with my issues.

What I’m objecting to is mostly the inconsistency of talking about long stretches of nucleotides as ”kilobases” and ”megabases” but about short stretches as ”base pairs”. I don’t think it’s very common to call a 100 nucleotide stretch ”a 100 b sequence”; I would expect ”100 bp”. For example, Ensembl might describe a large region as 1 Mb, but if you zoom in a lot, it gives lengths in bp. My impression is that this is common practice. However, if you consistently use ”bases” and ”megabases”, more power to you.

Unless you’re writing a very specific kind of bioinformatics paper, the risk of confusion with the computer storage unit isn’t a problem. But there are some biological arguments.

A biological argument for ”base” might be that we care about the identity of the base, not the base pairing. We write down only one nucleotide per position when we write out a nucleic acid sequence. The base pair is a different thing: a base bound to its partner on the other strand, and if the DNA or RNA is single-stranded, the base isn’t paired at all.

Conversely, a biochemical argument for ”base pair” might be that in a double-stranded molecule, the base pair is the relevant informational unit. We may write down only one base of each pair for convenience, but because of the rules of base pairing, we know its complement. On this view, maybe we should reserve ”base” for single-stranded molecules.

If we consult two more or less trustworthy sources, the Encyclopedia of Life Sciences and Wiktionary, they both seem to take this view.

eLS says:

A megabase pair, abbreviated Mbp, is a unit of length of nucleic acids, equal to one million base pairs. The term ‘megabase’ (or Mb) is commonly used interchangeably, although strictly this would refer to a single-stranded nucleic acid.

Wiktionary says:

A length of nucleic acid containing one million nucleotides (bases if single-stranded, base pairs if double-stranded)

Please return next week for the correct pronunciation of ”loci”.

Literature

Dear, P.H. (2006). Megabase Pair (Mbp). eLS.

If research is learning, how should researchers learn?

I’m taking a course on university pedagogy to, hopefully, become a better teacher. While reading about students’ learning and what teachers ought to do to facilitate it, I couldn’t help thinking about researchers’ learning, and what we ought to do to give ourselves a good learning environment.

Research is, largely, learning. First, a large part of any research work is learning what is already known, just not by me in particular; it’s a direct continuation of the learning that takes place in courses. While doing any research project, we learn the concepts other researchers use in that specific sub-subfield, and the relations between them. First to the extent that we can orient ourselves, and eventually to be able to make a contribution that is intelligible to others who work there. We also learn their priorities, attitudes and platitudes. (Seriously, I suspect you learn a lot about a sub-subfield by trying to make jokes about it.) And we learn to do something new: perform a laboratory procedure, a calculation, or something like that.

But more importantly, research is learning about things no-one knows yet. The idea of constructivist learning theory seems apt: We are constructing new knowledge, building on pre-existing structures. We don’t go out and read the book of nature; we take the concepts and relations of our sub-subfield of choice, and graft, modify and rearrange them into our new model of the subject.

If there is something to this, it means that old clichéd phrases like ”institution of higher learning”, or scientists as ”students of X”, name a deeper analogy than it might seem. It also suggests that innovations in student learning might be good building blocks for research group management. Should we be concept mapping with our colleagues to figure out where we disagree about the definition of ”developmental pleiotropy”? It also makes one wonder why meetings and departmental seminars so often take the form of sage-on-the-stage lectures.

Two distinguishing traits of science are that there are errors all the time and that almost no-one can reproduce anything

I got annoyed and tweeted:

”If you can’t reproduce a result, it isn’t science” … so we’re at that stage now, when we write things that sound righteous but are nonsense.

Hashtag subtweet, I guess. But it doesn’t matter who first wrote the sentence I was complaining about; they won’t care what I think, and I’m not out to debate them. I only think the quoted sentence makes sense if you take ”science” to mean ”the truth”. The relationship between science and reproducibility is messier than that.

The first clause could mean a few different things:

You have previously produced a result, but now, you can’t reproduce it when you try …

Then you might have done something wrong the first time, or the second time. This is an everyday occurrence in any type of research, one that probably happens to every postdoc every week. Not even purely theoretical results are safe. If the simulation is stochastic, one might have been interpreting noise. If there is an analytical result, one might have made an odd number of sign errors. In fact, it is a distinguishing trait of science that when we try to learn new things, there are errors all the time.
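
As a toy example of the stochastic simulation case (mine, invented for illustration):

    # Two simulated "treatments" that are identical in truth can look
    # different in any single run; change the seed and the apparent
    # difference comes and goes.
    import numpy as np

    def simulated_difference(seed, n=10):
        rng = np.random.default_rng(seed)
        a = rng.normal(0, 1, n)
        b = rng.normal(0, 1, n)  # same distribution as a
        return a.mean() - b.mean()

    print(simulated_difference(seed=1))
    print(simulated_difference(seed=2))
    # With n = 10 per group, differences of a few tenths of a standard
    # deviation are expected from noise alone.

Whether the first run ”reproduces” here depends on nothing more than the seed and the number of replicates.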

If that previous result is something that has been published, circulated to peers, and interpreted as if it was a useful finding, then that is unfortunate. The hypothetical you should probably make some effort to figure out why, and communicate that to peers. But it seems like a bad idea to suggest that because there was an error, you’re not doing science.

You personally can’t reproduce a results because you don’t have the expertise or resources …

Science takes a lot of skill and a lot of specialised technical stuff. I probably can’t reproduce even a simple organic chemistry experiment. In fact, it is a distinguishing trait of science that almost no-one can reproduce any of it, because it takes both expertise and special equipment.

No-one can ever reproduce a certain result even in principle …

It might still be science. The 1918 influenza epidemic will by the nature of time never happen again. Still, there is science about it.

You can’t reproduce someone else’s results when you try with a reasonably similar setup …

Of course, this is what the original authors of the sentence meant. When this turns out to be a common occurrence, as people systematically try to reproduce findings, there is clearly something wrong with the research methods scientists use: The original report may be the outcome of a meandering process of researcher degrees of freedom that produced a striking result that is unlikely to happen when the procedure is repeated, even with high fidelity. However, I would say that we’re dealing with bad science, rather than non-science. Reproducibility is not a demarcation criterion.

(Note: Some people reserve ”reproducibility” for the computational reproducibility of re-running someone’s analysis code and getting the same results. This was not the case with the sentence quoted above.)