Convincing myself about the Monty Hall problem

Like many others, I’ve never felt that the solution to the Monty Hall problem was intuitive, despite the fact that explanations of the correct solution are everywhere. I am not alone. Famously, columnist Marilyn vos Savant got droves of mail from people trying to school her after she had published the correct solution.

The problem goes like this: You are a contestant on a game show (based on a real game show hosted by Monty Hall, hence the name). The host presents you with three doors, one of which hides a prize (say, a goat), while the others are empty. After you’ve made your choice, the host opens one of the other doors, showing that it is empty. You are now asked whether you would like to stick to your initial choice, or switch to the remaining closed door. The right thing to do is to switch, which gives you a 2/3 probability of winning the goat. This can be demonstrated in a few different ways.

A goat is a great prize. Image: Casey Goat by Pete Markham (CC BY-SA 2.0)

So I sat down to do 20 physical Monty Hall simulations on paper. I shuffled three cards with the options, picked one, and then, playing the role of the host, took away one losing option, and noted down if switching or holding on to the first choice would have been the right thing to do. The results came out 13 out of 20 (65%) wins for the switching strategy, and 7 out of 20 (35%) for the holding strategy. Of course, the Monty Hall Truthers out there must question whether this demonstration in fact happened — it’s too perfect, isn’t it?

The outcome of the simulation is less important than the feeling that came over me as I was running it, though. As I was taking on the role of the host and preparing to take away one of the losing options, it started feeling self-evident that the important thing is whether the first choice is right. If the first choice is right, holding is the right strategy. If the first choice is wrong, switching is the right option. And the first choice, clearly, is only right 1/3 of the time.

In this case, it was helpful to take the game show host’s perspective. Selvin (1975) discussed the solution to the problem in The American Statistician, and included a quote from Monty Hall himself:

Monty Hall wrote and expressed that he was not ”a student of statistics problems” but ”the big hole in your argument is that once the first box is seen to be empty, the contestant cannot exchange his box.” He continues to say, ”Oh, and incidentally, after one [box] is seen to be empty, his chances are no longer 50/50 but remain what they were in the first place, one out of three. It just seems to the contestant that one box having been eliminated, he stands a better chance. Not so.” I could not have said it better myself.

A generalised problem

Now, imagine the same problem with d doors, w prizes and o losing doors that are opened after the first choice is made. We assume that the losing doors are opened at random, and that switching entails picking one of the remaining doors at random. What is the probability of winning with the switching strategy?

The probabilities of picking a door with or without a prize on the first try are:

\Pr(\text{pick right first}) = \frac{w}{d}

\Pr(\text{pick wrong first}) = 1 - \frac{w}{d}

If we picked a right door first, we have w – 1 winning options left out of d – o – 1 doors after the host opens o doors:

\Pr(\text{win\textbar right first}) = \frac{w - 1}{d - o - 1}

If we picked the wrong door first, we have all the winning options left:

\Pr(\text{win\textbar wrong first}) = \frac{w}{d - o - 1}

Putting it all together:

\Pr(\text{win\textbar switch}) = \Pr(\text{pick right first}) \cdot \Pr(\text{win\textbar right first}) + \\   + \Pr(\text{pick wrong first}) \cdot \Pr(\text{win\textbar wrong first}) = \\  = \frac{w}{d} \frac{w - 1}{d - o - 1} + (1 - \frac{w}{d}) \frac{w}{d - o - 1}

As before, for the hold strategy, the probability of winning is the probability of getting it right the first time:

\Pr(\text{win\textbar hold}) = \frac{w}{d}

With the original Monty Hall problem, w = 1, d = 3 and o = 1, and therefore

\Pr(\text{win\textbar switch}) = \frac{1}{3} \cdot 0 + \frac{2}{3} \cdot 1 = \frac{2}{3}

Selvin (1975) also presents a generalisation due to Ferguson, where there are n options and p doors that are opened after the choice. That is, w = 1, d = n and o = p. Therefore,

\Pr(\text{win\textbar switch}) = \frac{1}{n} \cdot 0 + (1 - \frac{1}{n}) \frac{1}{n - p - 1} =  \frac{n - 1}{n(n - p - 1)}

which is Ferguson’s formula.

Finally, in Marilyn vos Savant’s column, she used this thought experiment to illustrate why switching is the right thing to do:

Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?

That is, w = 1 still, d = 10^6 and o = 10^6 – 2.

\Pr(\text{win\textbar switch}) = 1 - \frac{1}{10^6}
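As a sanity check of the three special cases above, here is a minimal sketch that plugs the numbers into the general formula. The function names p_win_switch and ferguson, and the choice of 10 options with 3 opened doors for the Ferguson case, are made up for this illustration.

```r
## General formula for winning by switching, with w prizes,
## d doors and o opened doors (illustrative function name)
p_win_switch <- function(w, d, o) {
  w / d * (w - 1) / (d - o - 1) +
    (1 - w / d) * w / (d - o - 1)
}

## Ferguson's formula with n options and p opened doors
ferguson <- function(n, p) {
  (n - 1) / (n * (n - p - 1))
}

## The original problem: three doors, one prize, one opened door
p_win_switch(w = 1, d = 3, o = 1)

## Ferguson's generalisation, here with 10 options and 3 opened doors;
## TRUE if the two formulas agree
all.equal(p_win_switch(w = 1, d = 10, o = 3), ferguson(n = 10, p = 3))

## A million doors, all but two opened
p_win_switch(w = 1, d = 1e6, o = 1e6 - 2)
```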

It turns out that the solution to the generalised problem is that it is always better to switch, as long as there is a prize, and as long as the host opens any doors. One can also generalise it to choosing sets of more than one door. This makes some intuitive sense: as long as the host opens some doors, taking away losing options, switching should enrich for prizes.
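
One way to see this, as a sketch under the same assumptions as above (losing doors opened at random, and a random pick among the remaining doors when switching), is to put the two terms of the switching probability on a common denominator:

\Pr(\text{win\textbar switch}) = \frac{w(w - 1) + (d - w)w}{d(d - o - 1)} = \frac{w}{d} \cdot \frac{d - 1}{d - o - 1}

That is, switching multiplies the holding probability w/d by (d – 1)/(d – o – 1). This factor is greater than one whenever at least one door is opened, and equal to one when no doors are opened.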

Some code

To be frank, I’m not sure I have convinced myself of the solution to the generalised problem yet. However, using the code below, I did try the calculation for all combinations of total number of doors, prizes and doors opened up to 100, and in all cases, switching wins. That inspires some confidence, should I end up on a small ruminant game show.

The code below first defines a wrapper around R’s sampling function, which has a very annoying alternative behaviour when fed a vector of length one, to be able to build a computational version of my physical simulation. Finally, we have a function for the above formulae. (See whole thing on GitHub if you are interested.)

library(assertthat)

## Wrap sample into a function that avoids the "convenience"
## behaviour that happens when the length of x is one

sample_safer <- function(to_sample, n) {
  assert_that(n <= length(to_sample))
  if (length(to_sample) == 1) {
    return(to_sample)
  } else {
    return(sample(to_sample, n))
  }
}

## Simulate a generalised Monty Hall situation with
## w prizes, d doors and o doors that are opened.

sim_choice <- function(w, d, o) {
  ## There have to be fewer prizes than unopened doors
  assert_that(w < d - o)
  wins <- rep(1, w)
  losses <- rep(0, d - w)
  doors <- c(wins, losses)
  ## Pick a door
  choice <- sample_safer(1:d, 1)
  ## Doors that can be opened
  to_open_from <- which(doors == 0)
  ## Chosen door can't be opened
  to_open_from <- to_open_from[to_open_from != choice]
  ## Doors to open
  to_open <- sample_safer(to_open_from, o)
  ## Switch to one of the remaining doors
  possible_switches <- setdiff(1:d, c(to_open, choice))
  choice_after_switch <- sample_safer(possible_switches, 1)
  result_hold <- doors[choice]
  result_switch <- doors[choice_after_switch]
  ## Return the outcome (1 = win, 0 = loss) of holding and switching
  c(result_hold, result_switch)
}

## Formulas for probabilities

mh_formula <- function(w, d, o) {
  ## There have to be fewer prizes than unopened doors
  assert_that(w < d - o)
  p_win_switch <- w/d * (w - 1)/(d - o - 1) +
                     (1 - w/d) * w / (d - o - 1)
  p_win_hold <- w/d
  ## Return the probabilities of winning by holding and switching
  c(p_win_hold, p_win_switch)
}

## Standard Monty Hall

mh <- replicate(1000, sim_choice(1, 3, 1))
> mh_formula(1, 3, 1)
[1] 0.3333333 0.6666667
> rowSums(mh)/ncol(mh)
[1] 0.347 0.653
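
The check over all combinations mentioned earlier could look something like the sketch below. It is self-contained, so it restates the formulas in a function with a made-up name (win_probabilities), and it assumes that "up to 100" means every value of w, d and o from 1 to 100 that leaves fewer prizes than unopened doors.

```r
## Win probabilities for holding and switching (same formulas as
## above; the function name is illustrative)
win_probabilities <- function(w, d, o) {
  p_hold <- w / d
  p_switch <- w / d * (w - 1) / (d - o - 1) +
    (1 - w / d) * w / (d - o - 1)
  data.frame(w = w, d = d, o = o,
             p_hold = p_hold, p_switch = p_switch)
}

## All combinations of prizes, doors and opened doors up to 100
grid <- expand.grid(w = 1:100, d = 1:100, o = 1:100)

## Keep the valid cases: fewer prizes than unopened doors
grid <- grid[grid$w < grid$d - grid$o, ]

results <- win_probabilities(grid$w, grid$d, grid$o)

## TRUE if switching beats holding in every valid case
all(results$p_switch > results$p_hold)
```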

The Monty Hall problem problem

Guest & Martin (2020) use this simple problem as their illustration for computational model building: two 12 inch pizzas for the same price as one 18 inch pizza is not a good deal, because the 18 inch pizza contains more food. Apparently this is counter-intuitive to many people who have intuitions about inches and pizzas.

They call the risk of having inconsistencies in our scientific understanding because we cannot intuitively grasp the implications of our models ”The pizza problem”, arguing that it can be ameliorated by computational modelling, which forces you to spell out implicit assumptions and also makes you actually run the numbers. Having a formal model of areas of circles doesn’t help much, unless you plug in the numbers.
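
Plugging in the numbers for the pizza example takes only a few lines. This is a minimal sketch that assumes perfectly round pizzas measured by diameter; the function name pizza_area is made up for the illustration.

```r
## Area of a round pizza from its diameter in inches
pizza_area <- function(diameter) {
  pi * (diameter / 2)^2
}

two_small <- 2 * pizza_area(12)
one_large <- pizza_area(18)

## Ratio of food in one large pizza to two small ones
one_large / two_small
```

The ratio is 81/72 = 1.125, so the single 18 inch pizza has 12.5% more area than the two 12 inch pizzas combined.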

The Monty Hall problem problem is the pizza problem with a vengeance; not only is it hard to intuitively grasp what is going on in the problem, but even when presented with compelling evidence, the mental resistance might still remain and lead people to write angry letters and tweets.


Guest, O, & Martin, AE (2020). How computational modeling can force theory building in psychological science. Preprint.

Selvin, S (1975) On the Monty Hall problem. The American Statistician 29:3 p.134.

Showing a difference in mean between two groups, take 2

A couple of years ago, I wrote about the paradoxical difficulty of visualising a difference in means between two groups, while showing both the data and some uncertainty interval. I still feel like many ills in science come from our inability to interpret a simple comparison of means. Anything with more than two groups or a predictor that isn’t categorical makes things worse, of course. It doesn’t take much to overwhelm the intuition.

My suggestion at the time was something like this: either a panel that shows the data and another panel with coefficients and uncertainty intervals, or a plot that shows the data with lines that represent the magnitude of the difference at the upper and lower limit of the uncertainty interval.

Alternative 1 (left), with separate panels for data and coefficient estimates, and alternative 2 (right), with confidence limits for the difference shown as vertical lines. For details, see the old post about these graphs.

Here is the fake data and linear model we will plot. If you want to follow along, the whole code is on GitHub. Group 0 should have a mean of 4, and the difference between groups should be two, and as the graphs above show, our linear model is not too far off from these numbers.


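library(broom)

data <- data.frame(group = rep(0:1, 20))
## One independent noise draw per row; rnorm(20) would be recycled
## across the 40 rows
data$response <- 4 + data$group * 2 + rnorm(nrow(data))

model <- lm(response ~ factor(group), data = data)
result <- tidy(model)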

Since the last post, a colleague has told me about the Gardner-Altman plot. In a paper arguing that confidence intervals should be used to show the main result of studies, rather than p-values, Gardner & Altman (1986) introduced plots for simultaneously showing confidence intervals and data.

Their Figure 1 shows (hypothetical) systolic blood pressure data for a group of diabetic and non-diabetic people. The left panel is a dot plot for each group. The right panel is a one-dimensional plot (with a different scale than the left panel; zero is centred on the mean of one of the groups), showing the difference between the groups and a confidence interval as a point with error bars.

There are functions for making this kind of plot (and several more complex alternatives for paired comparisons and analyses of variance) in the R package dabestr from Ho et al. (2019). An example with our fake data looks like this:

Alternative 3: Gardner-Altman plot with bootstrap confidence interval.


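library(dabestr)

## x and y columns passed as in dabestr versions before 2023 (assumed)
bootstrap <- dabest(data,
                    group,
                    response,
                    idx = c("0", "1"),
                    paired = FALSE)

bootstrap_diff <- mean_diff(bootstrap)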


While this plot is neat, I think it is a little too busy — I’m not sure that the double horizontal lines between the panels or the shaded density for the bootstrap confidence interval add much. I’d also like to use other inference methods than bootstrapping. I like how the scale of the right panel has the same unit as the left panel, but is offset so the zero is at the mean of one of the groups.

Here is my attempt at making a minimalistic version:

Alternative 4: Simplified Gardner-Altman plot.

This piece of code first makes the left panel of data using ggbeeswarm (which I think looks nicer than the jittering I used in the first two alternatives), then the right panel with the estimate and approximate confidence intervals (plus/minus two standard errors of the estimate), adjusts the scale, and combines the panels with patchwork.


library(ggplot2)
library(ggbeeswarm)
library(patchwork)

ymin <- min(data$response)
ymax <- max(data$response)

plot_points_ga <- ggplot() +
  geom_quasirandom(aes(x = factor(group), y = response),
                   data = data) +
  xlab("Group") +
  ylab("Response") +
  theme_bw() +
  scale_y_continuous(limits = c(ymin, ymax))

height_of_plot <- ymax-ymin

group0_fraction <- (coef(model)[1] - ymin)/height_of_plot

diff_min <- - height_of_plot * group0_fraction

diff_max <- (1 - group0_fraction) * height_of_plot

plot_difference_ga <- ggplot() +
  geom_pointrange(aes(x = term, y = estimate,
                      ymin = estimate - 2 * std.error,
                      ymax = estimate + 2 * std.error),
                  data = result[2,]) +
  scale_y_continuous(limits = c(diff_min, diff_max)) +
  ylab("Difference") +
  xlab("Comparison") +
  theme_bw()

(plot_points_ga | plot_difference_ga) +
    plot_layout(widths = c(0.75, 0.25))
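
The scale adjustment in the code above amounts to shifting the limits of the data panel by the mean of group 0, so that zero on the difference scale lines up with that mean. Here is a minimal sketch of just that arithmetic, with made-up numbers; the function name difference_limits and the values are illustrative and not taken from the plots.

```r
## Limits for the difference panel, given the limits of the data
## panel and the baseline (the mean of the reference group)
difference_limits <- function(ymin, ymax, baseline) {
  height <- ymax - ymin
  baseline_fraction <- (baseline - ymin) / height
  c(-height * baseline_fraction,
    (1 - baseline_fraction) * height)
}

## Made-up example: data ranging from 2 to 8, reference mean at 4
limits <- difference_limits(ymin = 2, ymax = 8, baseline = 4)
limits

## Same thing as subtracting the baseline from both limits
all.equal(limits, c(2 - 4, 8 - 4))
```

In other words, the fraction-based calculation could be replaced by c(ymin, ymax) minus the group 0 mean, which may be easier to read.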


Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal

Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2019). Moving beyond P values: data analysis with estimation graphics. Nature methods.

Reflections on teaching in higher education: teaching online

The latest in the series of courses on teaching in higher education was a workshop on e-learning, that is, online teaching. It is useful for genuine distance courses, for courses that need to be held remotely in an emergency because there happens to be a pandemic, and for any course at all, since every self-respecting course has some element of online activity these days. After all, we always use a digital learning platform to share material and to handle submissions, messages and discussions, even when the course also has classroom activities.

So far I have not had to record any lectures or demonstrations, but now I am better prepared should the need arise. One benefit of having to listen to myself far too closely in order to write the subtitles: I realised that, in my shortened lecture, I gave a rather thin explanation of a certain genetic concept (for those in the know: yes, of course it was linkage disequilibrium). If I were to use it in actual teaching, I would have to re-record that part with a better explanation.

Screenshot of me recording a lecture, with the subtitle 'don't bother to talk to me'

The automatic captioning does not have an easy time with my English combined with genetic terminology. I don’t remember what I said here, but it was something else entirely.

Otherwise, the most interesting part was talking (and to some extent complaining) with the other participants about the past year of remote teaching. A recurring complaint over the past year is how dull it is to lecture to a screen, compared with doing it in front of an audience in a room. I agree: talking to a screen is both more boring and more stressful, and it hardly gets better when a recording is running. But on the other hand, why is it so important to look at the listeners’ faces? Do we know that the students are following because they look like they are following, or that they are not following because they look like they are staring out into space, bored? No, of course not. Of course it is nice for me to see the people I am talking to, but I find it hard to see that it gives me any useful information about what they do and don’t understand, unless I ask them.

And in the opposite direction: is it important that the students see my face? Even the flipped classroom, for all its alleged radicalism, seems a little fixated on lectures. On the one hand, a video lecture feels serious, at least if it is not too clumsy: it shows that I have taken the time and effort to gather and record the material. And there is a certain entertainment value, not to be sneered at, in the teacher showing their face and addressing the students personally. On the other hand, every criticism of the lecture format (except that you cannot rewind and watch it again) can be levelled at the pre-recorded lecture. Whatever little interactivity there may be in a live lecture also disappears. What matters is that the students learn as well as possible, and the question is whether a recorded lecture prepares them better for a class or a seminar than reading instructions or a preparatory assignment would.

Journal club of one: ”Genome-wide enhancer maps link risk variants to disease genes”

(Here is a paper from a recent journal club.)

Nasser et al. (2021) present a way to prioritise potential noncoding causative variants. This is part of solving the fine mapping problem, i.e. how to find the underlying causative variants from genetic mapping studies. They do it by connecting regulatory elements to genes and overlapping those regulatory elements with variants from statistical fine-mapping. Intuitively, it might not seem like connecting regulatory elements to genes should be that hard, but it is a tricky problem. Enhancers — because that is the regulatory element most of this is about; silencers and insulators get less attention — do not carry any sequence that tells us what gene they will act on. Instead, this needs to be measured or predicted somehow. These authors go the measurement route, combining chromatin sequencing with chromosome conformation capture.

This figure from the supplementary materials shows what the method is about:

Additional figure 1 from Nasser et al. (2021) showing an overview of the workflow and an example of two sets of candidate variants derived from fine-mapping, each with variants that overlap enhancers connected to IL10.

They use chromatin sequence data (such as ATAC-seq, histone ChIP-seq or DNAse-seq) in combination with HiC chromosome conformation capture data to identify enhancers and connect them to genes (this was developed earlier in Fulco et al. 2019). The ”activity-by-contact model” means to multiply the contact frequency (from HiC) between promoter and enhancer with the enhancer activity (read counts from chromatin sequencing), standardised by the total contact–activity product with all elements within a window. Fulco et al. (2019) previously found that this conceptually simple model did well at connecting enhancers to genes (as evaluated with a CRISPR inhibition assay called CRISPRi-FlowFISH, which we’re not going into now, but it’s pretty ingenious).
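
To make the arithmetic concrete, here is a minimal sketch of the score as described above, with made-up activity and contact values for three elements near one gene. The function name abc_score and all the numbers are illustrative; the actual pipeline of Fulco et al. (2019) involves more preprocessing than this.

```r
## Activity-by-contact score for each candidate element of one gene:
## activity times contact, divided by the sum of that product over
## all elements in the window
abc_score <- function(activity, contact) {
  stopifnot(length(activity) == length(contact))
  product <- activity * contact
  product / sum(product)
}

## Made-up values for three elements within the window of a gene
activity <- c(e1 = 10, e2 = 40, e3 = 5)
contact <- c(e1 = 0.5, e2 = 0.1, e3 = 0.2)

scores <- abc_score(activity, contact)
scores
```

In this toy example, e1 gets the highest score despite having lower activity than e2, because it is in more frequent contact with the promoter.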

In order to use this for fine-mapping, they calculated activity-by-contact maps for every gene combined with every open chromatin element within 5 Mbp for 131 samples from ENCODE and other sources. The HiC data were averages of contacts in ten cell types, transformed to follow a power-law distribution. That is, they did not do the HiC analysis per cell type, but used the same average HiC contact matrix combined with chromatin data from different cell types. Thus, the specificity comes from the promoters and enhancers that are called as active in each sample. I assume this is because the HiC data come from a different (and smaller) set of cell types than the chromatin sequencing. Element–gene pairs that reached above a threshold were considered connected, for a total of about six million connections, involving 23,000 genes and 270,000 enhancers. On average, a gene had 2.8 enhancers and an enhancer connected to 2.7 genes.

They picked putative causative variants by overlapping the variant sets with these activity-by-contact maps and selecting the highest scoring enhancer–gene pair. They used fine-mapping results from multiple previous studies. These variants were estimated with different methods, but they are all some flavour of fine-mapping by variable selection. Statistical fine-mapping estimates sets of variants, called credible sets, that have a high posterior probability of containing the causative variant. They included only completely noncoding credible sets, i.e. those that did not include a coding sequence or splice variant. They applied this to 72 traits in humans, generating predictions for ~ 5000 noncoding credible sets of variants.

Did it work?

Variants from fine-mapping were enriched in connected enhancers more than in open chromatin in general, in cell types that are relevant to the traits. In particular, inflammatory bowel disease variants were enriched in enhancers in 65 samples, including immune cell types and gut cells. The strongest enrichment was in activated dendritic cells.

They used a set of genes previously known to be involved in inflammatory bowel disease, assuming that they were the true causative genes for their respective noncoding credible sets, and then compared their activity-by-contact based prioritisation of the causative gene to simply picking the closest gene. Picking the closest gene was right in 30 out of 37 sets. Picking the gene with the highest activity-by-contact score was right in 17 cases out of 18 sets that overlapped an activity-by-contact enhancer. Thus, this method had higher precision but worse recall. They also tested a number of eQTL-based, enrichment and enhancer–gene mapping methods, that did worse.
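
Spelling out the arithmetic behind that comparison, as a sketch: this assumes that recall is counted against all 37 credible sets with a known gene, so that the sets without an overlapping activity-by-contact enhancer count as misses. That denominator is an assumption made here, and the variable names are made up for the illustration.

```r
## Counts from the comparison above
n_sets <- 37           # credible sets with a known gene
closest_correct <- 30  # closest gene is the known gene
closest_called <- 37   # a closest gene can always be picked
abc_correct <- 17      # top-scoring gene is the known gene
abc_called <- 18       # sets overlapping a connected enhancer

## Precision: correct predictions out of predictions made
## Recall: correct predictions out of all sets (assumption)
comparison <- data.frame(
  method = c("closest gene", "activity-by-contact"),
  precision = c(closest_correct / closest_called,
                abc_correct / abc_called),
  recall = c(closest_correct / n_sets,
             abc_correct / n_sets)
)

comparison
```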

What it tells us about causative variants

Most of the putative causative variants, picked based on maximising activity-by-contact, were close to the proposed gene (median 13 kbp) and most involved the closest gene (77%). The putative causative variants were often located in enhancers that were only active in a few cell or tissue types (median 4), compared to the promoters of the target genes, which were active in a broader set (median 120). For example, in inflammatory bowel disease, there were several examples where the putatively causal enhancer was only active in a particular immune cell or a stimulated state of an immune cell.

My thoughts

What I like about this model is that it is so different to many of the integrative and machine learning methods we see in genomics. It uses two kinds of data, and relatively simple computations. There is no machine learning. There is no sequence evolution or conservation analysis. Instead, measure two relevant quantities, standardise and preprocess them a bit, and then multiply them together.

If the success of the activity-by-contact model for prioritising causal enhancers generalises beyond the 18 causative genes investigated in the original paper, this is an argument for simple biology-based heuristics over machine learning models. It also suggests that, in the absence of contact data, one might do well by prioritising variants in enhancers that are highly active in relevant cell types, and picking the closest gene as the proposed causative gene.

However, the dataset needs to cover the relevant cell types, and possibly cells that are in the relevant stimulated states, meaning that it provides a motivation for rich conditional atlas-style datasets of chromatin and chromosome conformation.

I am, personally, a little bit sad that expression QTL methods seem to be doing poorly. On the other hand, it makes some sense that eQTL might not have the genomic resolution to connect enhancers to genes.

Finally, if the relatively simple activity-by-contact model or the ridiculously simple method of ”picking the closest gene” beats machine learning models using the same data types and more, it suggests that the machine learning methods might not be solving the right problem. After all, they are not trained directly to prioritise variants for complex traits, because there are too few known causative variants for complex traits.


Fulco, C. P., Nasser, J., Jones, T. R., Munson, G., Bergman, D. T., Subramanian, V., … & Engreitz, J. M. (2019). Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nature genetics.

Nasser, J., Bergman, D. T., Fulco, C. P., Guckelberger, P., Doughty, B. R., Patwardhan, T. A., … & Engreitz, J. M. (2021). Genome-wide enhancer maps link risk variants to disease genes. Nature.

Researchers in ecology and evolution don’t use Platt’s strong inference, and that’s okay

This paper (Betts et al. 2021) came out about a month ago investigating whether ecology and evolution papers explicitly state mechanistic hypotheses, and arguing that they ought to, preferably multiple alternative hypotheses. It advocates the particular flavour of hypothetico–deductivism expressed by Platt (1964) as ”strong inference”.

The key idea in Platt’s (1964) account of strong inference, that distinguishes it from garden variety accounts of scientific reasoning, is his emphasis on multiple alternative hypotheses and experiments that distinguish between them. He describes science progressing like a decision tree, with experiments as branching points — a ”conditional inductive tree”. He also emphasises theory construction, as he approvingly quotes biologists on the need to think hard about what possibilities there are in order to make the most informative experiments.

The empirical part of Betts et al. (2021) consists of a literature survey, where the authors read 268 empirical articles from ecology, evolution and glam journals published 1991-2018 to see whether they explicitly stated hypotheses (that is, proposed explanations or causes, regardless of whether they used the actual word ”hypothesis”), whether these were mechanistic, and whether there were multiple working hypotheses contrasted against each other. They estimated the slope over time, and the association between hypothesis use and journal impact factor and whether the research was funded by a major grant.

The results suggest that papers with explicit hypotheses are in a minority, that there was no significant change over time, and little association with impact factor or grants. The prevalence of mechanistic hypotheses was 26% and of multiple working hypotheses 6.7%. There were no significant time trends in hypothesis use. There was a significant difference in journal impact factor in one of the comparisons, where papers with mechanistic hypotheses were published in journals with 0.3 higher impact factor on average. There was no association with grants.

The authors go on to discuss how strong inference is still useful both to the scientific community and to individual researchers, arguing that they might not get more grants or fancier papers, but they will feel better and their research will be of higher quality. How to interpret the lack of clear increase or decrease over time depends on one’s level of optimism, I guess. An optimistic take could be that the authors’ fear that machine learning and large datasets are turning researchers away from explanation seems not to be a major concern. A pessimistic take could be, like they suggest in the Discussion, that decades of admonitions to do hypothetico–deductive science have not had much effect.

Thinking about causality is a good thing

I wholeheartedly agree with the authors that thinking about explanations, causality and mechanism is a useful thing to do, and probably something we ought to do more. It is probably useful to spend more time than we do (for me, to spend more time than I do) thinking about how theories map to testable hypotheses, how those hypotheses map to quantities that can be estimated, and how well the methods and data at hand manage to perform that estimation. Some of my best lessons from science over the last years have come from that sort of thing.

I also agree with them that causality is often what scientists are after — even in many cases where we think that the goal is prediction, the most trustworthy explanation for any ability of a prediction model to generalise is going to be an explanation in terms of mechanisms. They don’t go into this too much, but the caption of Figure 1 gives an example of how even when we are interested in prediction, explanations can be handy.

To take an example from my field: genomic prediction, when we fit statistical models to DNA data to predict trait values for breeding, seems like a pure prediction problem. And animal breeders are pragmatic enough to use anything that worked; if tea leaves worked well for breeding value prediction, they would use them (I am sure I have heard or read some animal breeding researcher make that joke, but I can’t find the source now). But why don’t tea leaves work, while single nucleotide markers spaced somewhat evenly across the genome do? Because we have a fairly well established theory for how genetic variants cause trait variation between individuals in a fairly predictable way. That doesn’t automatically mean that the statistical associations and predictions will transfer between situations — in fact they don’t. But there is theory that helps explain why genomic predictions generalise more or less well.

I also like that they, when they define what a hypothesis is (a proposal of a mechanism or cause of a phenomenon), make very clear that statistical hypotheses and null hypotheses don’t count as scientific hypotheses. There is more to explore here about the relationship between statistical inference and scientific hypotheses, and about the rhetorical move to declare something the null or default model, but that is for another day.

If scientists don’t use strong inference, maybe the problem isn’t with the scientists

Given the mostly negative results, the discussion starts as follows:

Overall, the prevalence of hypothesis use in the ecological and evolutionary literature is strikingly low and has been so for the past 25 years despite repeated calls to reverse this pattern […]. Why is this the case?

They don’t really have an answer to this question. They consider whether most work is descriptive fact finding, or purely about making prediction models, but argue that it is unlikely that 75% of ecology and evolution research is like that, and I agree. They consider a lack of individual incentives for formulating hypotheses, and that might be true; there was no striking association between hypotheses, grants or glamorous publications (unless you consider 0.3 journal impact factor units a compelling individual-level incentive). They suggest that there are costs to hypothesising: it ”can feel like a daunting hurdle”. However, they do not consider the option that their proposed model of science isn’t actually a useful method.

To think about that, we should discuss some of the criticisms of strong inference.

O’Donohue & Buchanan (2001) criticise the strong inference model by arguing that there are problems with each step of the method, and that the history of science anecdotes that Platt uses to illustrate it actually show little evidence of being based on strong inference.

Specifically, Platt’s first step, devising alternative hypotheses, is problematic both because one might lack the background knowledge to devise many alternative hypotheses, and because there is no sure way to enumerate the plausible alternative hypotheses.

The second step, devising crucial experiments, is problematic because of the Duhem–Quine problem, namely that experiments are never conclusive; even when the data are inconsistent with a hypothesis, we do not know whether the problem is with the hypothesis or with any number of, sometimes implicit, auxiliary assumptions. (By the way, I love that Betts et al. cite two ecologists called Quinn and Dunham (1983) who wrote about problems with conclusively testing hypotheses in ecology and evolution. I wish they got together to write it just because the names are so perfect for the topic.)

The third step, conducting crucial experiments, is problematic because experimental results may not cleanly separate hypotheses. Then again, would Platt not just reply that one ought to devise a better experiment then? This objection seems weak. Science is hard and it seems perfectly possible that there are lots of plausible alternative hypotheses that can’t be told apart, at least with data that can be realistically gathered.

Finally, O’Donohue & Buchanan (2001) go through some of the examples Platt gives of supposed strong inference, and suggest that Platt did not represent them accurately. And Platt’s paper really reads as a series of hero-worshipping anecdotes about great scientists, who were very successful and therefore must have employed strong inference. It is not convincing as history of science.

Betts et al. (2021) instead give two examples of science that they suppose could have been helped by strong inference. The first example is Lamarck, who supposedly might have come up with evolution by natural selection if he had entertained multiple working hypotheses. The second is psychologist Amy Cuddy’s power pose work, which supposedly could have been more reproducible had it considered more causal mechanisms. They give no analysis of Lamarck’s scientific method or argument for how strong inference might have helped him. The evidence that strong inference could have helped Amy Cuddy is that she said in an interview that she should have considered the psychological mechanisms behind power posing more.

The claim, inherited from Platt, that multiple working hypotheses reduce confirmation bias really cries out for evidence. As far as I can tell, neither Platt nor Betts et al. provide any, beyond the intuition that you get less attached to one hypothesis if you entertain more than one. That doesn’t seem unreasonable to me, but it just shoves the problem to the next step. Now I have several plausible hypotheses, and I need to decide on one of them that will advance my decision tree of experiments to the next branching point and provide the headline result for my next paper. That choice seems to me to be just as ripe for confirmation bias and perverse incentives as the choice to call the result for or against a particular hypothesis. In cases where there are only two hypotheses that are taken to be mutually exclusive, the distinction seems only rhetorical.

How Betts et al. (2021) themselves use hypotheses

Let us look at how Betts et al. (2021) themselves use hypotheses and whether they successfully use strong inference for the empirical part of the paper.

The abstract states two hypotheses: that the number of papers with explicit hypotheses could have decreased because of a perceived rise in descriptive big data research, and that explicit hypotheses could have increased because of hypotheses being promoted by journals and funders. Neither turns out to be consistent with the data, which show a steady low prevalence of explicitly stated hypotheses.

One should note that in no way are these two mechanistic accounts mutually exclusive. If the slope of the line had been positive, that would have no logical force to compel us to believe that the rise of machine learning in research did not lead some researchers to abandon hypothesis-driven research. At most, we could conclude that the net effect of the forces that promote and discourage explicit hypotheses tips towards the former.

Thus, we see two of the objections to Platt’s strong inference paradigm in action: the set of alternative hypotheses is by no means covering the whole range of possibilities, and the study in question is not a conclusive test that allows us to exclude any of them.

In the second set of analyses, measuring whether explicitly stating a hypothesis was associated with journal ranking, citations, or funding, the authors predict that hypotheses ought to be associated with these things if they confer academic success. This conforms to their ”if–then” pattern for a research hypothesis, so presumably it is a hypothesis. In this case, there is no alternative hypothesis. This illustrates a third problem with Platt’s strong inference, namely that it is seldom actually applied in real research, even by its proponents, presumably because it is difficult to do so.

If we look at these two sets of analyses (considering change over time in explicit hypothesis use and association between hypothesis use and individual-researcher incentives) and the main message of the paper, which is that strong inference is useful and needs to be encouraged, there is a disconnect. The two sets of hypotheses, whether they are examples of strong inference or not, do not in any way test the theory that strong inference is a useful scientific method, or the normative claim that it therefore should be incentivised — rather, they illustrate them. We can ask Platt’s diagnostic question from the 1964 paper about the idea that strong inference is a method that needs to be encouraged — what would disprove this view? Some kind of data, surely, but nothing that was analysed in this paper.

I hypothesise that this is common in scientific papers. A lot of the reasoning goes on at a higher level than the hypothesis (theories, frameworks, normative stances), and whether individual hypotheses stand or fall has little bearing on these larger structures. This is not necessarily bad or unscientific, even if it does not conform to Platt’s strong inference.

Method angst

Finally, the paper starts out with a strange anecdote: the claim that there is in the beginning of most scientists’ careers a period of ”hypothesis angst” where the student questions the hypothetico–deductive method. This is stated without evidence, and without following through on the cliff-hanger by explaining how the angst resolves. How are early career scientists convinced to come back into the fold? The anecdote becomes even stranger once you realise that, according to their data, explicit hypothesis use isn’t very common. If most research doesn’t use explicit hypotheses, it seems more likely that students, who have just sat through courses on scientific method, would feel cognitive dissonance, annoyance or angst over the fact that researchers around them don’t state explicit hypotheses or follow the simple schema of hypothetico–deductivism.


Betts, MG, Hadley, AS, Frey, DW, et al. When are hypotheses useful in ecology and evolution? Ecol Evol. 2021; 00: 1-15.

O’Donohue, W., & Buchanan, J. A. (2001). The weaknesses of strong inference. Behavior and philosophy, 1-20.

Platt, JR. (1964) Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science 146 (3642), 347-353.

A genetic mapping animation in R

Cullen Roth posted a beautiful animation of quantitative trait locus mapping on Twitter. It is pretty amazing. I wanted to try to make something similar in R with gganimate. It’s not going to be as beautiful as Roth’s animation, but it will use the same main idea of showing both a test statistic along the genome, and the underlying genotypes and trait values. For example, Roth’s animation has an inset scatterplot that appears above the peak after it’s been reached; to do that I think we would have to go a bit lower-level than gganimate and place our plots ourselves.

First, we’ll look at a locus associated with body weight in chickens (with data from Henriksen et al. 2016), and then a simulated example. We will use ggplot2 with gganimate and a magick trick for combining the two animations. Here are some pertinent snippets of the code; as usual, find the whole thing on GitHub.

LOD curve

We will use R/qtl for the linkage mapping. We start by loading the data file (Supplementary Dataset from Henriksen et al. 2016). A couple of individuals have missing covariates, so we won’t be able to use them. This piece of code first reads the cross file, and then removes the two offending rows.


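## Load R/qtl, which provides the mapping functions used below
library(qtl)

## Read cross file
cross <- read.cross(format = "csv",
                    file = "41598_2016_BFsrep34031_MOESM83_ESM.csv")

## Remove the two individuals with missing covariates
cross <- subset(cross, ind = c("-34336", "-34233"))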

For nice plotting, let’s restrict ourselves to fully informative markers (that is, the ones that tell the two founder lines of the cross apart). There are some partially informative ones in the dataset too, and R/qtl can get some information out of them thanks to genotype probability calculations with its Hidden Markov Model. They don’t make for nice scatterplots though. This piece of code extracts the genotypes and identifies informative markers as the ones that only have genotype codes 1, 2 or 3 (homozygote, heterozygote and other homozygote), but not 5 and 6, which are used for partially informative markers.

## Get informative markers and combine with phenotypes for plotting

geno <- as.data.frame(pull.geno(cross,
                                chr = 1))

geno_values <- lapply(geno, unique)
informative <- unlist(lapply(geno_values,
    function(g) all(g %in% c(1:3, NA))))

geno_informative <- geno[informative]
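
To see what this filter does, here is a minimal sketch on a made-up genotype table. The marker names and genotype codes are invented for illustration and are not taken from the chicken data. Any marker that contains a code other than 1, 2, 3 or a missing value is dropped.

```r
## Toy genotype table: rows are individuals, columns are markers.
## Marker names and genotype codes are made up for illustration.
toy_geno <- data.frame(m1 = c(1, 2, 3, NA),
                       m2 = c(1, 5, 2, 3),
                       m3 = c(2, 2, NA, 1))

## A marker counts as informative if it only has codes 1, 2, 3 (or missing)
toy_values <- lapply(toy_geno, unique)
toy_informative <- unlist(lapply(toy_values,
    function(g) all(g %in% c(1:3, NA))))

## Keep only the informative columns; m2 is dropped because of the code 5
toy_geno_informative <- toy_geno[toy_informative]
print(names(toy_geno_informative))
```

Note that NA has to be in the set of allowed values, because `%in%` would otherwise flag every marker with a missing genotype as uninformative.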

Now for the actual scan. We run a single QTL scan with covariates (sex, batch that the chickens were reared in, and principal components of genotypes), and pull out the logarithm of the odds (LOD) across chromosome 1. This piece of code first prepares a design matrix of the covariates, and then runs a scan of chromosome 1.

## Prepare covariates
pheno <- pull.pheno(cross)

## The covariates are columns of the phenotype data frame
covar <- model.matrix(~ sex_number + batch + PC1 + PC2 + PC3 + PC4 + 
                        PC5 + PC6 + PC7 + PC8 + PC9 + PC10,
                      data = pheno,
                      na.action = na.exclude)[,-1]

scan <- scanone(cross = cross,
                pheno.col = "weight_212_days",
                method = "hk",
                chr = 1,
                addcovar = covar)
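
For intuition about what the LOD score measures, here is a rough sketch of the calculation at a single fully genotyped marker, using simulated data rather than the chicken cross. With normally distributed residuals, the LOD is (n/2) log10(RSS0/RSS1), where RSS0 and RSS1 are the residual sums of squares without and with the genotype in the model. The variable names, effect sizes and sample size are all assumptions made up for the example. As far as I understand, the Haley–Knott method in R/qtl does the equivalent at every position, with genotype probabilities standing in where genotypes are uncertain.

```r
set.seed(1)
n <- 200

## Simulated F2 genotypes (1, 2, 3) in 1:2:1 proportions, and a sex covariate
genotype <- sample(1:3, n, replace = TRUE, prob = c(0.25, 0.5, 0.25))
sex <- sample(0:1, n, replace = TRUE)

## Simulated trait with an additive genotype effect and a sex effect
weight <- 1000 + 100 * genotype + 300 * sex + rnorm(n, sd = 150)

## Null model (covariate only) and QTL model (covariate plus genotype)
model_null <- lm(weight ~ sex)
model_qtl <- lm(weight ~ sex + factor(genotype))

rss_null <- sum(residuals(model_null)^2)
rss_qtl <- sum(residuals(model_qtl)^2)

## LOD score: log10 likelihood ratio of the two models
lod_score <- (n / 2) * log10(rss_null / rss_qtl)
print(lod_score)
```

The same number falls out of the difference in log-likelihood between the two models divided by log(10), which is a handy check.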

Here is the LOD curve along chromosome 1 that we want to animate. The peak is the biggest-effect growth locus in this intercross, known as ”growth1”.

With gganimate, animating the points is as easy as adding a transition layer. This piece of code first makes a list of some formatting for our graphics, then extracts the LOD scores from the scan object, and makes the plot. By setting cumulative in transition_manual, the animation will add one data point at a time, while keeping the old ones.


formatting <- list(theme_bw(base_size = 16),
                   theme(panel.grid = element_blank(),
                         strip.background = element_blank(),
                         legend.position = "none"),
                   scale_colour_manual(values =
                         c("red", "purple", "blue")))

## Extract the LOD scores from the scan object as a plain data frame
lod <- as.data.frame(scan)
lod <- lod[informative,]
lod$marker_number <- 1:nrow(lod)

plot_lod <- qplot(x = pos,
                  y = lod,
                  data = lod,
                  geom = c("point", "line")) +
  ylab("Logarithm of odds") +
  xlab("Position") +
  formatting +
  transition_manual(marker_number,
                    cumulative = TRUE)
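
The cumulative transition can be tried out in isolation on a tiny made-up curve. This is only a sketch: the data frame, its column names and the values are invented, and it assumes ggplot2 and gganimate are installed. The rendering call is left commented out because it also needs a renderer such as the gifski package.

```r
library(ggplot2)
library(gganimate)

## Made-up test statistic along a made-up chromosome
toy <- data.frame(pos = seq(0, 90, by = 10),
                  stat = c(1, 2, 4, 7, 9, 8, 5, 3, 2, 1))
toy$frame <- seq_len(nrow(toy))

## cumulative = TRUE keeps the points from earlier frames on screen,
## so the curve builds up from left to right
toy_animation <- ggplot(toy, aes(x = pos, y = stat)) +
  geom_point() +
  geom_line() +
  transition_manual(frame, cumulative = TRUE)

## Rendering, one frame per row:
## animate(toy_animation, nframes = nrow(toy), fps = 2)
```

With cumulative left at its default of FALSE, each frame would show only the single point belonging to that frame.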

Plot of the underlying data

We also want a scatterplot of the data. Here is what a jittered scatterplot will look like at the peak. The horizontal axes are genotypes (one homozygote, heterozygote in the middle, the other homozygote) and the vertical axis is the body mass in grams. We’ve separated the sexes into small multiples. Whether to give both sexes the same vertical axis or not is a judgement call. The hens weigh a lot less than the roosters, which means that it’s harder to see patterns among them when put on the same axis as the roosters. On the other hand, if we give the sexes different axes, it will hide that difference.

This piece of code builds a combined data frame with informative genotypes and body mass. Then, it makes the above plot for each marker into an animation.


## Combined genotypes and weight
geno_informative$id <- pheno$id
geno_informative$w212 <- pheno$weight_212_days
geno_informative$sex <- pheno$sex_number

melted <- pivot_longer(geno_informative,
                       -c("id", "w212", "sex"))

melted <- na.exclude(melted)

## Add marker numbers
marker_numbers <- data.frame(name = rownames(scan),
                             marker_number = 1:nrow(scan),
                             stringsAsFactors = FALSE)

melted <- inner_join(melted, marker_numbers)

## Recode sex to words
melted$sex_char <- ifelse(melted$sex == 1, "male", "female")

plot_scatter <- qplot(x = value,
                     geom = "jitter",
                     y = w212,
                     colour = factor(value),
                     data = melted) +
  facet_wrap(~ factor(sex_char),
             ncol = 1) +
  xlab("Genotype") +
  ylab("Body mass") +
  formatting +
  transition_manual(marker_number)
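
For comparison, the other side of the axis judgement call, separate vertical axes for each sex, could be sketched like this. The data are simulated and the column names are made up; the only change that matters is the scales argument to facet_wrap.

```r
library(ggplot2)

## Made-up body masses for two sexes and three genotypes
set.seed(2)
toy <- data.frame(genotype = rep(1:3, times = 40),
                  sex = rep(c("female", "male"), each = 60))
toy$mass <- ifelse(toy$sex == "male", 1500, 1000) +
  50 * toy$genotype + rnorm(nrow(toy), sd = 100)

## scales = "free_y" gives each panel its own vertical axis
plot_free <- ggplot(toy, aes(x = genotype, y = mass,
                             colour = factor(genotype))) +
  geom_jitter(width = 0.2, height = 0) +
  facet_wrap(~ sex, ncol = 1, scales = "free_y")
```

This makes the pattern within each sex easier to see, at the cost of hiding the difference between the sexes unless the reader checks the axis labels.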

Combining the animations

And here is the final animation:

To put the pieces together, we use this magick trick (posted by Matt Crump). That is, animate the plots, one frame for each marker, and then use the R interface for ImageMagick to put them together and write them out.

gif_lod <- animate(plot_lod,
                   fps = 2,
                   width = 320,
                   height = 320,
                   nframes = sum(informative))

gif_scatter <- animate(plot_scatter,
                       fps = 2,
                       width = 320,
                       height = 320,
                       nframes = sum(informative))

## Magick trick from Matt Crump

mgif_lod <- image_read(gif_lod)
mgif_scatter <- image_read(gif_scatter)

new_gif <- image_append(c(mgif_lod[1], mgif_scatter[1]))
for(i in 2:sum(informative)){
  combined <- image_append(c(mgif_lod[i], mgif_scatter[i]))
  new_gif <- c(new_gif, combined)
}

image_write(new_gif, path = "out.gif", format = "gif")


Henriksen, Rie, et al. ”The domesticated brain: genetics of brain mass and brain structure in an avian species.” Scientific reports 6.1 (2016): 1-9.

Research politics

How is science political? As a working scientist, but not a political scientist or a scholar of science and technology studies, I can immediately think of three categories of social relations that are important to science, and can be called ”politics”.

First, there is politics going on within the scientific community. We sometimes talk about ”the politics within a department” etc, and that seems like not just a metaphor but an accurate description. Who has money, who gets a position, who publishes where? This probably happens at different levels and sizes of micro-cultures, and we don’t have to imagine that it’s an altogether Machiavellian cloak-and-dagger affair. But we can ask ourselves simple questions like: Who in here is a big shot? Who is feared? Who do you turn to when you need to get something done? Who do you turn to when you need a name?

To the extent that scientists are humans living in a society, the politics within science is probably not all too dissimilar to politics outside. And to the extent that ideas attach themselves to people, this matters to the content of science, not just the people who do it. Sometimes, science changes in a process of refinement of models that looks relatively rational and driven by theories and data. Sometimes, it changes, or doesn’t change, by bickering and animosity. Sometimes it changes because the proponents of certain ideas have resources and others don’t. Maybe we can imagine a scenario where parallel invention happens so often that, on average, it doesn’t matter who is in or out and the good ideas prevail. I doubt that is generally the case, though.

Second, there is politics in the sense of policy: government policy, international organisation policy, funding agency priorities, the strategy of a non-governmental organisation etc. Such organisations obviously have power over what research gets done and how, as they should in a democratic society — and as they certainly will make sure to, in any kind of society. To the extent that scientists respond to economic incentives and follow rules, that puts science in connection with politics. Certainly, any scientist involved in the process of applying for funding spends a lot of time thinking about how science aligns with policy and how it is useful.

Because, third, science is useful, which makes it political in the same sense that it is ethical or unethical — research responds to and has effects, even if often modest, on issues in the world. I would argue that science almost always aspires to do something useful, even if indirectly, even in basic science and obscure topics. Scientists are striving to make a difference, because they know how their topic can make a difference, when this isn’t common knowledge. Who knew that it would be important to study the molecular biology of emerging coronaviruses? Well, researchers who studied emerging coronaviruses, of course.

But even if researchers didn’t strive to do any good, all those grant applications were completely insincere and Hardy’s Mathematician’s Apology were right that researchers are chiefly driven by curiosity, pride and ambition … Almost all research would still have some, if modest, political ramifications. If there were no conceivable, even indirect, ways that some research affects any decisions taken by anyone — I’d say it’s either a case of very odd research indeed or very poor imagination.

This post is inspired by this tweet by John Cole, in turn replying to Hilary Agro. I don’t know who these scientists who don’t think that science has political elements are, but I’ll just agree and say that they are thoroughly mistaken.

Academic writing: Magnus Linton on good text

A few weeks ago, the journalist and author Magnus Linton paid a video visit to SLU and talked about writing. He is on a mission to get researchers to write better: he has been ”writer-in-residence” (nice title) at Uppsala University and has run the workshop project Skriven mening on the same theme. It wasn’t clear to me whether this is the same project still going on, even though according to its page it should have ended last year, or some kind of continuation; no matter. The talk was about what good text is, followed by tips. He has a book on the same subject, Text & stil, which I haven’t read yet but now very much want to, to say the least.

Linton has mostly worked with researchers in the social sciences and humanities, but I still think several of his points carry over to texts in the natural sciences. Since I work in a field with a somewhat different writing culture, I have some trouble recognising the genre he described as ”your next contribution to an edited volume”, but I do at least know what ”an essay on a daily newspaper’s culture page” is, even if it is doubtful that I will ever write one about my research. (Culture page editors, by all means email me!) The talk was about writing aimed at more than the immediate circle: ”text that is not written solely to teach or persuade one’s closest colleagues”. That can include, for example, grant applications, which are usually read and assessed by knowledgeable people who are not experts on the specific topic of the application. And I dare say that it can to some extent apply to research articles, at least those that aren’t written entirely according to the standard template, for example various review, perspective or letter-to-the-editor articles.

I like that Linton unabashedly talked about good text. It seems to me that researchers like to use euphemisms, saying that something is ”precise” or ”clear”, when they mean ”good”. What does he mean by good, then? A good text has four things going for it: First, it is a meeting between author and reader, in that it respects the reader’s ability and leaves enough room for the reader to think for themselves. Second, it is meaningful in three different spheres at once: for the author, in that there is motivation and interest in writing it; for the author’s own academic environment, in that it has enough sharpness and rigour; and for the world outside, in that it is comprehensible and uses a minimum of jargon. Third, it takes risks. Fourth, it is about something, a particular problem, and can show why that problem matters.

It was easy to recognise the problems he described. Many researchers write ”narcissistically or ungenerously”. Many researchers write anxiously: ”One gets the impression that the main aim is to not make mistakes.” Some texts are in practice reviews that sample other people’s conclusions without contributing anything new; here he added that natural scientists might not recognise that description, and I giggled to myself, because I recognised it very well. He was also critical of the concept of popular science, because it ”makes researchers focus on simplicity, even in their language”. That is probably true, perhaps not of journalistically skilled science writers, but certainly of us researchers when we try to write so that non-researchers can understand. It is true that the result easily becomes too meagre and too focused on explaining.

For the long list of concrete tips I think one has to turn to the book, but here are some memorable tips and techniques from the talk: Create movement with shifts of perspective and by using material other than just your own study. All kinds of material (experiences, debates, events) can have a place in an essay. Think about the beginning and the end; academic text often ends weakly by just petering out. (A friend recently read a draft of one of my applications and pointed out: ”You don’t end with anything”. That was true.) As for the beginning, don’t give a lot of dry background or a string of abstract questions, but start with something that carries a charge, shows why what follows is important, and builds trust with the reader. Someone asked about the problems of writing about controversial ethical issues, and the answer was that one should be glad that one’s topic is controversial. Balance facts (”how”) and analysis (”why”). Factual claims without analysis are boring, but analysis without concrete facts is incomprehensible. I am reminded of a quote from JBS Haldane (from his ”How to write a popular science article”) that was quoted in the podcast Genetics Unzipped recently: ”start from a known fact, such as a bomb explosion, a bird’s song or a piece of cheese” and ”proceed to your goal in a series of hops rather than a single long jump”.

Did I learn anything, then? There were many good ideas worth committing to memory. One of them I tested right away: removing the subheadings so that the text has to hang together without them. But if there is one thing we have learned from studying and teaching, it is that you don’t get much better at doing something by listening to someone talk about it, however fun and wise it may be. Still, I will pull out some of his tips the next time I sit agonising over a deadly boring text I’m writing.

Two books about academic writing

The university gave us gift cards for books for Christmas, and I spent them on academic self-help books. I expect that reading them will make me completely insufferable and, I hope, teach me something. Two of these books deal with how to write, but in very different ways: ”How to write a lot” by Paul Silvia and ”How to take smart notes” by Sönke Ahrens. In some ways, they have diametrically opposite views of what academic writing is, but they still agree on the main practical recommendation.


”How to write a lot” by Paul Silvia

In line with the subtitle — ”a practical guide to productive academic writing” — and the publication with the American Psychological Association brand ”LifeTools”, this is an extremely practical little book. It contains one single message that can be stated simply, a few chapters of elaboration on it, and a few chapters of padding in the form of advice on style, writing grant applications, and navigating the peer review process.

The message can be summarised like this: In order to write a lot, schedule writing time every day (in the morning, or in the afternoon if you are an afternoon person) and treat it like a class you’re teaching, in the sense that you won’t cancel or schedule something else over it unless absolutely necessary. In order to use that time productively, make a list of concrete next steps that will advance your writing projects (that may include other tasks, such as data analysis or background reading, that make the writing possible), and keep track of your progress. You might consider starting or joining a writing group for motivation and accountability.

If that summary was enough to convince you that keeping a writing schedule is a good thing, and give you an idea of how to do it, there isn’t that much else for you in the book. You might still want to read it, though, because it is short and quite funny. Also, the chapters I called padding contain sensible advice: carefully read the instructions for the grant you want to apply for, address all the reviewer comments either by changing something or providing a good argument not to change it, and so on. There is value in writing these things down; the book has potential as something to put in the hands of new researchers. The chapter on style is fine, I guess. I like that it recommends semicolons and discourages acronyms. But what is wrong with the word ”individuals”? Nothing, really, it’s just another academic advice-giver strunkwhiting their pet peeves.

However, if you aren’t convinced about the main message, the book provides a few sections trying to counter common counterarguments to scheduling writing time (”specious barriers” according to the author) and cites some empirical evidence. That evidence consists of one (1) publication about writing habits, which itself is a book with the word ”self-help” in the title (Boice 1990). The data are re-drawn as a bar chart without sample sizes or uncertainty indicators. Uh-oh. I couldn’t get a hold of the book itself, but I did find this criticism of it (Sword 2016):

The admonition ‘Write every day!’ echoes like a mantra through recent books, manuals, and online resources on academic development and research productivity … [long list of citations including the first edition of Silvia’s book]. Ironically, however, this research-boosting advice is seldom backed up by the independent research of those who advocate it. Instead, proponents of the ‘write every day’ credo tend to base their recommendations mostly on anecdotal sources such as their own personal practice, the experiences of their students and colleagues, and the autobiographical accounts of full-time professional authors such as Stephen King, Annie Lamott, Maya Angelou, and bell hooks (see King, 2000, p. 148; Lamott, 1994, pp. xxii, 232; Charney, 2013; hooks, 1999, p. 15). Those who do seek to bolster their advice with research evidence almost inevitably cite the published findings of behavioural psychologist Robert Boice, whose famous intervention studies with ‘blocked’ writers took place more than two decades ago, were limited in demographic scope, and have never been replicated. Boice himself laced his empirical studies with the language of religious faith, referring to his write-every-day crusade as ‘missionary work’ and encouraging those who benefitted from his teachings to go forth and recruit new ‘disciples’ (Boice, 1990, p. 128). Remove Boice from the equation, and the existing literature on scholarly writing offers little or no conclusive evidence that academics who write every day are any more prolific, productive, or otherwise successful than those who do not.

You can tell from the tone where that is going. Sword goes on to give observational evidence that many academics don’t write every day and still do well enough to make it into an interview study of people considered ”exemplary writers”. Then again, maybe they would do even better if they did block out an hour of writing every morning. Sword ends by saying that she still recommends scheduled writing, keeping track of progress, etc, for the same reason as Silvia does — because they have worked for her. At any rate, the empirical backing seems relatively weak. As usual with academic advice, we are in anecdote country.

This book assumes that you have a backlog of writing to do and that academic writing is a matter of applying body to chair, hands to keyboard. You know what to do, now go do the work. This seems to often be true in natural science, and probably also in Silvia’s field of psychology: when we write a typical journal article we know what we did, what the results were, and we have a fair idea about what about them is worth discussing. I’m not saying it’s necessarily easy, fun or painless to express that in stylish writing, but it doesn’t require much deep thought or new ideas. Sure, the research takes place in a larger framework of theory and ideas, but each paper moves that frame only very slightly, if at all. Silvia has this great quote that I think gets the metaphor right:

Novelists and poets are the landscape artists and portrait painters; academic writers are the people with big paint sprayers who repaint your basement.

Now on to a book about academic writing that actually does aspire to tell you how to have deep thoughts and new ideas.

”How to take smart notes” by Sönke Ahrens

”How to take smart notes”, instead, is a book that de-emphasises writing as a means to produce a text, and emphasises writing as a tool for thinking. It explains and advocates for a particular method for writing and organising research notes (about the literature and one’s own ideas), arguing that it can make researchers both more productive and creative. One could view the two books as dealing with two different steps of writing, with Ahrens’ book presenting a method for coming up with ideas and Silvia’s book presenting a method for turning those ideas into manuscripts, but Ahrens actually seems to suggest that ”How to take smart notes” provides a workflow that goes all the way to finished product; as such, it paints a very different picture of the writing process.

The method is called Zettelkasten, which is German for a filing box for index cards (or ”slip-box”, but I refuse to call it that), and metonymically a note-taking method that uses one. That is, you use a personal index card system for research notes. In short, the point of the method is that when you read about or come up with an interesting idea (fact, hypothesis, conjecture etc) that you want to save, you write it on a single note, give it a number, and stick it in your archive. You also pull out other notes that relate to the idea, and add links between this new note and what’s already in your system. The box is nowadays metaphorical and replaced by software. Ahrens is certainly not the sole advocate; the idea even has its own little movement with hashtags, a subreddit and everything.

It is fun to compare ”How to take smart notes” to ”How to write a lot”, because early in the book, Ahrens criticises writing handbooks for missing the point by starting too late in the writing process (that is, when you already know what kind of text you are going to write) and neglecting the part that Ahrens thinks is most important: how you get the ideas in place to know what to write. He argues that the way to know what to write is to read widely, take good notes and make connections between those notes, and then the ideas for things to write will eventually emerge from the resulting structure. Then, at some point, you take the relevant notes out of the system, arrange them in order into a manuscript, and then edit over them until the text is finished. So, we never really sit down to write. We write parts of our texts every day as notes, and then we edit them into shape. That is a pretty controversial suggestion, but it’s also charming. Overall, this book is delightfully contrarian, asking the reader not to plan their writing but be guided by their professional intuition; not to brainstorm ideas, because the good ideas will be in their notes and not in their brain; not to worry about forgetting what they read because forgetting is actually a good thing; to work only on things they find interesting, and so on.

This book is not practical, but an attempt to justify the method and turn it into a writing philosophy. I like that choice, because that is much more interesting than a simple guide. The explanation for how to practically implement a Zettelkasten system takes up less than three pages of the book, and does not include any meaningful practical information about how to set it up on a computer. All of that can be found on the internet in much greater detail. I’ve heard Ahrens say in an interview that the reason he didn’t go into the technology too much is that he isn’t convinced there is a satisfying software solution yet; I agree. The section on writing a paper, Zettelkasten-style, is less than five pages. The rest of the book is trying to connect the methods to observations from pedagogy and psychology, and to lots of anecdotes. Investor Charlie Munger said something about knowledge? You bet it can be read as an endorsement of Zettelkasten!

Books about writing can reveal something about how they were written, and both these books do. In ”How to write a lot”, Silvia talks about his own writing schedule and even includes a photo of his workspace to illustrate the point that you don’t need fancy equipment. In ”How to take smart notes”, Ahrens gives an example of how his note-taking method led him to an idea:

This book is also written with the help of a slip-box. It was for example a note on ”technology, acceptance problems” that pointed out to me that the answer to the question why some people struggle to implement the slip-box could be found in a book on the history of the shipping container. I certainly would not have looked for that intentionally — doing research for a book on effective writing! This is just one of many ideas the slip-box pointed out to me.

If we can learn something from what kinds of text this method produces by looking at ”How to take smart notes”, it seems that the method might help make connections between different topics and gather illustrative anecdotes, because the book is full of those. This also seems to be something Ahrens values in a text. On the other hand, it also seems that the method might lead to disorganised text, because the book also is full of that. It is divided into four principles and six steps, but I can neither remember what the steps and principles are nor how they relate to each other. The principle of organisation seems to be free form elaboration and variation, rather than disposition. Maybe it would work well as a hypertext, preserving some of the underlying network structure.

But we don’t know what the direction of causality is here. Maybe Ahrens just writes in this style and likes this method. Maybe with different style choices or editing, a Zettelkasten-composed text will look just like any other academic text. It must be possible to write plain old IMRAD journal articles with this system too. Imagine I needed to write an introductory paragraph on genetic effects on growth in chickens and were storing my notes in a Zettelkasten; I go to a structure note about growth in chickens, pull out all my linked literature notes about different studies, all accompanied by my own short summaries of what they found. Seems like this could be pretty neat, even for such a modest intellectual task.

Finally, what is that one main practical recommendation that both books, despite their utterly different perspectives on writing, agree on? To make it a habit to write every day.


Sword, H. (2016). ‘Write every day!’: a mantra dismantled. International Journal for Academic Development, 21(4), 312-322.

Silvia, P. J. (2019). How to write a lot: A practical guide to productive academic writing. Second edition. American Psychological Association.

Ahrens, S. (2017). How to take smart notes: One simple technique to boost writing, learning and thinking. North Charleston, SC: CreateSpace Independent Publishing Platform.

Theory in genetics

A couple of years ago, Brian Charlesworth published this essay about the value of theory in Heredity. He liked the same Sturtevant & Beadle quote that I liked.

Two outstanding geneticists, Alfred Sturtevant and George Beadle, started their splendid 1939 textbook of genetics (Sturtevant and Beadle 1939) with the remark ‘Genetics is a quantitative subject. It deals with ratios, and with the geometrical relationships of chromosomes. Unlike most sciences that are based largely on mathematical techniques, it makes use of its own system of units. Physics, chemistry, astronomy, and physiology all deal with atoms, molecules, electrons, centimeters, seconds, grams—their measuring systems are all reducible to these common units. Genetics has none of these as a recognizable component in its fundamental units, yet it is a mathematically formulated subject that is logically complete and self contained’.

This statement may surprise the large number of contemporary workers in genetics, who use high-tech methods to analyse the functions of genes by means of qualitative experiments, and think in terms of the molecular mechanisms underlying the cellular or developmental processes, in which they are interested. However, for those who work on transmission genetics, analyse the genetics of complex traits, or study genetic aspects of evolution, the core importance of mathematical approaches is obvious.

Maybe this comes as a surprise to some molecularly minded biologists; I doubt those working adjacent to a field called ”biophysics” or trying to understand what on Earth a ”t-distributed stochastic neighbor embedding” does to turn single-cell sequences into colourful blobs will have missed that there are quantitative aspects to genetics.

Anyways, Sturtevant & Beadle (and Charlesworth) are thinking of another kind of quantitation: they don’t just mean that maths is useful to geneticists; they think of genetics as a particular kind of abstract science with its own concepts. It’s the distinction between viewing genetics as chemistry and genetics as symbols. In this vein, Charlesworth makes the distinction between statistical estimation and mathematical modelling in genetics, and goes on to give examples of the latter through an anecdotal history of models of genetic variation, eventually going deeper into linkage disequilibrium. It’s a fun read, but it doesn’t really live up to the title by spelling out actual arguments for mathematical models, other than the observation that they have been useful in population genetics.

The hypothetical recurring reader will know this blog’s position on theory in genetics: it is useful, not just for theoreticians. Consequently, I agree with Charlesworth that formal modelling in genetics is a good thing, and that there is (and ought to be more of) constructive interplay between data and theory. I like that he suggests that mathematical models don’t even have to be that sophisticated to be useful; even if you’re not a mathematician, you can sometimes improve your understanding by doing some sums. He then takes that back a little by telling a joke about how John Maynard Smith’s paper on hitch-hiking was so difficult that only two researchers in the country could be smart enough to understand it. The point still stands. I would add that this applies to even simpler models than I suspect that Charlesworth had in mind. Speaking from experience, a few pseudo-random draws from a binomial distribution can sometimes clear your head about a genetic phenomenon, and while this probably won’t amount to any great advances in the field, it might save you days of fruitless faffing.
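
To make the binomial draws concrete, here is a minimal sketch of the kind of sum I mean, using genetic drift as the example. The choice of phenomenon, the function names, the population size of 50 and the 200 replicates are all my own illustrative assumptions, not anything taken from Charlesworth's essay. Each generation, the number of copies of an allele is drawn from a binomial distribution with the current allele frequency as the success probability.

```python
import random


def binomial_draw(n, p, rng):
    """Number of successes in n independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(n))


def simulate_drift(pop_size, start_freq, generations, rng):
    """Allele frequency trajectory under pure drift in a diploid population.

    Hypothetical helper for illustration: no selection, mutation or migration.
    """
    n_copies = 2 * pop_size  # diploid: two allele copies per individual
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Next generation's allele count is one binomial draw
        freq = binomial_draw(n_copies, freq, rng) / n_copies
        trajectory.append(freq)
    return trajectory


rng = random.Random(1)  # fixed seed so the run is repeatable
replicates = [simulate_drift(50, 0.5, 100, rng) for _ in range(200)]
final = [trajectory[-1] for trajectory in replicates]

# Count replicates where the allele was either lost (0.0) or fixed (1.0)
n_fixed_or_lost = sum(f in (0.0, 1.0) for f in final)
print(f"{n_fixed_or_lost} of {len(final)} replicates lost or fixed the allele")
```

Changing `pop_size` and rerunning is a quick way to get a feeling for how much faster small populations lose variation, which is exactly the sort of head-clearing exercise I have in mind.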

As it happens, I also recently read this paper (Robinaugh et al. 2020) about the value of formal theory in psychology, and in many ways, it makes explicit some things that Charlesworth’s essay doesn’t spell out, but I think implies: We want our scientific theories to explain ”robust, generalisable features of the world” and represent the components of the world that give rise to those phenomena. Formal models, expressed in precise languages like maths and computational models, are preferable to verbal models, which express the structure of a theory in words, because these precise languages make it easier to deduce what behaviour of the target system the model implies. Charlesworth and Robinaugh et al. don’t perfectly agree. For one thing, Robinaugh et al. seem to suggest that a good formal model should be able to generate fake data that can be compared to empirical data summaries and give explanations of computational models, while Charlesworth seems to view simulation as an approximation one sometimes has to resort to.

However, something that occurred to me while reading Charlesworth’s essay was the negative framing of why theory is useful. This is how Charlesworth recommends mathematical modelling in population genetic theory, by approvingly repeating this James Crow quote:

I hope to have provided evidence that the mathematical modelling of population genetic processes is crucial for a proper understanding of how evolution works, although there is of course much scope for intuition and verbal arguments when carefully handled (The Genetical Theory of Natural Selection is full of examples of these). There are many situations in which biological complexity means that detailed population genetic models are intractable, and where we have to resort to computer simulations, or approximate representations of the evolutionary process such as game theory to produce useful results, but these are based on the same underlying principles. Over the past 20 years or so, the field has moved steadily away from modelling evolutionary processes to developing statistical tools for estimating relevant parameters from large datasets (see Walsh and Lynch 2017 for a comprehensive review). Nonetheless, there is still plenty of work to be done on improving our understanding of the properties of the basic processes of evolution.

The late, greatly loved, James Crow used to say that he had no objection to graduate students in his department not taking his course on population genetics, but that he would like them to sign a statement that they would not make any pronouncements about evolution. There are still many papers published with confused ideas about evolution, suggesting that we need a ‘Crow’s Law’, requiring authors who discuss evolution to have acquired a knowledge of basic population genetics.

This is one of the things I prefer about Robinaugh et al.’s account: To them, theory is not mainly about clearing up confusion and wrongness, but about developing ideas by checking their consistency with data, and exploring how they can be modified to be less wrong. And when we follow Charlesworth’s anecdotal history of linked selection, it can be read as sketching a similar path. It’s not a story about some people knowing ”basic population genetics” and being in the right, and others not knowing it and being confused (even if that surely happens also); it’s about a refinement of models in the face of data, and probably vice versa.

If you listen to someone talking about music theory, or literary theory, they will often defend themselves against the charge that theory drains their domain of joy and creativity. Instead, they will argue that theory helps you appreciate the richness of music, and gives you tools to invent new and interesting music. You stay ignorant of theory at your own peril, not because you risk doing things wrong, but because you risk doing uninteresting rehashes, not even knowing what you’re missing. Or something like that. Adam Neely (”Why you should learn music theory”, YouTube video) said it better. Now, the analogy is not perfect, because the relationship between empirical data and theory in genetics is such that the theory really does try to say true or false things about genetics in a way that music theory (at least as practised by music theory YouTubers) does not. I still think there is something to be said for theory as a tool for creativity and enjoyment in genetics.


Charlesworth, B. (2019). In defence of doing sums in genetics. Heredity, 123(1), 44-49.

Robinaugh, D., Haslbeck, J., Ryan, O., Fried, E. I., & Waldorp, L. (2020). Invisible hands and fine calipers: A call to use formal theory as a toolkit for theory construction. Paper has since been published in a journal, but I read the preprint.