A plot of genes on chromosomes

Marta Cifuentes and Wayne Crismani asked on Twitter if there is a web tool similar to the Arabidopsis Chromosome Map Tool that makes figures of genes on chromosomes for humans. This will not really be an answer to the question — not a web tool, not conveniently packaged — but I thought that would be a nice plot to make in R with ggplot2. We will use the ggrepel package to help with labelling, and get information from the Ensembl REST API with httr and jsonlite.

The plot and the final code to generate it

Below I will go through the functions that get us there, but here is the end product. The code is on GitHub as usual.

## Some Ensembl genes to test with

ensembl_genes <- c("ENSG00000125845", ## BMP2
"ENSG00000181690", ## PLAG1
"ENSG00000177508", ## IRX3
"ENSG00000140718") ## FTO

chr_sizes <- get_chromosome_sizes_from_ensembl(species = "homo_sapiens")

coords <- get_coordinates_from_ensembl(ensembl_genes)

plot_genes_test <- plot_genes(coords,
chr_sizes)


We will use Ensembl and access the genes via Ensembl Gene IDs. Here, I’ve looked up the Ensembl Gene IDs for four genes I like (in humans).

We need to know how long human chromosomes are in order to plot them, so we have a function for that; we also need to get coordinates for the genes, and we have a function for that. They are both below. These functions call up the Ensembl REST API to get the data from the Ensembl database.

Finally, there is a plotting function that takes the coordinates and the chromosome sizes as input and return a ggplot2 plot.

Getting the data out of the Ensembl REST API

Now, starting from the top, we will need to define those functions to interact with the Ensembl REST API. This marvellous machine allows us to get data out of the Ensembl database over HTTP, writing our questions in the URL. It is nicely described with examples from multiple languages on the Ensembl REST API website.

An alternative to using the REST API would be to download gene locations from BioMart. This was my first thought. BioMart is more familiar to me than the REST API, and it also has the nice benefit that it is easy to download every gene and store it away for the future. However, there isn’t a nice way to get chromosome lengths from BioMart, so we would have to download them from somewhere else. This is isn’t hard, but I thought using the REST API for both tasks seemed nicer.

## Plot showing the location of a few genomes on chromosomes

library(httr)
library(jsonlite)
library(ggplot2)
library(ggrepel)
library(purrr)

## Get an endpoint from the Ensembl REST api and return parsed JSON

get_from_rest_api <- function(endpoint_string,
server = "https://rest.ensembl.org/") {
rest <- GET(paste(server, endpoint_string, sep = ""),
content_type("application/json"))

stop_for_status(rest)

fromJSON(toJSON(content(rest)))
}


This first function gets content by sending a request, checking whether it worked (and stopping with an error if it didn’t), and then unpacking the content into an R object.

## Get chromosomes sizes from the Ensembl REST API

get_chromosome_sizes_from_ensembl <- function(species) {

json <- get_from_rest_api(paste("info/assembly/", species, sep = ""))

data.frame(name = as.character(json$top_level_region$name),
length = as.numeric(json$top_level_region$length),
stringsAsFactors = FALSE)
}


This second function asks for the genome assembly information for a particular species with the GET info/assembly/:species endpoint, and extracts the chromosome lengths into a data frame. The first part of data gathering is done, now we just need the coordinates fort the genes of interest.

## Get coordinates from Ensembl ids

get_coordinates_from_ensembl <- function(ensembl_ids) {

map_dfr(ensembl_ids,
function(ei) {
json <- get_from_rest_api(paste("lookup/id/", ei, sep = ""))

data.frame(position = (json$start + json$end)/2,
chr = json$seq_region_name, display_name = json$display_name,
stringsAsFactors = FALSE)
})
}


This function asks for the gene information for each gene ID we’ve given it with the GET lookup/id/:id endpoint, and extracts the rough position (mean of start and end coordinate), chromosome name, and the ”display name”, which in the human case will be a gene symbol. (For genes that don’t have a gene symbol, we would need to set up this column ourselves.)

At this point, we have the data we need in two data frames. That means it’s time to make the plot.

Plotting code

We will build a plot with two layers: first the chromosomes (as a geom_linerange) and then the gene locations (as a geom_text_repel from the ggrepel package). The text layer will move the labels around so that they don’t overlap even when the genes are close to each other, and by setting the nudge_x argument we can move them to the side of the chromosomes.

Apart from that, we change the scale to set he order of chromosomes and reverse the scale of the y-axis so that chromosomes start at the top of the plot.

The function returns a ggplot2 plot object, so one can do some further customisation after the fact — but for some features one would have to re-write things inside the function.

plot_genes <- function(coordinates,
chromosome_sizes) {

## Restrict to chromosomes that are in data
chrs_in_data <-
chromosome_sizes[chromosome_sizes$name %in% coordinates$chr,]
chr_order <- order(as.numeric(chrs_in_data$name)) ggplot() + geom_linerange(aes(x = name, ymin = 1, ymax = length/1e6), size = 2, colour = "grey", data = chrs_in_data) + geom_text_repel(aes(x = chr, y = position/1e6, label = display_name), nudge_x = 0.33, data = coordinates) + scale_y_reverse() + ## Fix ordering of chromosomes on x-axis scale_x_discrete(limits = chrs_in_data$name[chr_order],
labels = chrs_in_data$name[chr_order]) + theme_bw() + theme(panel.grid = element_blank()) + xlab("Chromosome") + ylab("Position (Mbp)") }  Possible extensions One feature from the Arabidopsis inspiration that is missing here is the position of centromeres. We should be able to use the option ?bands=1 in the GET info/assembly/:species to get cytogenetic band information and separate p and q arms of chromosomes. This will not be universal though, i.e. not available for most species I care about. Except to make cartoons of gene positions, I think this might be a nice way to make plots of genome regions with very course resolution, i.e. linkage mapping results, where one could add lines to show genomic confidence intervals, for example. Journal club of one: ”A unifying concept of animal breeding programmes” The developers of the MoBPS breeding programme simulators have published three papers about it over the last years: one about the MoBPS R package (Pook et al. 2020), one about their web server MoBPSweb (Pook et al. 2021), and one that discusses the logic of the specification for breeding programmes they use in the web interface (Simianer et al. 2021). The idea is that this specification can be used to describe breeding programmes in a precise and interoperable manner. The latter — about breeding programme specification — reads as if the authors had a jolly good time thinking about what breeding and breeding programmes are; at least, I had such feelings while reading it. It is accompanied by an editorial by Simianer (2021) talking about what he thinks are the important research directions for animal breeding research. Using simulation to aid breeding programme design is one of them. Defining and specifying breeding programmes After defining a breeding programme in a narrow genetic sense as a process achieving genetic change (more about that later), Simianer et al. (2021) go on to define a specification of such a breeding programme — or more precisely, of a model of a breeding programme. They use the word ”definition” for both things, but they really talk about two different things: defining what ”a breeding programme” is and specifying a particular model of a particular breeding programme. Here, I think it is helpful to think of Guest & Martin’s (2020) distinction, borrowed from psychology, between a specification of a model and an implementation of a model. A specification is a description of a model based in natural language or math. Breeding programme modelling often takes the shape of software packages, where the software implementation is the model and a specification is quite vague. Simianer et al. (2021) can be seen as a step towards a way of writing specifications, on a higher level in Guest & Martin’s hierarchy, of what such a breeding programme model should achieve. They argue that such a specification (”formal description”) needs to be comprehensive, unambiguous and reproducible. They claim that this can be achieved with two parts: the breeding environment and the breeding structure. The breeding environment includes: • founder populations, • quantitative genetic parameters for traits, • genetic architectures for traits, • economic values for traits in breeding goal, • description of genomic information, • description of breeding value estimation methods. Thus, the ”formal” specification depends on a lot of information that is either unknowable in practice (genetic architecture), estimated with error (genetic parameters), hard to describe other than qualitatively (founder population) and dependent on particular software implementations and procedures (breeding value estimation). This illustrates the need to distinguish the map from the territory — to the extent that the specification is exact, it describes a model of a breeding programme, not a real breeding programme. The other part of their specification is the graph-based model of breeding structure. I think this is their key new idea. The breeding structure, in their specification, consists of nodes that represent groups of elementary objects (I would say ”populations”) and edges that represent transformations that create new populations (such as selection or mating) or are a shorthand for repeating groups of edges and nodes. The elementary objects could be individuals, but they also give the example of gametes and genes (I presume they mean in the sense of alleles) as possible elementary objects. One could also imagine groups of genetically identical individuals (”genotypes” in a plant breeding sense). Nodes contain a given number of individuals, and can also have a duration. Edges are directed, and correspond to processes such as ageing, selection, reproduction, splitting or merging populations. They will carry attributes related to the transformation. Edges can have a time associated with them that it takes for the transformation to happen (e.g. for animals to gestate or grow to a particular age). Here is an example from Pook et al. (2021) of a breeding structure graph for dairy cattle: If we ignore the red edges for now, we can follow the flow of reproduction (yellow edges) and selection (green edges): The part on the left is what is going on in the breeding company: cows (BC-Cows) reproduce with selected bulls (BC-SelectedBulls), and their offspring become the next generation of breeding company cows and bulls (BC-NextCows and BC-NextBulls). On the right is the operation of a farm, where semen from the breeding company is used to inseminate cows and heifers (heifer, cow-L1, cow-L2, cow-L3) to produce calfs (calf-h, calf-L1, calf-L2, calf-L3). Each cycle each of these groups, through a selection operation, give rise to the next group (heifers becoming cows, first lactation cows becoming second lactation cows etc). Breeding loops vs breeding graphs vs breeding forms Except for the edges that specify breeding operations, there is also a special meta-edge type, the repeat edge, that is used to simplify breeding graphs with repeated operations. A useful edge class to describe breeding programmes that are composed of several breeding cycles is ”repeat.” It can be used to copy resulting nodes from one breeding cycle into the nodes of origin of the next cycle, assuming that exactly the same breeding activities are to be repeated in each cycle. The “repeat” edge has the attribute “number of repeats” which allows to determine the desired number of cycles. In the MoBPSweb paper (Pook et al. 2020), they describe how it is implemented in MoBPS: Given a breeding specification in MoBPSweb JSON format, the simulator will generate a directed graph by copying the nodes on the breeding cycle as many times as is specified by the repeat number. In this way, repeat edges are eliminated to make the breeding graph acyclic. The conversion of the breeding scheme itself is done by first detecting if the breeding scheme has any “Repeat” edges (Simianer et al. 2020), which are used to indicate that a given part of the breeding programme is carried out multiple times (breeding cycles). If that is the case, it will subsequently check which nodes can be generated without the use of any repeat. Next, all repeats that can be executed based on the already available nodes are executed by generating copies of all nodes between the node of origin and the target node of the repeat (including the node of origin and excluding the target node). Nodes generated via repeat are serial-numbered via “_1,” “_2” etc. to indicate the repeat number. This procedure is repeated until all repeat edges are resolved, leading to a breeding programme without any repeats remaining. There are at least three ways to specify the breeding structures for breeding programme simulations: these breeding graphs, breeding loops (or more generally, specifying breeding in a programming language) and breeding forms (when you get a pre-defined breeding structure and are allowed fill in the numbers). If I’m going to compare the graph specification to what I’m more familiar with, this is how you would create a breeding structure in AlphaSimR: library(AlphaSimR) ## Breeding environment founderpop <- runMacs(nInd = 100, nChr = 20) simparam <- SimParam$new(founderpop)
simparam$setSexes("yes_sys") simparam$addTraitA(nQtlPerChr = 100)
ymax <- max(data$response) plot_points_ga <- ggplot() + geom_quasirandom(aes(x = factor(group), y = response), data = data) + xlab("Group") + ylab("Response") + theme_bw() + scale_y_continuous(limits = c(ymin, ymax)) height_of_plot <- ymax-ymin group0_fraction <- (coef(model)[1] - ymin)/height_of_plot diff_min <- - height_of_plot * group0_fraction diff_max <- (1 - group0_fraction) * height_of_plot plot_difference_ga <- ggplot() + geom_pointrange(aes(x = term, y = estimate, ymin = estimate - 2 * std.error, ymax = estimate + 2 * std.error), data = result[2,]) + scale_y_continuous(limits = c(diff_min, diff_max)) + ylab("Difference") + xlab("Comparison") + theme_bw() (plot_points_ga | plot_difference_ga) + plot_layout(widths = c(0.75, 0.25))  Literature Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2019). Moving beyond P values: data analysis with estimation graphics. Nature methods. Reflektioner om högskolepedagogik: undervisning online Den senaste i serien högskolepedagogiska kurser var en workshop om e-lärande, alltså undervisning online. Det är användbart både för genuina distanskurser, för kurser som behöver hållas på distans i nödfall därför att det råkar vara pandemi, och för vilka kurser som helst — eftersom varje kurs med självaktning har ett inslag av online-aktiviteter numera. Vi använder ju alltid en digital lärplattform till att dela material, hantera inlämningar, meddelanden och diskussioner, även om kursen också har klassrumsaktiviteter. Jag har hittils inte behövt spela in några föreläsningar eller demonstrationer, men nu är jag bättre förberedd ifall det skulle behövas. En fördel med att behöva lyssna på mig själv alldeles för noggrant för att kunna skriva textningen: Jag insåg att jag, i min nedkortade föreläsning, gav en ganska torftig förklaring av ett visst genetiskt koncept (för insatta: ja, det var naturligtvis kopplingsojämvikt). Om jag skulle använda den i faktisk undervisning måste jag spela in den delen igen med en bättre förklaring. Den automatiska textningen har det inte lätt med min engelska kombinerad med genetisk terminologi. Jag minns inte vad jag sa här, men det var något helt annat. Annars var det mest intressanta att prata (och i någon mån klaga) med andra deltagare om det senaste årets distansundervisning. Ett återkommande klagomål under det gångna året är hur trist det är att föreläsa för en skärm, jämfört med att göra det inför en publik i ett rum. Jag håller med: Det är både tråkigare och mer stressande att prata till en skärm, och det blir knappast bättre när det är en inspelning på gång. Men å andra sidan, varför är det så viktigt att titta på åhörarnas ansikten? Vet vi att studenterna hänger med för att de ser ut att hänga med — eller att de inte hänger med för att de ser ut att uttråkat titta ut i rymden? Nej, såklart inte. Det är klart att det är trevligt för mig att se dem jag talar till, men jag har svårt att se att det ger mig någon användbar information om vad de förstår och inte, om jag inte frågar dem. Och på motsatt håll: är det viktigt att studenterna ser mitt ansikte? Även det omvända klassrummet, i all sin påstådda radikalitet, verkar en smula fixerat vid föreläsningar. Å ena sidan känns det seriöst med en videoföreläsning, i alla fall om den inte är för tafflig — att jag tagit tiden och ansträngningen att samla ihop och spela in materialet. Och det finns ett visst underhållningsvärde, som inte är att förakta, i att läraren visar sitt ansikte och har ett personligt tilltal. Å andra sidan, all kritik som finns mot föreläsningsformen (med undantaget att man inte kan spola tillbaka och se den igen) kan riktas mot den förinspelade föreläsningen. Den eventuella lilla interaktivitet som finns i en live-föreläsning försvinner också. Det viktiga är att studenterna lär sig så bra som möjligt, och frågan är om de blir bättre föreberedda för en lektion eller ett seminarium av att få en inspelad föreläsning än de skulle bli av få läsanvisningar eller en förberedelseuppgift. Journal club of one: ”Genome-wide enhancer maps link risk variants to disease genes” (Here is a a paper from a recent journal club.) Nasser et al. (2021) present a way to prioritise potential noncoding causative variants. This is part of solving the fine mapping problem, i.e. how to find the underlying causative variants from genetic mapping studies. They do it by connecting regulatory elements to genes and overlapping those regulatory elements with variants from statistical fine-mapping. Intuitively, it might not seem like connecting regulatory elements to genes should be that hard, but it is a tricky problem. Enhancers — because that is the regulatory element most of this is about; silencers and insulators get less attention — do not carry any sequence that tells us what gene they will act on. Instead, this needs to be measured or predicted somehow. These authors go the measurement route, combining chromatin sequencing with chromosome conformation capture. This figure from the supplementary materials show what the method is about: Additional figure 1 from Nasser et al. (2021) showing an overview of the workflow and an example of two sets of candidate variants derived from-fine mapping, each with variants that overlap enhancers connected to IL10. They use chromatin sequence data (such as ATAC-seq, histone ChIP-seq or DNAse-seq) in combination with HiC chromosome conformation capture data to identify enhancers and connect them to genes (this was developed earlier in Fulco et al. 2019). The ”activity-by-contact model” means to multiply the contact frequency (from HiC) between promoter and enhancer with the enhancer activity (read counts from chromatin sequencing), standardised by the total contact–activity product with all elements within a window. Fulco et al. (2019) previously found that this conceptually simple model did well at connecting enhancers to genes (as evaluated with a CRISPR inhibition assay called CRISPRi-FlowFISH, which we’re not going into now, but it’s pretty ingenious). In order to use this for fine-mapping, they calculated activity-by-contact maps for every gene combined with every open chromatin element within 5 Mbp for 131 samples from ENCODE and other sources. The HiC data were averaged of contacts in ten cell types, transformed to be follow a power-law distribution. That is, they did not do the HiC analysis per cell type, but use the same average HiC contact matrix combined with chromatin data from different cell types. Thus, the specificity comes from the promoters and enhancers that are called as active in each sample — I assume this is because the HiC data comes from a different (and smaller) set of cell types than the chromatin sequencing. Element–gene pairs that reached above a threshold were considered connected, for a total of about six million connections, involving 23,000 genes and 270,000 enhancers. On average, a gene had 2.8 enhancers and an enhancer connected to 2.7 genes. They picked putative causative variants by overlapping the variant sets with these activity-by-contact maps and selecting the highest scoring enhancer gene pair.They used fine-mapping results from multiple previous studies. These variants were estimated with different methods, but they are all some flavour of fine-mapping by variable selection. Statistical fine mapping estimate sets of variants, called credible sets, that have high posterior probability of being the causative variant. They included only completely noncoding credible sets, i.e. those that did not include a coding sequence or splice variant. They applied this to 72 traits in humans, generating predictions for ~ 5000 noncoding credible sets of variants. Did it work? Variants for fine-mapping were enriched in connected enhancers more than in open chromatin in general, in cell types that are relevant to the traits. In particular, inflammatory bowel disease variants were enriched in enhancers in 65 samples, including immune cell types and gut cells. The strongest enrichment was in activated dendritic cells. They used a set of genes previously known to be involved in inflammatory bowel disease, assuming that they were the true causative genes for their respective noncoding credible sets, and then compared their activity-by-contact based prioritisation of the causative gene to simply picking the closest gene. Picking the closest gene was right in 30 out of 37 sets. Picking the gene with the highest activity-by-contact score was right in 17 cases out of 18 sets that overlapped an activity-by-contact enhancer. Thus, this method had higher precision but worse recall. They also tested a number of eQTL-based, enrichment and enhancer–gene mapping methods, that did worse. What it tells us about causative variants Most of the putative causative variants, picked based on maximising activity-by-contact, were close to the proposed gene (median 13 kbp) and most involved the closest gene (77%). They were often found their putative causative variants to be located in enhancers that were only active in a few cell- or tissue types (median 4), compared to the promoters of the target genes, that were active in a broader set (median 120). For example, in inflammatory bowel disease, there were several examples where the putatively causal enhancer was only active in a particular immune cell or a stimulated state of an immune cell. My thoughts What I like about this model is that it is so different to many of the integrative and machine learning methods we see in genomics. It uses two kinds of data, and relatively simple computations. There is no machine learning. There is no sequence evolution or conservation analysis. Instead, measure two relevant quantities, standardise and preprocess them a bit, and then multiply them together. If the success of the activity-by-contact model for prioritising causal enhancers generalises beyond the 18 causative genes investigated in the original paper, this is an argument for simple biology-based heuristics over machine learning models. It also suggest that, in the absence of contact data, one might do well by prioritising variant in enhancers that are highly active in relevant cell types, and picking the closest gene as the proposed causative gene. However, the dataset needs to cover the relevant cell types, and possibly cells that are in the relevant stimulated states, meaning that it provides a motivation for rich conditional atlas-style datasets of chromatin and chromosome conformation. I am, personally, a little bit sad that expression QTL methods seem to doing poorly. On the other hand, it makes some sense that eQTL might not have the genomic resolution to connect enhancers to genes. Finally, if the relatively simple activity-by-contact model or the ridiculously simple method of ”picking the closest gene” beats machine learning models using the same data types and more, it suggests that the machine learning methods might not be solving theright problem. After all, they are not trained directly to prioritise variants for complex traits — because there are too few known causative variants for complex traits. Literature Fulco, C. P., Nasser, J., Jones, T. R., Munson, G., Bergman, D. T., Subramanian, V., … & Engreitz, J. M. (2019). Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nature genetics. Nasser, J., Bergman, D. T., Fulco, C. P., Guckelberger, P., Doughty, B. R., Patwardhan, T. A., … & Engreitz, J. M. (2021). Genome-wide enhancer maps link risk variants to disease genes. Nature. Researchers in ecology and evolution don’t use Platt’s strong inference, and that’s okay This paper (Betts et al. 2021) came out about a month ago investigating whether ecology and evolution papers explicitly state mechanistic hypotheses, and arguing that they ought to, preferably multiple alternative hypotheses. It advocates the particular flavour of hypothetico–deductivism expressed by Platt (1964) as ”strong inference”. The key idea in Platt’s (1964) account of strong inference, that distinguishes it from garden variety accounts of scientific reasoning, is his emphasis on multiple alternative hypotheses and experiments that distinguish between them. He describes science progressing like a decision tree, with experiments as branching points — a ”conditional inductive tree”. He also emphasises theory construction, as he approvingly quotes biologists on the need to think hard about what possibilities there are in order to make the most informative experiments. The empirical part of Betts et al. (2021) consists of a literature survey, where the authors read 268 empirical articles from ecology, evolution and glam journals published 1991-2018 to look whether they explicitly stated hypotheses (that is, proposed explanations or causes, regardless of whether they used the actual word ”hypothesis”), whether these were mechanistic, and whether there were multiple working hypotheses contrasted against each other. They estimated the slope over time, and the association between hypothesis use and journal impact factor and whether the research was funded by a major grant. The results suggest that papers with explicit hypotheses are in a minority, that there was no significant change over time, and little association with impact factor or grants. The prevalence of mechanistic hypotheses was 26% and of multiple working hypotheses 6.7%. There were no significant time trends in hypothesis use. There was a significant difference in journal impact factor in one of the comparisons, where papers with mechanistic hypotheses were published in journals with 0.3 higher impact factor on average. There was no association with grants. The authors go on to discuss how strong inference is still useful both to the scientific community and to individual researchers, arguing that they might not get more grants or fancier papers, but they will feel better and their research will be of higher quality. How to interpret the lack of clear increase or decrease over time depends on one’s level of optimism I guess. An optimistic take could be that the authors’ fear that machine learning and large datasets are turning researchers away from explanation seem not to be a major concern. A pessimistic take could be, like the they suggest in the Discussion, that decades of admonitions to do hypothetico–deductive science have not had much effect. Thinking about causality is a good thing I wholeheartedly agree with the authors that thinking about explanations, causality and mechanism is a useful thing to do, and probably something we ought to do more. It is probably useful to spend more time than we do (for me, to spend more time than I do) thinking about how theories map to testable hypotheses, how those hypotheses map to quantities that can be estimated, and how well the methods and data at hand manage to perform that estimation. Some of my best lessons from science over the last years have come from that sort of thing. I also agree with them that causality is often what scientists are after — even in many cases where we think that the goal is prediction, the most trustworthy explanation for any ability of a prediction model to generalise is going to be an explanation in terms of mechanisms. They don’t go into this too much, but the caption of Figure 1 gives an example of how even when we are interested in prediction, explanations can be handy. To take an example from my field: genomic prediction, when we fit statistical models to DNA data to predict trait values for breeding, seems like a pure prediction problem. And animal breeders are pragmatic enough to use anything that worked; if tea leaves worked well for breeding value prediction, they would use them (I am sure I have heard or read some animal breeding researcher make that joke, but I can’t find the source now). But why don’t tea leaves work, while single nucleotide markers spaced somewhat evenly across the genome do? Because we have a fairly well established theory for how genetic variants cause trait variation between individuals in a fairly predictable way. That doesn’t automatically mean that the statistical associations and predictions will transfer between situations — in fact they don’t. But there is theory that helps explain why genomic predictions generalise more or less well. I also like that they, when they define what a hypothesis is (a proposal of a mechanism or cause of a phenomenon), make very clear that statistical hypotheses and null hypotheses don’t count as scientific hypotheses. There is more to explore here about the relationship between statistical inference and scientific hypotheses, and about the rhetorical move to declare something the null or default model, but that is for another day. If scientists don’t use strong inference, maybe the problem isn’t with the scientists Given the mostly negative results, the discussion starts as follows: Overall, the prevalence of hypothesis use in the ecological and evolutionary literature is strikingly low and has been so for the past 25 years despite repeated calls to reverse this pattern […]. Why is this the case? They don’t really have an answer to this question. They consider whether most work is descriptive fact finding, or purely about making prediction models, but argue that it is unlikely that 75% of ecology and evolution research is like that — and I agree. They consider a lack of individual incentives for formulating hypotheses, and that might be true; there was no striking association between hypotheses, grants or glamorous publications (unless you consider 0.3 journal impact factor units a compelling individual-level incentive). They suggest that there are costs to hypothesising — it ”an feel like a daunting hurdle”. However, they do not consider the option that their proposed model of science isn’t actually a useful method. To think about that, we should discuss some of the criticisms of strong inference. O’Donohue & Buchanan (2001) criticise the strong inference model by arguing that there are problems with each step of the method, and that the history of science anecdotes that Platt use to illustrate it actually show little evidence of being based on strong inference. Specifically, Platt’s first step, devising alternative hypotheses, is problematic both because one might lack the background knowledge to devise many alternative hypotheses, and that there is no sure way to enumerate the plausible alternative hypotheses. The second step, devicing crucial experiments, is problematic because of the Duhem–Quine problem, namely that experiments are never conclusive; even when the data are inconsistent with a hypothesis, we do not know whether the problem is with the hypothesis or with any number of, sometimes implicit, auxiliary assumptions. (By the way, I love that Betts et al. cite two ecologists called Quinn and Dunham (1983) who wrote about problems with conclusively testing hypotheses in ecology and evolution. I wish they got together to write it just because the names are so perfect for the topic.) The third step, conducting crucial experiments, is problematic because experimental results may not cleanly separate hypotheses. Then again, would Platt not just reply that one ought to devise a better experiment then? This objection seems weak. Science is hard and it seems perfectly possible that there are lots of plausible alternative hypotheses that can’t be told apart, at least with data that can be realistically gathered. Finally, O’Donohue & Buchanan (2001) go through some of Platt’s examples given of supposed strong inference, and suggest that Platt did not represent them accurately. And Platt’s paper really reads as a series of hero-worshipping anecdotes about great scientists, who were very successful and therefore must have employed strong inference. It is not convincing history of science. Bett et al. (2021) instead give two examples of science that they suppose could have been helped by strong inference. The first example is Lamarck who is supposed to have been able to possibly come up with evolution by natural selection if he had entertained multiple working hypotheses. The second is psychologist Amy Cuddy’s power pose work which supposedly could have been more reproducible had it considered more causal mechanisms. They give no analysis of Lamarck’s scientific method or argument for how strong inference might have helped him. The evidence that strong inference could have helped Amy Cuddy is that she said in an interview that she should have considered the psychological mechanisms behind power posing more. The claim, inherited from Platt, that multiple working hypotheses reduce confirmation bias really cries out for evidence. As far as I can tell, neither Platt nor Betts et al. provide any, beyond the intuition that you get less attached to one hypothesis if you entertain more than one. That doesn’t seem unreasonable to me, but it just shoves the problem to the next step. Now I have several plausible hypotheses, and I need to decide on one of them, that will advance my decision tree of experiments to the next branching point and provide the headline result for my next paper. That choice seem to me to be just as ripe for confirmation bias and perverse incentives than the choice to call the result for or against a particular hypothesis. In cases where there are only two hypotheses that are taken to be mutually exclusive, the distinction seems only rhetorical. How Betts et al. (2021) themselves use hypotheses Let us look at how Betts et al. (2021) themselves use hypotheses and whether they successfully use strong inference for the empirical part of the paper. That the abstract states two hypotheses — that the number of papers with explicit hypotheses could have decreased because of a perceived rise in descriptive big data research; that explicit hypotheses could have increased because of hypotheses being promoted by journals and funders — none of which turn out to be consistent with the data, which shows a steady low prevalence of explicitly stated hypotheses. One should note that in no way are these two mechanistic accounts mutually exclusive. If the slope of the line had been positive, that would have no logical force to compel us to believe that the rise of machine learning in research did not lead some researchers to abandon hypothesis-driven research — at most, we could conclude that the quantitative effect of accounts that promote and discourage explicit hypotheses balance towards the former. Thus, we see two of the objections to Platt’s strong inference paradigm in action: the set of alternative hypotheses is by no means covering the whole range of possibilities, and the study in question is not a conclusive test that allows us to exclude any of them. In the second set of analyses, measuring whether explicitly stating a hypothesis was associated with journal ranking, citations, or funding, the authors predict that hypotheses ought to be associated with these things if they confer academic success. This conform to their ”if–then” pattern for a research hypothesis, so presumably it is a hypothesis. In this case, there is no alternative hypothesis. This illustrates a third problem with Platt’s strong inference, namely that it is seldom actually applied in real research, even by its proponents, presumably because it is difficult to do so. If we look at these two sets of analyses (considering change over time in explicit hypothesis use and association between hypothesis use and individual-researcher incentives) and the main message of the paper, which is that strong inference is useful and needs to be encouraged, there is a disconnect. The two sets of hypotheses, whether they are examples of strong inference or not, do not in any way test the theory that strong inference is a useful scientific method, or the normative claim that it therefore should be incentivised — rather, they illustrate them. We can ask Platt’s diagnostic question from the 1964 paper about the idea that strong inference is a method that needs to be encouraged — what would disprove this view? Some kind of data, surely, but nothing that was analysed in this paper. I hypothesise that this is common in scientific papers. A lot of the reasoning goes on at a higher level than the hypothesis — theories, frameworks, normative stances — and the whether individual hypotheses stand and fall have little bearing on these larger structures. This is not necessarily bad or unscientific, even if it does not conform to Platt’s strong inference. Method angst Finally, the paper starts out with a strange anecdote: the claim that there is in the beginning of most scientists’ careers a period of ”hypothesis angst” where the student questions the hypothetico–deductive method. This is stated without evidence, and without following through on the cliff-hanger by explaining how the angst resolves. How are early career scientists convinced to come back into the fold? The anecdote becomes even stranger once you realise that, according to their data, explicit hypothesis use isn’t very common. If most research don’t use explicit hypotheses, it seems more likely that students, who have just sat through courses on scientific method, would feel cognitive dissonance, annoyance or angst over the fact that researcher around them don’t state explicit hypotheses or follow the simple schema of hypothetico–deductivism. Literature Betts, MG, Hadley, AS, Frey, DW, et al. When are hypotheses useful in ecology and evolution?. Ecol Evol. 2021; 00: 1-15. O’Donohue, W., & Buchanan, J. A. (2001). The weaknesses of strong inference. Behavior and philosophy, 1-20. Platt, JR. (1964) Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science 146 (3642), 347-353. A genetic mapping animation in R Cullen Roth posted a beautiful animation of quantitative trait locus mapping on Twitter. It is pretty amazing. I wanted to try to make something similar in R with gganimate. It’s not going to be as beautiful as Roth’s animation, but it will use the same main idea of showing both a test statistic along the genome, and the underlying genotypes and trait values. For example, Roth’s animation has an inset scatterplot that appears above the peak after it’s been reached; to do that I think we would have to go a bit lower-level than gganimate and place our plots ourselves. First, we’ll look at a locus associated with body weight in chickens (with data from Henriksen et al. 2016), and then a simulated example. We will use ggplot2 with gganimate and a magick trick for combining the two animations. Here are some pertinent snippets of the code; as usual, find the whole thing on Github. LOD curve We will use R/qtl for the linkage mapping. We start by loading the data file (Supplementary Dataset from Henriksen et al. 2016). A couple of individuals have missing covariates, so we won’t be able to use them. This piece of code first reads the cross file, and then removes the two offending rows. library(qtl) ## Read cross file cross <- read.cross(format = "csv", file = "41598_2016_BFsrep34031_MOESM83_ESM.csv") cross <- subset(cross, ind = c("-34336", "-34233"))  For nice plotting, let’s restrict ourselves to fully informative markers (that is, the ones that tell the two founder lines of the cross apart). There are some partially informative ones in the dataset too, and R/qtl can get some information out of them thanks to genotype probability calculations with its Hidden Markov Model. They don’t make for nice scatterplots though. This piece of code extracts the genotypes and identifies informative markers as the ones that only have genotypes codes 1, 2 or 3 (homozygote, heterozygote and other homozygote), but not 5 and 6, which are used for partially informative markers. ## Get informative markers and combine with phenotypes for plotting geno <- as.data.frame(pull.geno(cross, chr = 1)) geno_values <- lapply(geno, unique) informative <- unlist(lapply(geno_values, function(g) all(g %in% c(1:3, NA)))) geno_informative <- geno[informative]  Now for the actual scan. We run a single QTL scan with covariates (sex, batch that the chickens were reared in, and principal components of genotypes), and pull out the logarithm of the odds (LOD) across chromosome 1. This piece of code first prepares a design matrix of the covariates, and then runs a scan of chromosome 1. ## Prepare covariates pheno <- pull.pheno(cross) covar <- model.matrix(~ sex_number + batch + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10, pheno, na.action = na.exclude)[,-1] scan <- scanone(cross = cross, pheno.col = "weight_212_days", method = "hk", chr = 1, addcovar = covar)  Here is the LOD curve along chromosome 1 that want to animate. The peak is the biggest-effect growth locus in this intercross, known as ”growth1”. With gganimate, animating the points is as easy as adding a transition layer. This piece of code first makes a list of some formatting for our graphics, then extracts the LOD scores from the scan object, and makes the plot. By setting cumulative in transition_manual the animation will add one data point at the time, while keeping the old ones. library(ggplot2) library(gganimate) formatting <- list(theme_bw(base_size = 16), theme(panel.grid = element_blank(), strip.background = element_blank(), legend.position = "none"), scale_colour_manual(values = c("red", "purple", "blue"))) lod <- as.data.frame(scan) lod <- lod[informative,] lod$marker_number <- 1:nrow(lod)

plot_lod <- qplot(x = pos,
y = lod,
data = lod,
geom = c("point", "line")) +
ylab("Logarithm of odds") +
xlab("Position") +
formatting +
transition_manual(marker_number,
cumulative = TRUE)


Plot of the underlying data

We also want a scatterplot of the data. Here what a jittered scatterplot will look like at the peak. The horizontal axes are genotypes (one homozygote, heterozygote in the middle, the other homozygote) and the vertical axis is the body mass in grams. We’ve separated the sexes into small multiples. Whether to give both sexes the same vertical axis or not is a judgement call. The hens weigh a lot less than the roosters, which means that it’s harder to see patterns among them when put on the same axis as the roosters. On the other hand, if we give the sexes different axes, it will hide that difference.

This piece of code builds a combined data frame with informative genotypes and body mass. Then, it makes the above plot for each marker into an animation.

library(tidyr)

## Combined genotypes and weight
geno_informative$id <- pheno$id
geno_informative$w212 <- pheno$weight_212_days
geno_informative$sex <- pheno$sex_number

melted <- pivot_longer(geno_informative,
-c("id", "w212", "sex"))

melted <- na.exclude(melted)

marker_numbers <- data.frame(name = rownames(scan),
marker_number = 1:nrow(scan),
stringsAsFactors = FALSE)

melted <- inner_join(melted, marker_numbers)

## Recode sex to words
melted$sex_char <- ifelse(melted$sex == 1, "male", "female")

plot_scatter <- qplot(x = value,
geom = "jitter",
y = w212,
colour = factor(value),
data = melted) +
facet_wrap(~ factor(sex_char),
ncol = 1) +
xlab("Genotype") +
ylab("Body mass") +
formatting +
transition_manual(marker_number)



Combining the animations

And here is the final animation:

To put the pieces together, we use this magick trick (posted by Matt Crump). That is, animate the plots, one frame for each marker, and then use the R interface for ImageMagick to put them together and write them out.

gif_lod <- animate(plot_lod,
fps = 2,
width = 320,
height = 320,
nframes = sum(informative))

gif_scatter <- animate(plot_scatter,
fps = 2,
width = 320,
height = 320,
nframes = sum(informative))

## Magick trick from Matt Crump

new_gif <- image_append(c(mgif_lod[1], mgif_scatter[1]))
for(i in 2:sum(informative)){
combined <- image_append(c(mgif_lod[i], mgif_scatter[i]))
new_gif <- c(new_gif, combined)
}

image_write(new_gif, path = "out.gif", format = "gif")



Literature

Henriksen, Rie, et al. ”The domesticated brain: genetics of brain mass and brain structure in an avian species.” Scientific reports 6.1 (2016): 1-9.