Using R: plyr to purrr, part 1

Publicerat i computer stuff, data analysis, english av mrtnj

This is the second post about my journey towards writing more modern Tidyverse-style R code; here is the previous one. We will look at the common case of taking subset of data out of a data frame, making some complex R object from them, and then extracting summaries from those objects.

More nostalgia about plyr

I miss the plyr package. Especially ddply, ldply, and dlply, my favourite trio of R functions of all times. Yes, the contemporary Tidyverse package dplyr is fast and neat. And those plyr functions operating on arrays were maybe overkill; I seldom used them. But plyr was so smooth, so beautiful, and — after you’ve bashed your head against it for some time and it changed your mind — so intuitive. The package still exists, but it’s retired, and we shouldn’t keep writing R code like it’s 2009, should we?

I used to like to do something like this: take a data frame, push it through a function that returns some complex object, store those objects in a list, and then map various functions over that list to extract the parts I care about. What is the equivalent in the new idiom? To do the same thing but with the purrr package, of course! purrr replaces the list-centric parts of plyr, while dplyr covers data frame-centric summarisation, mutation and so on.

For this example we will be using the lm function on subsets of data and store the model object. It’s the simple case that everyone reaches for to demonstrate these features, but it’s a bit dubious. If you find yourself fitting a lot of linear models to subsets, maybe there are other models you should think about Especially here, when the fake data just happens to come from a multilevel model with varying intercepts … But in this case, let’s simulate a simple linear regression and look at the coefficients we get out.

set.seed(20210807)

n_groups <- 10
group_sizes <- rpois(n_groups, 10)
n <- sum(group_sizes)

fake_data <- tibble(id = 1:n,
                    group = rep(1:n_groups,
                                times = group_sizes),
                    predictor = runif(n, 0, 1))
group_intercept <- rnorm(n_groups, 0, 1)

fake_data$response <- fake_data$predictor * 10 +
  group_intercept[fake_data$group] +
  rnorm(n)

And here is the plyr code: First, dlply takes us from a data frame, splitting it by group, to a list of linear models. Then, ldply takes us from the list of models to a data frame of coefficients. tidy is a function from the wonderful broom package which extracts the same information as you would get in the rather unwieldy object from summary(lm), but as a data frame.

library(plyr)
library(broom)

fit_model <- function(data) {
  lm(response ~ predictor, data)
}

models <- dlply(fake_data,
                     "group",
                     fit_model)
result <- ldply(models, tidy)

This is what the results might looks like. Notice how ldply adds the split labels nicely into the group column, so we know which rows came from which subset.

   group        term   estimate std.error  statistic      p.value
1      1 (Intercept) -0.2519167 0.5757214 -0.4375670 6.732729e-01
2      1   predictor 10.6136902 1.0051490 10.5593207 5.645878e-06
3      2 (Intercept)  3.1528489 0.6365294  4.9531864 7.878498e-04
4      2   predictor  8.2075766 1.1458702  7.1627452 5.292586e-05
5      3 (Intercept) -0.8103777 0.6901212 -1.1742542 2.786901e-01
...

split/map: The modern synthesis

If we pull out purrr, we can get the exact same table like so. The one difference is that we get a tibble (that is, a contemporary, more well-behaved data frame) out of it instead of a base R data.frame.

library(purrr)

models <- map(split(fake_data,
                    fake_data$group),
                    fit_model)
result <- map_df(models,
                 tidy,
                 .id = "group")

# A tibble: 80 x 6
   group term        estimate std.error statistic  p.value
                            
 1 1     (Intercept)     1.67     0.773      2.16 6.32e- 2
 2 1     predictor       8.67     1.36       6.39 2.12e- 4
 3 2     (Intercept)     4.11     0.566      7.26 4.75e- 5
 4 2     predictor       8.19     1.11       7.36 4.30e- 5
 5 3     (Intercept)    -7.50     0.952     -7.89 9.99e- 5
 6 3     predictor      11.5      1.75       6.60 3.03e- 4
 7 4     (Intercept)   -19.8      0.540    -36.7  7.32e-13
 8 4     predictor      11.5      0.896     12.8  5.90e- 8
 9 5     (Intercept)   -12.4      1.03     -12.0  7.51e- 7
10 5     predictor       9.69     1.82       5.34 4.71e- 4
# … with 70 more rows

First, the base function split lets us break the data into subsets based on the values of a variable, which in this case is our group variable. The output of this function is a list of data frames, one for each group.

Second, we use map to apply a function to each element of that list. The function is the same modelling function that we used above, which shoves the data into lm. We now have our list of linear models.

Third, we apply the tidy function to each element of that list of models. Because we want the result to be one single data frame consolidating the output from each element, we use map_df, which will combine the results for us. (If we’d just use map again, we would get a list of data frames.) The .id argument tells map to add the group column that indicates what element of the list of models each row comes from. We want this to be able to identify the coefficients.

If we want to be fancy, we can express with the Tidyverse-related pipe and dot notation:

library(magrittr)

result <- fake_data %>%
  split(.$group) %>%
  map(fit_model) %>%
  map_df(tidy, .id = "group")

Nesting data into list columns

This (minus the pipes) is where I am at in most of my R code nowadays: split with split, apply with map and combine with map_dfr. This works well and looks neater than the lapply/Reduce solution discussed in part 0.

We can push it a step further, though. Why take the linear model out of the data frame, away from its group labels any potential group-level covariates — isn’t that just inviting some kind of mix-up of the elements? With list columns, we could store the groups in a data frame, keeping the data subsets and any R objects we generate from them together. (See Wickham’s & Grolemund’s R for Data Science for a deeper case study of this idea.)

library(dplyr)
library(tidyr)

fake_data_nested <- nest(group_by(fake_data, group),
                         data = c(id, predictor, response))

fake_data_models <- mutate(fake_data_nested,
                           model = map(data,
                                       fit_model),
                           estimates = map(model,
                                           tidy))

result <- unnest(fake_data_models, estimates)

First, we use the nest function to create a data frame where each row is a group, and the ”data” column contains the data for that subset.

Then, we add two list columns to that data frame: our list of models, and then our list of data frames with the coefficients.

Finally, we extract the estimates into a new data frame with unnest. The result is the same data frame of coefficients and statistics, also carrying along the data subsets, linear models and coefficents.

The same code with pipes:

fake_data %>% 
  group_by(group) %>% 
  nest(data = c(id, predictor, response)) %>% 
  mutate(model = map(data, fit_model),
         estimates = map(model, tidy)) %>% 
  unnest(estimates) -> ## right way assignment just for kicks
  result

I’m still a list column sceptic, but I have to concede that this is pretty elegant. It gets the job done, it keeps objects that belong together together so that there is no risk of messing up the order, and it is not that much more verbose. I especially like that we can run the models and extract the coefficients in the same mutate call.

Coda: mixed model

Finally, I can’t resist comparing the separate linear models to a linear mixed model of all the data.

We use lme4 to fit a varying-intercept model, a model that puts the same coefficient on the slope between the response and predictor in each group, but allows the intercepts to vary between groups, assuming them to be drawn from the same normal distribution. We put the coefficients from the linear models fit in each group separately and the linear mixed model together in the same data frame to be able to plot them.

library(ggplot2)
library(lme4)

model <- lmer(response ~ (1|group) + predictor,
              fake_data)

lm_coef <- pivot_wider(result,
                       names_from = term,
                       values_from = estimate,
                       id_cols = group)

lmm_coef <- cbind(group = levels(model@flist$group),
                  coef(model)$group)

model_coef <- rbind(transform(lm_coef, model = "lm"),
                    transform(lmm_coef, model = "lmm"))

colnames(model_coef)[2] <- "intercept"

ggplot() +
  geom_point(aes(x = predictor,
                 y = response,
                 colour = factor(group)),
             data = fake_data) +
  geom_abline(aes(slope = predictor,
                  intercept = intercept,
                  colour = factor(group),
                  linetype = model),
              data = model_coef) +
  theme_bw() +
  theme(panel.grid = element_blank())

As advertised, the linear mixed model has the same slope in every group, and intercepts pulled closer to the mean. Now, we know that this is a good model for these fake data because we created them, and in the real world, that is obviously not the case … The point is: if we are going to fit a more complex model of the whole data, we want to be able to explore alternatives and plot them together with the data. Having elegant ways to transform data frames and summarise models at our disposal makes that process smoother.

1Aug2021

”It depends on what you want to do with it”

Publicerat i english, the practice of science av mrtnj

Note to self: ”It depends on what you want to do with it” is the stereotypical answer to questions about method and data analysis. Ask what genetic analysis software you should use to do X, how you should filter the variants you’ve detected from sequencing, whether to mark duplicates on your aligned reads, what a genetic parameter estimate means … I just wrote a version of this answer in an email. Mea maxima culpa.

Admittedly, it is probably often the technically correct answer, but it’s also useless without qualification. It always depends. Everything is everything-dependent. Goddag yxskaft. Moreover, I suspect that it often means ”I don’t know” or ”it doesn’t matter” — but you can’t say that, can you? That would sound unprofessional.

Here is what to ask either the supposed expert who blurted out this non-answer, or yourself if you are about to: What are the alternatives? A couple will do to give an idea of what the field of solutions and answers is like. What does it depend on? An idea of why one want to choose one alternative over the other will do. What do we need to know? Sometimes we need preliminary data, a huge modelling effort … or just that I go read some papers before I say anything more.

25Jul2021

A plot of genes on chromosomes

Publicerat i computer stuff av mrtnj

Marta Cifuentes and Wayne Crismani asked on Twitter if there is a web tool similar to the Arabidopsis Chromosome Map Tool that makes figures of genes on chromosomes for humans. This will not really be an answer to the question — not a web tool, not conveniently packaged — but I thought that would be a nice plot to make in R with ggplot2. We will use the ggrepel package to help with labelling, and get information from the Ensembl REST API with httr and jsonlite.

The plot and the final code to generate it

Below I will go through the functions that get us there, but here is the end product. The code is on GitHub as usual.

## Some Ensembl genes to test with

ensembl_genes <- c("ENSG00000125845", ## BMP2
                   "ENSG00000181690", ## PLAG1
                   "ENSG00000177508", ## IRX3
                   "ENSG00000140718") ## FTO

chr_sizes <- get_chromosome_sizes_from_ensembl(species = "homo_sapiens")

coords <- get_coordinates_from_ensembl(ensembl_genes)

plot_genes_test <- plot_genes(coords,
                              chr_sizes)

We will use Ensembl and access the genes via Ensembl Gene IDs. Here, I’ve looked up the Ensembl Gene IDs for four genes I like (in humans).

We need to know how long human chromosomes are in order to plot them, so we have a function for that; we also need to get coordinates for the genes, and we have a function for that. They are both below. These functions call up the Ensembl REST API to get the data from the Ensembl database.

Finally, there is a plotting function that takes the coordinates and the chromosome sizes as input and return a ggplot2 plot.

Getting the data out of the Ensembl REST API

Now, starting from the top, we will need to define those functions to interact with the Ensembl REST API. This marvellous machine allows us to get data out of the Ensembl database over HTTP, writing our questions in the URL. It is nicely described with examples from multiple languages on the Ensembl REST API website.

An alternative to using the REST API would be to download gene locations from BioMart. This was my first thought. BioMart is more familiar to me than the REST API, and it also has the nice benefit that it is easy to download every gene and store it away for the future. However, there isn’t a nice way to get chromosome lengths from BioMart, so we would have to download them from somewhere else. This is isn’t hard, but I thought using the REST API for both tasks seemed nicer.

## Plot showing the location of a few genomes on chromosomes

library(httr)
library(jsonlite)
library(ggplot2)
library(ggrepel)
library(purrr)

## Get an endpoint from the Ensembl REST api and return parsed JSON

get_from_rest_api <- function(endpoint_string,
                              server = "https://rest.ensembl.org/") {
  rest <- GET(paste(server, endpoint_string, sep = ""),
              content_type("application/json"))
  
  stop_for_status(rest)
  
  fromJSON(toJSON(content(rest)))
}

This first function gets content by sending a request, checking whether it worked (and stopping with an error if it didn’t), and then unpacking the content into an R object.

## Get chromosomes sizes from the Ensembl REST API

get_chromosome_sizes_from_ensembl <- function(species) {

  json <- get_from_rest_api(paste("info/assembly/", species, sep = ""))

  data.frame(name = as.character(json$top_level_region$name),
             length = as.numeric(json$top_level_region$length),
             stringsAsFactors = FALSE)
}

This second function asks for the genome assembly information for a particular species with the GET info/assembly/:species endpoint, and extracts the chromosome lengths into a data frame. The first part of data gathering is done, now we just need the coordinates fort the genes of interest.

## Get coordinates from Ensembl ids

get_coordinates_from_ensembl <- function(ensembl_ids) {
 
  map_dfr(ensembl_ids,
          function(ei) {
            json <- get_from_rest_api(paste("lookup/id/", ei, sep = ""))
            
            data.frame(position = (json$start + json$end)/2,
                       chr = json$seq_region_name,
                       display_name = json$display_name,
                       stringsAsFactors = FALSE)
          })
}

This function asks for the gene information for each gene ID we’ve given it with the GET lookup/id/:id endpoint, and extracts the rough position (mean of start and end coordinate), chromosome name, and the ”display name”, which in the human case will be a gene symbol. (For genes that don’t have a gene symbol, we would need to set up this column ourselves.)

At this point, we have the data we need in two data frames. That means it’s time to make the plot.

Plotting code

We will build a plot with two layers: first the chromosomes (as a geom_linerange) and then the gene locations (as a geom_text_repel from the ggrepel package). The text layer will move the labels around so that they don’t overlap even when the genes are close to each other, and by setting the nudge_x argument we can move them to the side of the chromosomes.

Apart from that, we change the scale to set he order of chromosomes and reverse the scale of the y-axis so that chromosomes start at the top of the plot.

The function returns a ggplot2 plot object, so one can do some further customisation after the fact — but for some features one would have to re-write things inside the function.

plot_genes <- function(coordinates,
                       chromosome_sizes) {

  ## Restrict to chromosomes that are in data  
  chrs_in_data <-
    chromosome_sizes[chromosome_sizes$name %in% coordinates$chr,]
  chr_order <- order(as.numeric(chrs_in_data$name))
  
  ggplot() +
    geom_linerange(aes(x = name,
                       ymin = 1,
                       ymax = length/1e6),
                   size = 2,
                   colour = "grey",
                   data = chrs_in_data) +
    geom_text_repel(aes(x = chr,
                        y = position/1e6,
                        label = display_name),
                    nudge_x = 0.33,
                    data = coordinates) +
    scale_y_reverse() +
    ## Fix ordering of chromosomes on x-axis
    scale_x_discrete(limits = chrs_in_data$name[chr_order],
                     labels = chrs_in_data$name[chr_order]) +
    theme_bw() +
    theme(panel.grid = element_blank()) +
    xlab("Chromosome") +
    ylab("Position (Mbp)")
  
}

Possible extensions

One feature from the Arabidopsis inspiration that is missing here is the position of centromeres. We should be able to use the option ?bands=1 in the GET info/assembly/:species to get cytogenetic band information and separate p and q arms of chromosomes. This will not be universal though, i.e. not available for most species I care about.

Except to make cartoons of gene positions, I think this might be a nice way to make plots of genome regions with very course resolution, i.e. linkage mapping results, where one could add lines to show genomic confidence intervals, for example.

18Jul2021

Journal club of one: ”A unifying concept of animal breeding programmes”

Publicerat i english, genetik av mrtnj

The developers of the MoBPS breeding programme simulators have published three papers about it over the last years: one about the MoBPS R package (Pook et al. 2020), one about their web server MoBPSweb (Pook et al. 2021), and one that discusses the logic of the specification for breeding programmes they use in the web interface (Simianer et al. 2021). The idea is that this specification can be used to describe breeding programmes in a precise and interoperable manner. The latter — about breeding programme specification — reads as if the authors had a jolly good time thinking about what breeding and breeding programmes are; at least, I had such feelings while reading it. It is accompanied by an editorial by Simianer (2021) talking about what he thinks are the important research directions for animal breeding research. Using simulation to aid breeding programme design is one of them.

Defining and specifying breeding programmes

After defining a breeding programme in a narrow genetic sense as a process achieving genetic change (more about that later), Simianer et al. (2021) go on to define a specification of such a breeding programme — or more precisely, of a model of a breeding programme. They use the word ”definition” for both things, but they really talk about two different things: defining what ”a breeding programme” is and specifying a particular model of a particular breeding programme.

Here, I think it is helpful to think of Guest & Martin’s (2020) distinction, borrowed from psychology, between a specification of a model and an implementation of a model. A specification is a description of a model based in natural language or math. Breeding programme modelling often takes the shape of software packages, where the software implementation is the model and a specification is quite vague. Simianer et al. (2021) can be seen as a step towards a way of writing specifications, on a higher level in Guest & Martin’s hierarchy, of what such a breeding programme model should achieve.

They argue that such a specification (”formal description”) needs to be comprehensive, unambiguous and reproducible. They claim that this can be achieved with two parts: the breeding environment and the breeding structure.

The breeding environment includes:

founder populations,
quantitative genetic parameters for traits,
genetic architectures for traits,
economic values for traits in breeding goal,
description of genomic information,
description of breeding value estimation methods.

Thus, the ”formal” specification depends on a lot of information that is either unknowable in practice (genetic architecture), estimated with error (genetic parameters), hard to describe other than qualitatively (founder population) and dependent on particular software implementations and procedures (breeding value estimation). This illustrates the need to distinguish the map from the territory — to the extent that the specification is exact, it describes a model of a breeding programme, not a real breeding programme.

The other part of their specification is the graph-based model of breeding structure. I think this is their key new idea. The breeding structure, in their specification, consists of nodes that represent groups of elementary objects (I would say ”populations”) and edges that represent transformations that create new populations (such as selection or mating) or are a shorthand for repeating groups of edges and nodes.

The elementary objects could be individuals, but they also give the example of gametes and genes (I presume they mean in the sense of alleles) as possible elementary objects. One could also imagine groups of genetically identical individuals (”genotypes” in a plant breeding sense). Nodes contain a given number of individuals, and can also have a duration.

Edges are directed, and correspond to processes such as ageing, selection, reproduction, splitting or merging populations. They will carry attributes related to the transformation. Edges can have a time associated with them that it takes for the transformation to happen (e.g. for animals to gestate or grow to a particular age). Here is an example from Pook et al. (2021) of a breeding structure graph for dairy cattle:

If we ignore the red edges for now, we can follow the flow of reproduction (yellow edges) and selection (green edges): The part on the left is what is going on in the breeding company: cows (BC-Cows) reproduce with selected bulls (BC-SelectedBulls), and their offspring become the next generation of breeding company cows and bulls (BC-NextCows and BC-NextBulls). On the right is the operation of a farm, where semen from the breeding company is used to inseminate cows and heifers (heifer, cow-L1, cow-L2, cow-L3) to produce calfs (calf-h, calf-L1, calf-L2, calf-L3). Each cycle each of these groups, through a selection operation, give rise to the next group (heifers becoming cows, first lactation cows becoming second lactation cows etc).

Breeding loops vs breeding graphs vs breeding forms

Except for the edges that specify breeding operations, there is also a special meta-edge type, the repeat edge, that is used to simplify breeding graphs with repeated operations.

A useful edge class to describe breeding programmes that are composed of several breeding cycles is ”repeat.” It can be used to copy resulting nodes from one breeding cycle into the nodes of origin of the next cycle, assuming that exactly the same breeding activities are to be repeated in each cycle. The “repeat” edge has the attribute “number of repeats” which allows to determine the desired number of cycles.

In the MoBPSweb paper (Pook et al. 2020), they describe how it is implemented in MoBPS: Given a breeding specification in MoBPSweb JSON format, the simulator will generate a directed graph by copying the nodes on the breeding cycle as many times as is specified by the repeat number. In this way, repeat edges are eliminated to make the breeding graph acyclic.

The conversion of the breeding scheme itself is done by first detecting if the breeding scheme has any “Repeat” edges (Simianer et al. 2020), which are used to indicate that a given part of the breeding programme is carried out multiple times (breeding cycles). If that is the case, it will subsequently check which nodes can be generated without the use of any repeat. Next, all repeats that can be executed based on the already available nodes are executed by generating copies of all nodes between the node of origin and the target node of the repeat (including the node of origin and excluding the target node). Nodes generated via repeat are serial-numbered via “_1,” “_2” etc. to indicate the repeat number. This procedure is repeated until all repeat edges are resolved, leading to a breeding programme without any repeats remaining.

There are at least three ways to specify the breeding structures for breeding programme simulations: these breeding graphs, breeding loops (or more generally, specifying breeding in a programming language) and breeding forms (when you get a pre-defined breeding structure and are allowed fill in the numbers).

If I’m going to compare the graph specification to what I’m more familiar with, this is how you would create a breeding structure in AlphaSimR:

library(AlphaSimR)

## Breeding environment

founderpop <- runMacs(nInd = 100,
                      nChr = 20)
simparam <- SimParam$new(founderpop)
simparam$setSexes("yes_sys")
simparam$addTraitA(nQtlPerChr = 100)
simparam$setVarE(h2 = 0.3)

## Breeding structure

n_time_steps <- 10

populations <- vector(mode = "list",
                      length = n_time_steps + 1)

populations[[1]] <- newPop(founderpop,
                           simParam = simparam)

for (gen_ix in 2:(n_time_steps + 1)) {

  ## Breeding cycle happens here

}

In the AlphaSimR script, the action typically happens within the loop. You apply different functions on population objects to make your selection, move individuals between parts of the breeding programme, create offspring etc. That is, in the MoBPS breeding structure, populations are nodes and actions are edges. In AlphaSimR, populations are objects and operations are functions. In order to not have to copy paste your breeding code, you use the control flow structures of R to make a loop (or some functional equivalent). In MoBPS graph structure, in order to not have to create every node and edge manually, you use the Repeat edge.

Breeding graphs with many repeat edges with different times attached to them have the potential to be complicated, and the same is true of breeding loops. I would have to use both of them more to have an opinion about what is more or less intuitive.

Now that we’ve described what they do in the paper, let’s look at some complications.

Formal specifications only apply to idealised breeding programmes

The authors claim that their concept provides a formal breeding programme specification (in their words, ”formal description”) that can be fully understood and implemented by breeders. It seems like the specification fails to live up to this ambition, and it appears doubtful whether any type of specification can. This is because they do not distinguish between specifying a model of a breeding programme and specifying a real implementation of a breeding programme.

First, as mentioned above, the ”breeding environment” as described by them, contains information that can never be specified for any real population, such as the genetic architecture of complex traits.

Second, their breeding structure is described in terms of fixed numbers, which will never be precise due to mortality, conception rates, logistics and practical concerns. They note such fluctuations in population size as a limitation in the Discussion. To some extent, random mortality, reproductive success etc an be modelled by putting random distributions on various parameters. (I am not sure how easy this is to do in the MoBPS framework; it is certainly possible.) However, this adds uncertainty about what these hyperparameter should be and whether they are realistic.

Such complications would just be nit-picking if the authors had not suggested that their specification can be used to communicate breeding programmes between breeders and between breeders and authorities, such as when a breeding programme is seeking approval. They acknowledge that the authorities, for example in the EU, want more detailed information that are beyond the scope of their specification.

And the concept is not very formal in the first place

Despite the claimed formality, every class of object in the breeding structure is left open, with many possible actions and many possible attributes that are never fully defined.

It is somewhat ambiguous what is to be the ”formal” specification — it cannot be the description in the paper as it is not very formal or complete; it shouldn’t be the implementation in MoBPS and MoBPSweb, as the concept is claimed to be universal; maybe it is the JSON specification of the breeding structure and background as described in the MoBPSweb paper (Pook et al. 2020). The latter seems the best candidate for a well-defined formal way to specify breeding programme models, but then again, the JSON format appears not to have a published specification, and appears to contain implementation-specific details relating to MoBPS.

This also matters to the suggested use of the specification to communicate real breeding programme designs. What, precisely, is it that will be communicated? Are breed societies and authorities expected to communicate with breeding graphs, JSON files, or with verbal descriptions using their terms (e.g. ”breeding environment”, ”breeding structure”, their node names and parameters for breeding activities)?

There is almost never a need for a definition

As I mentioned before, the paper starts by asking what a breeding programme is. They refer to different descriptions of breeding programme design from textbooks, and a legal definition from EU regulation 2016/1012; article 2, paragraph 26, which goes:

‘breeding programme’ means a set of systematic actions, including recording, selection, breeding and exchange of breeding animals and their germinal products, designed and implemented to preserve or enhance desired phenotypic and/or genotypic characteristics in the target breeding population.

There seems to be some agreement that a breeding programme, in addition to being the management of reproduction of a domestic animal population, also is systematic and goal-directed activity. Despite these descriptions of breeding programmes and their shared similarity, the authors argue that there is no clear formal definition of what a breeding programme is, and that this would be useful to delineate and specify breeding programmes.

They define a breeding programme as an organised process that aims to change the genetic composition in a desired direction, from one group of individuals to a group of individuals at a later time. The breeding programme comprises those individuals and activities that contribute to this process. For example, crossbred individuals in a multiplier part of a terminal crossbreeding programme would be included to the extent that they contribute information to the breeding of nucleus animals.

We define a breeding programme as a structured, man-driven process in time that starts with a group of individuals X at time $t_1$ and leads to a group of individuals Y at time $t_2 > t_1$ . The objective of a breeding programme is to transform the genetic characteristics of group X to group Y in a desired direction, and a breeding programme is characterized by the fact that the implemented actions aim at achieving this transformation.

They actually do not elaborate on what it means that a genetic change has direction, but since they want the definition to apply both to farm animal and conservation breeding programmes, the genetic goals could be formulated both in terms of changes in genetic values for traits and in terms of genetic relationships.

Under many circumstances, this is a reasonable proxy also for an economic target: The breeding structures and interventions considered in theoretical breeding programme designs can often be evaluated in terms of their effect on the response to selection, and if the response improves, so will the economic benefit. However, this definition seems a little unnecessary and narrow. If you wanted to, say, add a terminal crossbreeding step to the simulation and evaluate the performance in terms of the total profitability of the crossbreeding programme (that is, something that is outside of the breeding programme in the sense of the above definition), nothing is stopping you, and the idea is not in principle outside of the scope of animal breeding.

Finally, an interesting remark about efficiency

When discussing the possibility of using their concept to communicate breeding programmes to authorities when seeking approval the authors argue that authorities should not take efficiency of the breeding programme into account when they evaluate breeding programmes for approval. They state this point forcefully without explaining their reasoning:

It should be noted, though, that an evaluation of the efficiency of breeding programme is not, and should never be, a precondition for the approval of a breeding programme by the authorities.

This raises the question: why not? There might be both environmental, economical and animal ethical reasons to consider not approving breeding programmes that can be shown to make inefficient use of resources. Maybe such evaluation would be impractical — breeding programme analysis and simulation might have to be put on a firmer scientific grounding and be made more reproducible and transparent before we trust it to make such decisions — but efficiency does seem like an appropriate thing to want in a breeding scheme, also from the perspective of society and the authorities.

I am not advocating any particular new regulation for breeding programmes here, but I wonder where the ”should never” came from. This reads like a comment added to appease a reviewer — the passage is missing from the preprint version.

Literature

Pook, T., Schlather, M., & Simianer, H. (2020a). MoBPS-modular breeding program simulator. G3: Genes, Genomes, Genetics, 10(6), 1915-1918. https://academic.oup.com/g3journal/article/10/6/1915/6026363

Pook, T., Büttgen, L., Ganesan, A., Ha, N. T., & Simianer, H. (2021). MoBPSweb: A web-based framework to simulate and compare breeding programs. G3, 11(2), jkab023. https://academic.oup.com/g3journal/article/11/2/jkab023/6128572

Simianer, H., Büttgen, L., Ganesan, A., Ha, N. T., & Pook, T. (2021). A unifying concept of animal breeding programmes. Journal of Animal Breeding and Genetics, 138 (2), 137-150. https://onlinelibrary.wiley.com/doi/full/10.1111/jbg.12534

Simianer, H. (2021), Harvest Moon: Some personal thoughts on past and future directions in animal breeding research. J Anim Breed Genet, 138: 135-136. https://doi.org/10.1111/jbg.12538

Guest, O., & Martin, A. E. (2020). How computational modeling can force theory building in psychological science. Perspectives on Psychological Science, 1745691620970585. https://journals.sagepub.com/doi/full/10.1177/1745691620970585

13Jul2021

Paper: ”Genetic variation in recombination rate in the pig”

Publicerat i english, genetik av mrtnj

Our paper on genetic variation in recombination in the pig just came out the other week. I posted about it already when it was a preprint, but we dug a little deeper into some of the results in response to peer review, so let us have a look at it again.

Summary

Recombination between chromosomes during meiosis leads to shuffling of genetic material between chromosomes, creating new combinations of alleles. Recombination rate varies, though, between parts of the genome, between sexes, and between individuals. Illustrating that, here is a figure from the paper showing how recombination rate varies along the chromosomes of the pig genome. Female recombination rate is higher than male recombination rate on most chromosomes, and in particular in regions of higher recombination rate in the middle of certain chromosomes.

Fig 2 from the paper: average recombination rate in 1 Mbp windows along the pig genome. Each line is a population, coloured by sexes.

In several other vertebrates, part of that individual variation in recombination rate (in the gametes passed on by that individual) is genetic, and associated with regions close to known meiosis-genes. It turns out that this is the case in the pig too.

In this paper, we estimated recombination rates in nine genotyped pedigree populations of pigs, and used that to perform genome-wide association studies of recombination rate. The heritability of autosomal recombination rate was around 0.07 in females, 0.05 in males.

The major genome-wide association hit on chromosome 8, well known from other mammals, overlapped the RNF212 gene in most populations in females, and to a lesser extent in males.

Fig 6 from the paper, showing genome-wide association results for eight of the populations (one had too few individuals with recombination rate estimates after filtering for GWAS). The x-axis is position in the genome, and the y-axis is the negative logarithm of the p-value from a linear mixed model with repeated measures.

One of the things we added after the preprint is a meta-analysis of genome-wide association over all the populations (separated by sex). In total, there were six associated regions, five of which are close to known recombination genes: RNF212, SHOC1, SYCP2, MSH4 and HFM1. In particular, several of the candidates are genes involved in whether a double strand break resolves as a crossover or non-crossover. However, we do not have the genomic resolution to know whether these are actually the causative genes; there are significant markers overlapping many genes, and the candidate genes are not always the closest gene.

How well does the recombination landscape agree with previous maps?

The recombination landscapes accord pretty well with Tortereau et al.’s (2012) maps. We find a similar sex difference, with higher recombination in females on all autosomes except chromosome 1 and 13, and a stronger association with GC content in females. However, our recombination rates tend to be higher, possibly due to some overestimation. Different populations, where recombinations are estimated independently, also have similar recombination landscapes.

But it doesn’t agree so well with Lozada-Soto et al. (2021), does it?

No, that is true. Between preprint and finished version, Lozada-Soto et al. (2021) published a genome-wide association study of recombination rate in the pig. They found heritabilities of recombination rate of a similar magnitude as we did, but their genome-wide association results are completely different. We did not find the hits that they found, and they did not find the hits we found, or any previously known candidate genes for recombination. To be honest, I don’t have a good explanation for these differences.

How about recombination hotspots and PRDM9?

At a very fine scale, most recombination tends to occur in hotspots of around a few kilobasepairs. As this study used pedigrees and SNP chips with much coarser density than this, we cannot say much about the fine-scale recombination landscape. We work, at the finest, with windows of 1 Mbp. However, as the pig appears to have a working and rapidly evolving PRDM9 gene (encoding a protein that is responsible for recombination hotspot targeting), the pig probably has a PRDM9-based landscape of hotspots just like humans and mice (Baker et al. 2019).

Tortereau et al. (2012) found a positive correlation between counts of the PRDM9 DNA-binding motif and recombination rate, which is biologically plausible, as more PRDM9 motifs should mean more hotspots. So, we estimated this correlation for comparison, but found only a very weak relationship — this is one point where our results are inconsistent with previous maps. This might be because of changes in improved pig genome assembly we use, or it might be an indication that we have worse genomic resolution due to the genotype imputation involved in our estimation. However, one probably shouldn’t expect to find strong relationships between a process at the kilobasepair-scale when using windows of 1 Mbp in the first place.

Can one breed for increased recombination to improve genetic gain?

Not really. Because recombination breaks up linkage disequlibrium between causative variants, higher recombination rate could reveal genetic variation for selection and improve genetic gain. However, previous studies suggest that recombination rate needs to increase quite a lot (two-fold or more) to substantially improve breeding (Battagin et al. 2016). We made some back of the envelope quantitative genetic calculations on the response to selection for recombination, and it would be much smaller than that.

Literature

Johnsson M*, Whalen A*, Ros-Freixedes R, Gorjanc G, Chen C-Y, Herring WO, de Koning D-J, Hickey JM. (2021) Genetics variation in recombination rate in the pig. Genetics Selection Evolution (* equal contribution)

Tortereau, F., Servin, B., Frantz, L., Megens, H. J., Milan, D., Rohrer, G., … & Groenen, M. A. (2012). A high density recombination map of the pig reveals a correlation between sex-specific recombination and GC content. BMC genomics, 13(1), 1-12.

Lozada‐Soto, E. A., Maltecca, C., Wackel, H., Flowers, W., Gray, K., He, Y., … & Tiezzi, F. (2021). Evidence for recombination variability in purebred swine populations. Journal of Animal Breeding and Genetics, 138(2), 259-273.

Baker, Z., Schumer, M., Haba, Y., Bashkirova, L., Holland, C., Rosenthal, G. G., & Przeworski, M. (2017). Repeated losses of PRDM9-directed recombination despite the conservation of PRDM9 across vertebrates. Elife, 6, e24133.

Battagin, M., Gorjanc, G., Faux, A. M., Johnston, S. E., & Hickey, J. M. (2016). Effect of manipulating recombination rates on response to selection in livestock breeding programs. Genetics Selection Evolution, 48(1), 1-12.

4Jul2021

Preprint: ”Genetics of tibia bone properties of crossbred commercial laying hens in different housing systems”

Publicerat i english, genetik av mrtnj

We have a new preprint posted to Biorxiv looking into the genetic basis of bone strength and other bone properties in crossbred laying hens in two different housing environments (furnished cages and floor pens).

Here are the citation and abstract:

Martin Johnsson, Helena Wall, Fernando A Lopes Pinto, Robert H. Fleming, Heather A. McCormack, Cristina Benavides-Reyes, Nazaret Dominguez-Gasca, Estefania Sanchez-Rodriguez, Ian C. Dunn, Alejandro B. Rodriguez-Navarro, Andreas Kindmark, Dirk-Jan de Koning (2021) Genetics of tibia bone properties of crossbred commercial laying hens in different housing systems. bioRxiv 2021.06.21.449243

Osteoporosis and bone fractures are a severe problem for the welfare of laying hens, with genetics and environment, such as housing system, each making substantial contributions to bone strength. In this work, we performed genetic analyses of bone strength, bone mineral density and bone composition, as well as body weight, in 860 commercial crossbred laying hens from two different companies, kept in either furnished cages or floor pens. We compared bone traits between housing systems and crossbreds, and performed a genome-wide association study of bone properties and body weight.

As expected, the two housing systems produced a large difference in bone strength, with layers housed in floor pens having stronger bones. These differences were accompanied by differences in bone geometry, mineralisation and chemical composition. Genome-scans either combining or independently analysing the two housing systems revealed no genome-wide significant loci for bone breaking strength. We detected three loci for body weight that were shared between the housing systems on chromosomes 4, 6 and 27 (either genome-wide significant or suggestive when the housing systems were analysed individually) and these coincide with associations for bone length.

In summary, we found substantial differences in bone strength, content and composition between hens kept in floor pens and furnished cages that could be attributed to greater physical activity in pen housing. We found little evidence for large-effect loci for bone strength in commercial crossbred hens, consistent with a highly polygenic architecture for bone strength in the production environment. The lack of consistent genetic associations between housing systems in combination with the differences in bone phenotypes support gene-by-environment interactions with housing system.

The background is that bone quality is a serious problem for laying hens; that housing systems that allow for more movement are known to lead to stronger bones; and that previous work on the genetics of bone parameters comes mostly from pure lines or from experimental intercrosses between divergent lines. Here, we study commercial crossbred laying hens from two different companies.

Being housed in a floor pen, where there is more opportunity for physical activity, or in a furnished cage makes a big difference to bone breaking strength. For comparison, we also show body weight, which is not much different between the housing environments. This difference was accompanied by differences in bone composition (see details in the paper).

And here are the Manhattan plots from genome-wide association: bone strength shows no major loci, as opposed to body weight, which has strong associations that are shared between the housing systems.

And if we compare the genome-wide associations, marker for marker, between the housing systems, there is nothing in common between the suggestive associations for bone strength. (Body weight below for comparison.)

This includes not detecting major loci for bone strength that have been found in pure lines of chickens. We think this is due to gene-by-environment interactions with housing (i.e. physical activity). This might be a complication for genomic selection for bone quality, as selection might need to be targeted to different housing systems.

Finally, the three strong association for body weight shown above overlap previously detected loci on chromosomes 4, 6, and 27. We do not have the genomic resolution to nominate candidate genes with any confidence, but the chromosome 4 locus overlaps both the CCKAR gene, which is a strong candidate for growth and body mass in the chicken and the LCORL/NCAPG locus, which has been associated with body size in several species. These regions (plus a fourth one) are also associated with bone length:

17Maj2021

Convincing myself about the Monty Hall problem

Publicerat i computer stuff, english, så kallad humor av mrtnj

Like many others, I’ve never felt that the solution to the Monty Hall problem was intuitive, despite the fact that explanations of the correct solution are everywhere. I am not alone. Famously, columnist Marilyn vos Savant got droves of mail from people trying to school her after she had published the correct solution.

The problem goes like this: You are a contestant on a game show (based on a real game show hosted by Monty Hall, hence the name). The host presents you with three doors, one of which contains a prize — say, a goat — and the others are empty. After you’ve made your choice, the host opens one of the doors, showing that it is empty. You are now asked whether you would like to stick to your initial choice, or switch to the other door. The right thing to do is to switch, which gives you 2/3 probability of winning the goat. This can be demonstrated in a few different ways.

A goat is a great prize. Image: Casey Goat by Pete Markham (CC BY-SA 2.0)

So I sat down to do 20 physical Monty Hall simulations on paper. I shuffled three cards with the options, picked one, and then, playing the role of the host, took away one losing option, and noted down if switching or holding on to the first choice would have been the right thing to do. The results came out 13 out of 20 (65%) wins for the switching strategy, and 7 out of 20 (35%) for the holding strategy. Of course, the Monty Hall Truthers out there must question whether this demonstration in fact happened — it’s too perfect, isn’t it?

The outcome of the simulation is less important than the feeling that came over me as I was running it, though. As I was taking on the role of the host and preparing to take away one of the losing options, it started feeling self-evident that the important thing is whether the first choice is right. If the first choice is right, holding is the right strategy. If the first choice is wrong, switching is the right option. And the first choice, clearly, is only right 1/3 of the time.

In this case, it was helpful to take the game show host’s perspective. Selvin (1975) discussed the solution to the problem in The American Statistician, and included a quote from Monty Hall himself:

Monty Hall wrote and expressed that he was not ”a student of statistics problems” but ”the big hole in your argument is that once the first box is seen to be empty, the contestant cannot exchange his box.” He continues to say, ”Oh, and incidentally, after one [box] is seen to be empty, his chances are no longer 50/50 but remain what they were in the first place, one out of three. It just seems to the contestant that one box having been eliminated, he stands a better chance. Not so.” I could not have said it better myself.

A generalised problem

Now, imagine the same problem with a number d number of doors, w number of prizes and o number of losing doors that are opened after the first choice is made. We assume that the losing doors are opened at random, and that switching entails picking one of the remaining doors at random. What is the probability of winning with the switching strategy?

The probability of picking the a door with or without a prize is:

$\Pr(\text{pick right first}) = \frac{w}{d}$

$\Pr(\text{pick wrong first}) = 1 - \frac{w}{d}$

If we picked a right door first, we have w – 1 winning options left out of d – o – 1 doors after the host opens o doors:

$\Pr(\text{win\textbar right first}) = \frac{w - 1}{d - o - 1}$

If we picked the wrong door first, we have all the winning options left:

$\Pr(\text{win\textbar wrong first}) = \frac{w}{d - o - 1}$

Putting it all together:

$\Pr(\text{win\textbar switch}) = \Pr(\text{pick right first}) \cdot \Pr(\text{win\textbar right first}) + \\ + \Pr(\text{pick wrong first}) \cdot \Pr(\text{win\textbar wrong first}) = \\ = \frac{w}{d} \frac{w - 1}{d - o - 1} + (1 - \frac{w}{d}) \frac{w}{d - o - 1}$

As before, for the hold strategy, the probability of winning is the probability of getting it right the first time:

$\Pr(\text{win\textbar hold}) = \frac{w}{d}$

With the original Monty Hall problem, w = 1, d = 3 and o = 1, and therefore

$\Pr(\text{win\textbar switch}) = \frac{1}{3} \cdot 0 + \frac{2}{3} \cdot 1$

Selvin (1975) also present a generalisation due to Ferguson, where there are n options and p doors that are opened after the choice. That is, w = 1, d = 3 and o = 1. Therefore,

$\Pr(\text{win\textbar switch}) = \frac{1}{n} \cdot 0 + (1 - \frac{1}{n}) \frac{1}{n - p - 1} = \frac{n - 1}{n(n - p - 1)}$

which is Ferguson’s formula.

Finally, in Marilyn vos Savant’s column, she used this thought experiment to illustrate why switching is the right thing to do:

Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?

That is, w = 1 still, d = 10⁶ and o = 10⁶ – 2.

$\Pr(\text{win\textbar switch}) = 1 - \frac{1}{10^6}$

It turns out that the solution to the generalised problem is that it is always better to switch, as long as there is a prize, and as long as the host opens any doors. One can also generalise it to choosing sets of more than one door. This makes some intuitive sense: as long as the host takes opens some doors, taking away losing options, switching should enrich for prizes.

Some code

To be frank, I’m not sure I have convinced myself of the solution to the generalised problem yet. However, using the code below, I did try the calculation for all combinations of total number of doors, prizes and doors opened up to 100, and in all cases, switching wins. That inspires some confidence, should I end up on a small ruminant game show.

The code below first defines a wrapper around R’s sampling function, which has a very annoying alternative behaviour when fed a vector of length one, to be able to build a computational version of my physical simulation. Finally, we have a function for the above formulae. (See whole thing on GitHub if you are interested.)

## Wrap sample into a function that avoids the "convenience"
## behaviour that happens when the length of x is one

sample_safer <- function(to_sample, n) {
  assert_that(n <= length(to_sample))
  if (length(to_sample) == 1)
    return(to_sample)
  else {
    return(sample(to_sample, n))
  }
}


## Simulate a generalised Monty Hall situation with
## w prizes, d doors and o doors that are opened.

sim_choice <- function(w, d, o) {
  ## There has to be less prizes than unopened doors
  assert_that(w < d - o) 
  wins <- rep(1, w)
  losses <- rep(0, d - w)
  doors <- c(wins, losses)
  
  ## Pick a door
  choice <- sample_safer(1:d, 1)
  
  ## Doors that can be opened
  to_open_from <- which(doors == 0)
  
  ## Chosen door can't be opened
  to_open_from <- to_open_from[to_open_from != choice]
  
  ## Doors to open
  to_open <- sample_safer(to_open_from, o)
  
  ## Switch to one of the remaining doors
  possible_switches <- setdiff(1:d, c(to_open, choice))
  choice_after_switch <- sample_safer(possible_switches , 1)
  
  result_hold <- doors[choice]
  result_switch <- doors[choice_after_switch]
  c(result_hold,
    result_switch)
}


## Formulas for probabilities

mh_formula <- function(w, d, o) {
  ## There has to be less prizes than unopened doors
  assert_that(w < d - o) 
  
  p_win_switch <- w/d * (w - 1)/(d - o - 1) +
                     (1 - w/d) * w / (d - o - 1) 
  p_win_hold <- w/d
  c(p_win_hold,
    p_win_switch)
}


## Standard Monty Hall

mh <- replicate(1000, sim_choice(1, 3, 1))

> mh_formula(1, 3, 1)
[1] 0.3333333 0.6666667
> rowSums(mh)/ncol(mh)
[1] 0.347 0.653

The Monty Hall problem problem

Guest & Martin (2020) use this simple problem as their illustration for computational model building: two 12 inch pizzas for the same price as one 18 inch pizza is not a good deal, because the 18 inch pizza contains more food. Apparently this is counter-intuitive to many people who have intuitions about inches and pizzas.

They call the risk of having inconsistencies in our scientific understanding because we cannot intuitively grasp the implications of our models ”The pizza problem”, arguing that it can be ameliorated by computational modelling, which forces you to spell out implicit assumptions and also makes you actually run the numbers. Having a formal model of areas of circles doesn’t help much, unless you plug in the numbers.

The Monty Hall problem problem is the pizza problem with a vengeance; not only is it hard to intuitively grasp what is going on in the problem, but even when presented with compelling evidence, the mental resistance might still remain and lead people to write angry letters and tweets.

Literature

Guest, O, & Martin, AE (2020). How computational modeling can force theory building in psychological science. Preprint.

Selvin, S (1975) On the Monty Hall problem. The American Statistician 29:3 p.134.

9Maj2021

Showing a difference in mean between two groups, take 2

Publicerat i data analysis, english av mrtnj

A couple of years ago, I wrote about the paradoxical difficulty of visualising a difference in means between two groups, while showing both the data and some uncertainty interval. I still feel like many ills in science come from our inability to interpret a simple comparison of means. Anything with more than two groups or a predictor that isn’t categorical makes things worse, of course. It doesn’t take much to overwhelm the intuition.

My suggestion at the time was something like this — either a panel that shows the data an another panel with coefficients and uncertainty intervals; or a plot that shows the with lines that represent the magnitude of the difference at the upper and lower limit of the uncertainty interval.

Alternative 1 (left), with separate panels for data and coefficient estimates, and alternative 2 (right), with confidence limits for the difference shown as vertical lines. For details, see the old post about these graphs.

Here is the fake data and linear model we will plot. If you want to follow along, the whole code is on GitHub. Group 0 should have a mean of 4, and the difference between groups should be two, and as the graphs above show, our linear model is not too far off from these numbers.

library(broom)

data <- data.frame(group = rep(0:1, 20))
data$response <- 4 + data$group * 2 + rnorm(20)

model <- lm(response ~ factor(group), data = data)
result <- tidy(model)

Since the last post, a colleague has told me about the Gardner-Altman plot. In a paper arguing that confidence intervals should be used to show the main result of studies, rather than p-values, Gardner & Altman (1986) introduced plots for simultaneously showing confidence intervals and data.

Their Figure 1 shows (hypothetical) systolic blood pressure data for a group of diabetic and non-diabetic people. The left panel is a dot plot for each group. The right panel is a one-dimensional plot (with a different scale than the right panel; zero is centred on the mean of one of the groups), showing the difference between the groups and a confidence interval as a point with error bars.

There are functions for making this kind of plot (and several more complex alternatives for paired comparisons and analyses of variance) in the R package dabestr from Ho et al. (2019). An example with our fake data looks like this:

Alternative 3: Gardner-Altman plot with bootstrap confidence interval.

library(dabestr)

bootstrap <- dabest(data,
                    group,
                    response,
                    idx = c("0", "1"),
                    paired = FALSE)

bootstrap_diff <- mean_diff(bootstrap)

plot(bootstrap_diff)

While this plot is neat, I think it is a little too busy — I’m not sure that the double horizontal lines between the panels or the shaded density for the bootstrap confidence interval add much. I’d also like to use other inference methods than bootstrapping. I like how the scale of the right panel has the same unit as the left panel, but is offset so the zero is at the mean of one of the groups.

Here is my attempt at making a minimalistic version:

Alternative 4: Simplified Garner-Altman plot.

This piece of code first makes the left panel of data using ggbeeswarm (which I think looks nicer than the jittering I used in the first two alternatives), then the right panel with the estimate and approximate confidence intervals of plus/minus two standard errors of the mean), adjusts the scale, and combines the panels with patchwork.

library(ggbeeswarm)
library(ggplot2
library(patchwork)

ymin <- min(data$response)
ymax <- max(data$response)

plot_points_ga <- ggplot() +
  geom_quasirandom(aes(x = factor(group), y = response),
                   data = data) +
  xlab("Group") +
  ylab("Response") +
  theme_bw() +
  scale_y_continuous(limits = c(ymin, ymax))

height_of_plot <- ymax-ymin

group0_fraction <- (coef(model)[1] - ymin)/height_of_plot

diff_min <- - height_of_plot * group0_fraction

diff_max <- (1 - group0_fraction) * height_of_plot

plot_difference_ga <- ggplot() +
  geom_pointrange(aes(x = term, y = estimate,
                      ymin = estimate - 2 * std.error,
                      ymax = estimate + 2 * std.error),
                  data = result[2,]) +
  scale_y_continuous(limits = c(diff_min, diff_max)) +
  ylab("Difference") +
  xlab("Comparison") +
  theme_bw()

(plot_points_ga | plot_difference_ga) +
    plot_layout(widths = c(0.75, 0.25))

Literature

Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal

Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2019). Moving beyond P values: data analysis with estimation graphics. Nature methods.

2Maj2021

Reflektioner om högskolepedagogik: undervisning online

Publicerat i svenska av mrtnj

Den senaste i serien högskolepedagogiska kurser var en workshop om e-lärande, alltså undervisning online. Det är användbart både för genuina distanskurser, för kurser som behöver hållas på distans i nödfall därför att det råkar vara pandemi, och för vilka kurser som helst — eftersom varje kurs med självaktning har ett inslag av online-aktiviteter numera. Vi använder ju alltid en digital lärplattform till att dela material, hantera inlämningar, meddelanden och diskussioner, även om kursen också har klassrumsaktiviteter.

Jag har hittils inte behövt spela in några föreläsningar eller demonstrationer, men nu är jag bättre förberedd ifall det skulle behövas. En fördel med att behöva lyssna på mig själv alldeles för noggrant för att kunna skriva textningen: Jag insåg att jag, i min nedkortade föreläsning, gav en ganska torftig förklaring av ett visst genetiskt koncept (för insatta: ja, det var naturligtvis kopplingsojämvikt). Om jag skulle använda den i faktisk undervisning måste jag spela in den delen igen med en bättre förklaring.

Den automatiska textningen har det inte lätt med min engelska kombinerad med genetisk terminologi. Jag minns inte vad jag sa här, men det var något helt annat.

Annars var det mest intressanta att prata (och i någon mån klaga) med andra deltagare om det senaste årets distansundervisning. Ett återkommande klagomål under det gångna året är hur trist det är att föreläsa för en skärm, jämfört med att göra det inför en publik i ett rum. Jag håller med: Det är både tråkigare och mer stressande att prata till en skärm, och det blir knappast bättre när det är en inspelning på gång. Men å andra sidan, varför är det så viktigt att titta på åhörarnas ansikten? Vet vi att studenterna hänger med för att de ser ut att hänga med — eller att de inte hänger med för att de ser ut att uttråkat titta ut i rymden? Nej, såklart inte. Det är klart att det är trevligt för mig att se dem jag talar till, men jag har svårt att se att det ger mig någon användbar information om vad de förstår och inte, om jag inte frågar dem.

Och på motsatt håll: är det viktigt att studenterna ser mitt ansikte? Även det omvända klassrummet, i all sin påstådda radikalitet, verkar en smula fixerat vid föreläsningar. Å ena sidan känns det seriöst med en videoföreläsning, i alla fall om den inte är för tafflig — att jag tagit tiden och ansträngningen att samla ihop och spela in materialet. Och det finns ett visst underhållningsvärde, som inte är att förakta, i att läraren visar sitt ansikte och har ett personligt tilltal. Å andra sidan, all kritik som finns mot föreläsningsformen (med undantaget att man inte kan spola tillbaka och se den igen) kan riktas mot den förinspelade föreläsningen. Den eventuella lilla interaktivitet som finns i en live-föreläsning försvinner också. Det viktiga är att studenterna lär sig så bra som möjligt, och frågan är om de blir bättre föreberedda för en lektion eller ett seminarium av att få en inspelad föreläsning än de skulle bli av få läsanvisningar eller en förberedelseuppgift.

25Apr2021

Journal club of one: ”Genome-wide enhancer maps link risk variants to disease genes”

Publicerat i english, genetik av mrtnj

(Here is a a paper from a recent journal club.)

Nasser et al. (2021) present a way to prioritise potential noncoding causative variants. This is part of solving the fine mapping problem, i.e. how to find the underlying causative variants from genetic mapping studies. They do it by connecting regulatory elements to genes and overlapping those regulatory elements with variants from statistical fine-mapping. Intuitively, it might not seem like connecting regulatory elements to genes should be that hard, but it is a tricky problem. Enhancers — because that is the regulatory element most of this is about; silencers and insulators get less attention — do not carry any sequence that tells us what gene they will act on. Instead, this needs to be measured or predicted somehow. These authors go the measurement route, combining chromatin sequencing with chromosome conformation capture.

This figure from the supplementary materials show what the method is about:

Additional figure 1 from Nasser et al. (2021) showing an overview of the workflow and an example of two sets of candidate variants derived from-fine mapping, each with variants that overlap enhancers connected to IL10.

They use chromatin sequence data (such as ATAC-seq, histone ChIP-seq or DNAse-seq) in combination with HiC chromosome conformation capture data to identify enhancers and connect them to genes (this was developed earlier in Fulco et al. 2019). The ”activity-by-contact model” means to multiply the contact frequency (from HiC) between promoter and enhancer with the enhancer activity (read counts from chromatin sequencing), standardised by the total contact–activity product with all elements within a window. Fulco et al. (2019) previously found that this conceptually simple model did well at connecting enhancers to genes (as evaluated with a CRISPR inhibition assay called CRISPRi-FlowFISH, which we’re not going into now, but it’s pretty ingenious).

In order to use this for fine-mapping, they calculated activity-by-contact maps for every gene combined with every open chromatin element within 5 Mbp for 131 samples from ENCODE and other sources. The HiC data were averaged of contacts in ten cell types, transformed to be follow a power-law distribution. That is, they did not do the HiC analysis per cell type, but use the same average HiC contact matrix combined with chromatin data from different cell types. Thus, the specificity comes from the promoters and enhancers that are called as active in each sample — I assume this is because the HiC data comes from a different (and smaller) set of cell types than the chromatin sequencing. Element–gene pairs that reached above a threshold were considered connected, for a total of about six million connections, involving 23,000 genes and 270,000 enhancers. On average, a gene had 2.8 enhancers and an enhancer connected to 2.7 genes.

They picked putative causative variants by overlapping the variant sets with these activity-by-contact maps and selecting the highest scoring enhancer gene pair.They used fine-mapping results from multiple previous studies. These variants were estimated with different methods, but they are all some flavour of fine-mapping by variable selection. Statistical fine mapping estimate sets of variants, called credible sets, that have high posterior probability of being the causative variant. They included only completely noncoding credible sets, i.e. those that did not include a coding sequence or splice variant. They applied this to 72 traits in humans, generating predictions for ~ 5000 noncoding credible sets of variants.

Did it work?

Variants for fine-mapping were enriched in connected enhancers more than in open chromatin in general, in cell types that are relevant to the traits. In particular, inflammatory bowel disease variants were enriched in enhancers in 65 samples, including immune cell types and gut cells. The strongest enrichment was in activated dendritic cells.

They used a set of genes previously known to be involved in inflammatory bowel disease, assuming that they were the true causative genes for their respective noncoding credible sets, and then compared their activity-by-contact based prioritisation of the causative gene to simply picking the closest gene. Picking the closest gene was right in 30 out of 37 sets. Picking the gene with the highest activity-by-contact score was right in 17 cases out of 18 sets that overlapped an activity-by-contact enhancer. Thus, this method had higher precision but worse recall. They also tested a number of eQTL-based, enrichment and enhancer–gene mapping methods, that did worse.

What it tells us about causative variants

Most of the putative causative variants, picked based on maximising activity-by-contact, were close to the proposed gene (median 13 kbp) and most involved the closest gene (77%). They were often found their putative causative variants to be located in enhancers that were only active in a few cell- or tissue types (median 4), compared to the promoters of the target genes, that were active in a broader set (median 120). For example, in inflammatory bowel disease, there were several examples where the putatively causal enhancer was only active in a particular immune cell or a stimulated state of an immune cell.

My thoughts

What I like about this model is that it is so different to many of the integrative and machine learning methods we see in genomics. It uses two kinds of data, and relatively simple computations. There is no machine learning. There is no sequence evolution or conservation analysis. Instead, measure two relevant quantities, standardise and preprocess them a bit, and then multiply them together.

If the success of the activity-by-contact model for prioritising causal enhancers generalises beyond the 18 causative genes investigated in the original paper, this is an argument for simple biology-based heuristics over machine learning models. It also suggest that, in the absence of contact data, one might do well by prioritising variant in enhancers that are highly active in relevant cell types, and picking the closest gene as the proposed causative gene.

However, the dataset needs to cover the relevant cell types, and possibly cells that are in the relevant stimulated states, meaning that it provides a motivation for rich conditional atlas-style datasets of chromatin and chromosome conformation.

I am, personally, a little bit sad that expression QTL methods seem to doing poorly. On the other hand, it makes some sense that eQTL might not have the genomic resolution to connect enhancers to genes.

Finally, if the relatively simple activity-by-contact model or the ridiculously simple method of ”picking the closest gene” beats machine learning models using the same data types and more, it suggests that the machine learning methods might not be solving theright problem. After all, they are not trained directly to prioritise variants for complex traits — because there are too few known causative variants for complex traits.

Literature

Fulco, C. P., Nasser, J., Jones, T. R., Munson, G., Bergman, D. T., Subramanian, V., … & Engreitz, J. M. (2019). Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nature genetics.

Nasser, J., Bergman, D. T., Fulco, C. P., Guckelberger, P., Doughty, B. R., Patwardhan, T. A., … & Engreitz, J. M. (2021). Genome-wide enhancer maps link risk variants to disease genes. Nature.

On unicorns and genes

Martin Johnsson's blog about genetics and sundry things

Författararkiv: mrtnj

Using R: plyr to purrr, part 1

More nostalgia about plyr

split/map: The modern synthesis

Nesting data into list columns

Coda: mixed model

”It depends on what you want to do with it”

A plot of genes on chromosomes

The plot and the final code to generate it

Getting the data out of the Ensembl REST API

Plotting code

Possible extensions

Journal club of one: ”A unifying concept of animal breeding programmes”

Defining and specifying breeding programmes

Breeding loops vs breeding graphs vs breeding forms

Formal specifications only apply to idealised breeding programmes

And the concept is not very formal in the first place

There is almost never a need for a definition

Finally, an interesting remark about efficiency

Paper: ”Genetic variation in recombination rate in the pig”

Preprint: ”Genetics of tibia bone properties of crossbred commercial laying hens in different housing systems”

Convincing myself about the Monty Hall problem

A generalised problem

Some code

The Monty Hall problem problem

Showing a difference in mean between two groups, take 2

Reflektioner om högskolepedagogik: undervisning online

Journal club of one: ”Genome-wide enhancer maps link risk variants to disease genes”

Did it work?

What it tells us about causative variants

My thoughts