A model of polygenic adaptation in an infinite population

How do allele frequencies change in response to selection? Answers to that question include ”it depends”, ”we don’t know”, ”sometimes a lot, sometimes a little”, and ”according to a nonlinear differential equation that actually doesn’t look too horrendous if you squint a little”. Let’s look at a model of the polygenic adaptation of an infinitely large population under stabilising selection after a shift in optimum. This model has been developed by different researchers over the years (reviewed in Jain & Stephan 2017).

Here is the big equation for allele frequency change at one locus:

\dot{p}_i = -s \gamma_i p_i q_i (c_1 - z') - \frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i) + \mu (q_i - p_i )

That wasn’t so bad, was it? These are the symbols:

  • the subscript i indexes the loci,
  • \dot{p} is the change in allele frequency per time,
  • \gamma_i is the effect of the locus on the trait (twice the effect of the positive allele to be precise),
  • p_i is the frequency of the positive allele,
  • q_i the frequency of the negative allele,
  • s is the strength of selection,
  • c_1 is the phenotypic mean of the population; it just depends on the effects and allele frequencies
  • \mu is the mutation rate.

This breaks down into three terms that we will look at in order.

The directional selection term

-s \gamma_i p_i q_i (c_1 - z')

is the term that describes change due to directional selection.

Apart from the allele frequencies, it depends on the strength of directional selection s, the effect of the locus on the trait \gamma_i and how far away the population is from the new optimum (c_1 - z'). Stronger selection, larger effect or greater distance to the optimum means more allele frequency change.

It is negative because it describes the change in the allele with a positive effect on the trait, so if the mean phenotype is above the optimum, we would expect the allele frequency to decrease, and indeed: when

(c_1 - z') < 0

this term becomes negative.

If you neglect the other two terms and keep this one, you get Jain & Stephan's "directional selection model", which describes behaviour of allele frequencies in the early phase before the population has gotten close to the new optimum. This approximation does much of the heavy lifting in their analysis.

The stabilising selection term

-\frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i)

is the term that describes change due to stabilising selection. Apart from allele frequencies, it depends on the square of the effect of the locus on the trait. That means that, regardless of the sign of the effect, it penalises large changes. This appears to make sense, because stabilising selection strives to preserve traits at the optimum. The cubic influence of allele frequency is, frankly, not intuitive to me.

The mutation term

Finally,

\mu (q_i - p_i )

is the term that describes change due to new mutations. It depends on the allele frequencies, i.e. how of the alleles there are around that can mutate into the other alleles, and the mutation rate. To me, this is the one term one could sit down and write down, without much head-scratching.

Walking in allele frequency space

Jain & Stephan (2017) show a couple of examples of allele frequency change after the optimum shift. Let us try to draw similar figures. (Jain & Stephan don’t give the exact parameters for their figures, they just show one case with effects below their threshold value and one with effects above.)

First, here is the above equation in R code:

pheno_mean <- function(p, gamma) {
  sum(gamma * (2 * p - 1))
}

allele_frequency_change <- function(s, gamma, p, z_prime, mu) {
  -s * gamma * p * (1 - p) * (pheno_mean(p, gamma) - z_prime) +
    - s * gamma^2 * 0.5 * p * (1 - p) * (1 - p - p) +
    mu * (1 - p - p)
}

With this (and some extra packaging; code on Github), we can now plot allele frequency trajectories such as this one, which starts at some arbitrary point and approaches an optimum:

Animation of alleles at two loci approaching an equilibrium. Here, we have two loci with starting frequencies 0.2 and 0.1 and effect size 1 and 0.01, and the optimum is at 0. The mutation rate is 10-4 and the strength of selection is 1. Animation made with gganimate.

Resting in allele frequency space

The model describes a shift from one optimum to another, so we want want to start at equilibrium. Therefore, we need to know what the allele frequencies are at equilibrium, so we solve for 0 allele frequency change in the above equation. The first term will be zero, because

(c_1 - z') = 0

when the mean phenotype is at the optimum. So, we can throw away that term, and factor the rest equation into:

(1 - 2p) (-\frac{s \gamma ^2}{2} p(1-p) + \mu) = 0

Therefore, one root is p = 1/2. Depending on your constitution, this may or may not be intuitive to you. Imagine that you have all the loci, each with a positive and negative allele with the same effect, balanced so that half the population has one and the other half has the other. Then, there is this quadratic equation that gives two other equilibria:

\mu - \frac{s\gamma^2}{2}p(1-p) = 0
\implies p = \frac{1}{2} (1 \pm \sqrt{1 - 8 \frac{\mu}{s \gamma ^2}})

These points correspond to mutation–selection balance with one or the other allele closer to being lost. Jain & Stephan (2017) show a figure of the three equilibria that looks like a semicircle (from the quadratic equation, presumably) attached to a horizontal line at 0.5 (their Figure 1). Given this information, we can start our loci out at equilibrium frequencies. Before we set them off, we need to attend to the effect size.

How big is a big effect? Hur långt är ett snöre?

In this model, there are big and small effects with qualitatively different behaviours. The cutoff is at:

\hat{\gamma} = \sqrt{ \frac{8 \mu}{s}}

If we look again at the roots to the quadratic equation above, they can only exist as real roots if

\frac {8 \mu}{s \gamma^2} < 1

because otherwise the expression inside the square root will be negative. This inequality can be rearranged into:

\gamma^2 > \frac{8 \mu}{s}

This means that if the effect of a locus is smaller than the threshold value, there is only one equilibrium point, and that is at 0.5. It also affects the way the allele frequency changes. Let us look at two two-locus cases, one where the effects are below this threshold and one where they are above it.

threshold <- function(mu, s) sqrt(8 * mu / s)

threshold(1e-4, 1)
[1] 0.02828427

With mutation rate of 10-4 and strength of selection of 1, the cutoff is about 0.028. Let our ”big” loci have effect sizes of 0.05 and our small loci have effect sizes of 0.01, then. Now, we are ready to shift the optimum.

The small loci will start at an equilibrium frequency of 0.5. We start the large loci at two different equilibrium points, where one positive allele is frequent and the other positive allele is rare:

get_equilibrium_frequencies <- function(mu, s, gamma) {
  c(0.5,
    0.5 * (1 + sqrt(1 - 8 * mu / (s * gamma^2))),
    0.5 * (1 - sqrt(1 - 8 * mu / (s * gamma^2))))
}

(eq0.05 <- get_equilibrium_frequencies(1e-4, 1, 0.05))
[1] 0.50000000 0.91231056 0.08768944
get_equlibrium_frequencies(1e-4, 1, 0.01)
[1] 0.5 NaN NaN

Look at them go!

These animations show the same qualitative behaviour as Jain & Stephan illustrate in their Figure 2. With small effects, there is gradual allele frequency change at both loci:

However, with large effects, one of the loci (the one on the vertical axis) dramatically changes in allele frequency, that is it’s experiencing a selective sweep, while the other one barely changes at all. And the model will show similar behaviour when the trait is properly polygenic, with many loci, as long as effects are large compared to the (scaled) mutation rate.

Here, I ran 10,000 time steps; if we look at the phenotypic means, we can see that they still haven’t arrived at the optimum at the end of that time. The mean with large effects is at 0.089 (new optimum of 0.1), and the mean with small effects is 0.0063 (new optimum: 0.02).

Let’s end here for today. Maybe another time, we can return how this model applies to actually polygenic architectures, that is, with more than two loci. The code for all the figures is on Github.

Literature

Jain, K., & Stephan, W. (2017). Modes of rapid polygenic adaptation. Molecular biology and evolution, 34(12), 3169-3175.

The genomic scribe in hyperspace

When I was in school (it must have been in gymnasiet, roughly corresponding to secondary school or high school), I remember giving a presentation on a group project about the human genome project, and using the illiterate copyist analogy. After sequencing the human genome, we are able to blindly copy the text of life; we still need to learn to read it. At this point, I had no clue whatsoever that I would be working in genetics in the future. I certainly felt very clever coming up with that image. I must have read it somewhere.

If it is true that the illiterate scribe is a myth, and they must have had at least some ability to read, that makes the analogy more apt: even in 2003, researchers actually had a fairly good idea of how to read certain aspects of genetics. The genetic code is from 1961, for crying out loud (Yanofsky 2007)!

My classroom moment must have been around 2003, which is the year the ENCODE project started, aiming to do just that: create an encyclopedia (or really, a critical apparatus) of the human genome. It’s still going: a drove of papers from its third phase came out last year, and apparently it’s now in the fourth phase. ENCODE can’t be a project in the usual sense of a planned undertaking with a defined goal, but rather a research programme in the general direction of ”a comprehensive parts list of functional elements in the human genome” (ENCODE FAQ). Along with the phase 3 empirical papers, they published a fun perspective article (The ENCODE Project Consortium et al. 2020).

ENCODE commenced as an ambitious effort to comprehensively annotate the elements in the human genome, such as genes, control elements, and transcript isoforms, and was later expanded to annotate the genomes of several model organisms. Mapping assays identified biochemical activities and thus candidate regulatory elements.

The age means that ENCODE has lived through generations of genomic technologies. Phase 1 was doing functional genomics with microarrays, which now sounds about as quaint as doing it with blots. Nowadays, they have CRISPR-based editing assays and sequencing methods for chromosome 3D structure that just seem to keep adding Cs to their acronyms.

Last time I blogged about the ENCODE project was in 2013 (in Swedish), in connection with the opprobrium about junk DNA. If you care about junk DNA, check out Sean Eddy’s FAQ (Eddy 2012). If you still want to be angry about what percentage of the genome has function, what gene concepts are useful and the relationship between quantitative genetics and genomics, check out this Nature Video. It’s funny, because the video pre-empts some of the conclusions of the perspective article.

The video says: to do many of the potentially useful things we want to do with genomes (like sock cancer in the face, presumably), we need to look at individual differences (”between you, and you, and you”) and how they relate to traits. And an encyclopedia, great as it may be, is not going to capture that.

The perspective says:

It is now apparent that elements that govern transcription, chromatin organization, splicing, and other key aspects of genome control and function are densely encoded in the human genome; however, despite the discovery of many new elements, the annotation of elements that are highly selective for particular cell types or states is lagging behind. For example, very few examples of condition-specific activation or repression of transcriptional control elements are currently annotated in ENCODE. Similarly, information from human fetal tissue, reproductive organs and primary cell types is limited. In addition, although many open chromatin regions have been mapped, the transcription factors that bind to these sequences are largely unknown, and little attention has been devoted to the analysis of repetitive sequences. Finally, although transcript heterogeneity and isoforms have been described in many cell types, full-length transcripts that represent the isoform structure of spliced exons and edits have been described for only a small number of cell types.

That is, the future of genomics is in variation. We want to know about: organismic/developmental background (cell lines vs primary vs induced vs tissue), environmental variation (condition-dependence), genetic variation (gene editing assays that change local genetic variants, the genetic background of different cell line and human genomes), dynamics (time and induction). To put it in plain terms: We need to know how the genome regulation of different cells and individuals are different, and what that does to them. To put it in fancy terms: we are moving towards cellular phenomics, quantitative genomics, and an ever-expanding hypercube of data.

Literature

Eddy, S. R. (2012). The C-value paradox, junk DNA and ENCODE. Current biology, 22(21), R898-R899.

ENCODE Project Consortium, Snyder, M. P., Gingeras, T. R., Moore, J. E., Weng, Z., Gerstein, M. B., Ren, B., … & Myers, R. M. (2020). Perspectives on ENCODE. Nature, 583(7818), 693-698.

Yanofsky, C. (2007). Establishing the triplet nature of the genetic code. Cell, 128(5), 815-818.

Shell stuff I didn’t know

I generally stay away from doing anything more complicated in a shell script than making a directory and running an R script or a single binary, and especially avoid awk and sed as much as possible. However, sometimes the shell actually does offer a certain elegance and convenience (and sometimes deceitful traps).

Here are three things I only learned recently:

Stripping directory and suffix from file names

Imagine we have a project where files are named with the sample ID followed by some extension, like so:

project/data/sample1.g.vcf
project/data/sample2.g.vcf
project/data/sample3.g.vcf

Quite often, we will want to grab all the in a directory and extract the base name without extension and without the whole path leading up to the file. There is a shell command for this called basename:

basename -s .g.vcf project/data/sample*.g.vcf
sample1
sample2
sample3

The -s flag gives the suffix to remove.

This is much nicer than trying to regexp it, for example with R:

library(stringr)

files <- dir("project/data")
basename <- str_match(files, "^.*/(.+)\\.g\\.vcf")

Look at that second argument … ”^.*/(.+)\\.g\\.vcf” What is this?! And let me tell you, that was not my first attempt at writing that regexp either. Those of us who can interpret this gibberish must acknowledge that we have learned to do so only through years of suffering.

For that matter, it’s also than the bash suffix and prefix deletion syntax, which is one of those things I think one has to google every time.

for string in project/data/*.g.vcf; do
    nosuffix=${string%.g.vcf}
    noprefix=${nosuffix#project/data/}
    echo $noprefix
done

Logging both standard out and standard error

When sending jobs off to a server to be run without you looking at them, it’s often convenient to save the output to a file. To redirect standard output to a file, use ”>”, like so:

./script_that_prints_output.sh > out_log.txt

However, there is also another output stream used to record (among other things) error messages (in some programs; this isn’t very consistent). Therefore, we should probably log the standard error stream too. To redirect standard error to a file:

./script_that_prints_output.sh 2> error_log.txt

And to redirect both to the same file:

./script_that_prints_output.sh > combined_log.txt 2>&1

The last bit is telling the shell to redirect the standard error stream to standard out, and then both of them get captured in the file. I didn’t know until recently that one could do this.

The above code contained some dots, and speaking of that, here is a deceitful shell trap to trip up the novice:

The dot command (oh my, this is so bad)

When working on a certain computer system, there is a magic invocation that needs to be in the script to be able to use the module system. It should look like this:

. /etc/profile.d/modules.sh

That means ”source the script found at /etc/profiles.d/modules.sh” — which will activate the module system for you.

It should not look like this:

./etc/profile.d/modules.sh
bash: ./etc/profile.d/modules.sh: No such file or directory

That means that bash tries to find a file called ”etc/profile.d/modules.sh” located in the current directory — which (probably) doesn’t exist.

If there is a space after the dot, it is a command that means the same as source, i.e. run a script from a file. If there is no space after the dot, it means a relative file path — also often used to run a script. I had never actually thought about it until someone took away the space before the dot, and got the above error message (plus something else more confusing, because a module was missing).

2020 blog recap

Dear diary,

During 2020, ”On unicorns and genes” published a total of 29 posts (not including this one, because it’s scheduled for 2021). This means that I kept on schedule for the beginning of the year, then had an extended blog vacation in the fall. I did write a little bit more in Swedish (about an attempt at Crispr debate, a course I took in university pedagogy, and some more about that course) which was one of the ambitions.

Let’s pick one post per month to represent the blogging year of 2020:

January: Things that really don’t matter: megabase or megabasepair. This post deals with a pet peeve of mine: should we write physical distances in genetics as base pairs (bp) or bases?

February: Using R: from plyr to purrr, part 0 out of however many. (Part one might appear at some point, I’m sure.) Finally, the purrr tidyverse package has found a place in my code. It’s still not the first tool I reach for when I need to apply a function, but it’s getting there.

March: Preprint: ”Genetics of recombination rate variation in the pig”. Preprint post about our work with genetic mapping of recombination rate in the pig.

April: Virtual animal breeding journal club: ”An eQTL in the cystathionine beta synthase gene is linked to osteoporosis in laying hens”. The virtual animal breeding journal club, organised by John Cole, was one of the good things that happened in 2020. I don’t know if it will live on in 2021, but if not, it was a treat as long as it lasted. This post contains my slides from when I presented a recent paper, from some colleagues, about the genetics of bone quality in chickens.

May: Robertson on genetic correlation and loss of variation. A post about a paper by Alan Robertson from 1959. This paper is reasonably often cited as a justification for 0.80 as some kind of cut-off for when a genetic correlation is sufficiently different enough from 1 to be important. That is not at really what the paper says.

June: Journal club of one: ”Genomic predictions for crossbred dairy cattle”. My reading on a paper about genomic evaluation for crossbred cattle in the US.

July: Twin lambs with different fathers. An all too brief methods description prompted me to write some R code. This might be my personal favourite of the year.

August: Journal club of one: ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes”. Journal club post about a preprint with a neat-looking genome assembly strategy. This is where the posts start becoming sparse.

December: One notebook’s worth of work. Introspective post about my attempts to organise my work.

In other news, trips were cancelled, Zoom teaching happened, and I finally got the hang of working from home. We received funding for a brand new research project about genome dynamics during animal breeding. There will be lots of sequence data. There will be simulations. It starts next year, and I will write more about it later.

Also, Uppsala is sometimes quite beautiful:

The next notebook of work

Dear diary,

The last post was about my attempt to use the Getting Things Done method to bring some more order to research, work, and everything. This post will contain some more details about my system, at a little less than a year into the process, on the off chance that anyone wants to know. This post will use some Getting Things Done jargon without explaining it. There are many useful guides online, plus of course the book itself.

Medium

Most of my system lives in paper notebooks. The main notebook contains my action list, projects list, waiting for list and agendas plus a section for notes. I quickly learned that the someday/maybe lists won’t fit, so I now have a separate (bigger) notebook for those. My calendar is digital. I also use a note taking app for project support material, and as an extra inbox for notes I jot down on my phone. Thus, I guess it’s a paper/digital hybrid.

Contexts

I have five contexts: email/messaging, work computer, writing, office and home. There were more in the beginning, but I gradually took out the ones I didn’t use. They need to be few enough and map cleanly to situations, so that I remember to look at them. I added the writing context because I tend to treat, and schedule, writing tasks separately from other work tasks. The writing context also includes writing-adjacent support tasks such as updating figures, going through reviewer comments or searching for references.

Inboxes

I have a total of nine inboxes, if you include all the email accounts and messenger services where people might contact me about things I need to do. That sounds excessive, but only three of those are where I put things for myself (physical inbox, notes section of notebook, and notes app), and so far they’re all getting checked regularly.

Capture

I do most of my capture in the notes app on my phone (when not at a desk) or on piece of paper (when at my desk). When I get back to having in-person meetings, I assume more notes are going to end up in the physical notebook, because it’s nicer to take meeting notes on paper than on a phone.

Agendas

The biggest thing I changed in the new notebook was to dedicate much more space to agendas, but it’s already almost full! It turns out there are lots of things ”I should talk to X about the next time we’re speaking”, rather than send X an email immediately. Who knew?

Waiting for

This is probably my favourite. It is useful to have a list of who have said they will get back to me, when, and about what. That little date next to their name helps me not feel like a nag when I ask them again after a reasonable time, and makes me appreciate them more when they respond quickly.

Weekly review

I already had the habit of scheduling an appointment with myself on Fridays (or otherwise towards the end of the week) to go over some recurring items. I’ve expanded this appointment to do a weekly review of the notebook, calendar, someday/maybe list, and some other bespoke checklist items. I bribe myself with sweets to support this habit.

Things I’d like to improve

Here are some of the things I want to improve:

  • The project list. A project sensu Getting Things Done can be anything from purchase new shoes to taking over the world. The project list is supposed to keep track of what you’ve undertaken to do, and make sure you have come up with actions that progress them. My project list isn’t very complete, and doesn’t spark new actions very often.
  • Project backlogs. On the other hand, I have some things on the project list that are projects in a greater sense, and will have literally thousands of actions, both from me and others. These obviously need planning ahead beyond the next thing to do. I haven’t yet figured out the best way to keep a backlog of future things to do in a project, potentially with dependencies, and feed them into my list of things to do when they become current.
  • Notes. I have a strong note taking habit, but a weak note reading habit. Essentially, many of my notes are write-only; this feels like a waste. I’ve started my attempts to improve the situation with meeting notes: trying to take five minutes right after a meeting (if possible) to go over the notes, extract any calendar items, actions and waiting-fors, and decide whether I need to save the note or if I can throw it away. What to do about research notes from reading and from seminars is another matter.

One notebook’s worth of work

Image: an Aviagen sponsored notebook from the 100 Years of Genetics meeting in Edinburgh, with post-its sticking out, next to a blue Ballograf pen

Dear diary,

”If could just spend more time doing stuff instead of worrying about it …” (Me, at several points over the years.)

I started this notebook in spring last year and recently filled it up. It contains my first implementation of the system called ”Getting Things Done” (see the book by David Allen with the same name). Let me tell you a little about how it’s going.

The way I organised my work, with to-do lists, calendar, work journal, and routines for dealing with email had pretty much grown organically up until the beginning of this year. I’d gotten some advice, I’d read the odd blog post and column about email and calendar blocking, but beyond some courses in project management (which are a topic for another day), I’d gotten myself very little instruction on how to do any of this. How does one actually keep a good to-do list? Are there principles and best practices? I was aware that Getting Things Done was a thing, and last spring, a mention in passing on the Teaching in Higher Ed podcast prompted me to give it a try.

I read up a little. The book was right there in the university library, unsurprisingly. I also used a blog post by Alberto Taiuti about doing Getting Things Done in a notebook, and read some other writing by researchers about how they use the method (Robert Talbert and Veronika Cheplygina).

There is enough out there about this already that I won’t make my own attempt to explain the method in full, but here are some of the interesting particulars:

You are supposed to be careful about how you organise your to-do lists. You’re supposed to make sure everything on the list is a clear, unambiguous next action that you can start doing when you see it. Everything else that needs thinking, deciding, mulling over, reflecting etc, goes somewhere else, not on your list of thing to do. This means that you can easily pick something off your list and start work on it.

You are supposed to be careful about your calendar. You’re supposed to only put things in there that have a fixed date and time attached, not random reminders or aspirational scheduling of things you would like to do. This means that you can easily look at your calendar and know what your day, week and month look like.

You are supposed to be careful to record everything you think about that matters. You’re supposed to take a note as soon as you have a potentially important thought and put it in a dedicated place that you will check and go through regularly. This means that you don’t have to keep things in your head.

This sounds pretty straightforward, doesn’t it? Well, despite having to-do lists, calendars and a habit of note-taking for years, I’ve not been very disciplined about any of this before. My to-do list items have often been vague, too big tasks that are hard to get started on. My calendar has often contained aspirational planning entries that didn’t survive contact with the realities of the workday. I often delude myself that I’ll remember an idea or a decision, to have quietly it slip out of my mind.

Have I become more productive, or less stressed? The honest answer is that I don’t know. I don’t have a reliable way to track either productivity or stress levels, and even if I did: the last year has not really been comparable to the year before, for several reasons. However, I feel like thinking more about how I organise my work makes a difference, and I’ve felt a certain joy working on the process, as well as a certain dread when looking at it all organised in one place. Let’s keep going and see where this takes us.

Against question and answer time

Here is a semi-serious suggestion: Let’s do away with questions and answers after talks.

I’ll preface with two examples:

First, a scientist I respect highly had just given a talk. As we were chatting away afterwards, I referred to someone who had asked a question during the talk. The answer: ”I didn’t pay attention. I don’t listen when people talk at me like that.”

Second, Swedish author Göran Hägg had this little joke about question and answer time. I paraphrase from memory: Question time is useless because no reasonable person who has a useful contribution will be socially uninhibited enough to ask a question in a public forum (at least not in Sweden). To phrase it more nicely: Having a useful contribution and feeling comfortable to speak up might not be that well correlated.

I have two intuitions about this. On the one hand, there’s the idea that science thrives on vigorous criticism. I have been at talks where people bounce questions at the speaker, even during the talk and even with pretty serious criticisms, and it works just fine. I presume it has to do both with respect, skill at asking and answering, and the power and knowledge differentials between interlocutors.

On the other hand, we would prefer to have a good conversation and productive arguments, and I’m sure everyone has been in seminar rooms where that wasn’t the case. It’s not a good conversation if, say, question and answers turn into old established guys (sic) shouting down students. In some cases, it seems the asker is not after a productive argument, nor indeed any honest attempt to answer the question. (You might be able to tell by them barking a new question before the respondent has finished.)

Personally, I’ve turned to asking fewer questions. If it’s something I’ve misunderstood, it’s unlikely that I will get the explanation I need without conversation and interaction. If I have a criticism, it’s unlikely that I will get the best possible answer from the speaker on the spot. If I didn’t like the seminar, am upset with the speaker’s advisor, hate it when people mangle the definition of ”epigenetics” or when someone shows a cartoon of left-handed DNA, it’s my problem and not something I need to share with the audience.

I think questions and answers is one of thing that actually has benefitted from a move to digital seminars on a distance, where questions are often written in chat. This might be because of a difference in tone between writing a question down or asking it verbally, or thanks to the filtering capabilities of moderators.

My talk at the ChickenStress Genomics and Bioinformatics Workshop

A few months ago I gave a talk at the ChickenStress Genomics and Bioinformatics Workshop about genetic mapping of traits and gene expression.

ChickenStress is a European training network of researchers who study stress in chickens, as you might expect. It brings together people who work with (according to the work package names) environmental factors, early life experiences and genetics. The network is centered on a group of projects by early stage researchers — by the way, I think that’s a really good way to describe the work of a PhD student — and organises activities like this workshop.

I was asked to talk about our work from my PhD on gene expression and behaviour in the chicken (Johnsson & al. 2018, Johnsson & al. 2016), concentrating on concepts and methods rather than results. If I have any recurring readers, they will already know that brief is exactly what I like to do. I talked about the basis of genetic mapping of traits and gene expression, what data one needs to do it, and gave a quick demo for a flavour of an analysis workflow (linear mixed model genome-wide association in GEMMA).

Here are slides, and the git repository of the demo:

Various positions II

Again, what good is a blog if you can’t post your arbitrary idiosyncratic opinions as if you were an authority?

Don’t make a conference app

I get it, you can’t print a full-blown paper program book: it is too much, no one reads it, and it feels wasteful. But please, please, for the love of everything holy, don’t make an app. Put the text, straight up, on a website in plaintext. It loads quickly, it’s searchable, it can be automatically generated. The conference app will be cloddy, take up space on the phone, eat bandwidth on some strained mobile contract, and invariably freeze.

Posters, still bad in 2020

Don’t believe the lies: a once folded canvas poster will never look good again. You haven’t had fun on a conference before you’ve tried ironing a poster on a hostel floor with an iron that belongs in a museum.

Poster sessions are bad by necessity. If they had had space and time to be anything other than a crowded mess, the conference would have to accept substantially fewer posters. That means fewer participants, probably especially earlier career participants, and the value of having them outweighs the value of a somewhat better poster session.

Gene accession numbers

PLOS Genetics has a great policy in their submission guidelines that doesn’t seem to get followed very much in papers they actually publish. This should be the norm in every genetics paper. I feel bad that it’s not the case in all my papers.

As much as possible, please provide accession numbers or identifiers for all entities such as genes, proteins, mutants, diseases, etc., for which there is an entry in a public database, for example:

Ensembl
Entrez Gene
FlyBase
InterPro
Mouse Genome Database (MGD)
Online Mendelian Inheritance in Man (OMIM)
PubChem

Identifiers should be provided in parentheses after the entity on first use.

In the future, with the right ontologies and repositories in place, I hope this will be the case with traits, methods and so on as well.

UK Biobank and dbGAP are not open data

And that is fine.

Stop it with the work-life balance tweets

No-one should tweet about work-life balance; whether you write about how much you work or how diligent you are about your hours, it comes off as bragging.

Tenses

Write your papers in the past or present tense, whichever you prefer. In the context of a scientific paper, the difference between past and present communicates nothing. I suppose you’re not supposed to mix tenses, but that doesn’t matter either. Most readers probably won’t notice. If you ask me about my stylistic opinion: present tense for everything. But again, it doesn’t matter.

Journal club of one: ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes”

Genome assembly researchers are still figuring out new wild ways of combining different kinds of data. For example, ”trio binning” took what used to be a problem, — the genetic difference between the two genome copies that a diploid individual carries — and turned it into a feature: if you assemble a hybrid individual with genetically distant parents, you can separate the two copies and get two genomes in one. (I said that admixture was the future of every part of genetics, didn’t I?) This paper (Campoy et al. 2020) describes ”gamete binning” which uses sequencing of gametes perform a similar trick.

Expressed another way, gamete binning means building an individual-specific genetic map and then using it to order and phase the pieces of the assembly. This means two sequence datasets from the same individual — one single cell short read dataset from gametes (10X linked reads) and one long read dataset from the parent (PacBio) — and creatively re-using them in different ways.

This is what they do:

1. Assemble the long reads into a preliminary assembly, which will be a mosaic of the two genome copies (barring gross differences, ”haplotigs”, that can to some extent be removed).

2. Align the single cell short reads to the preliminary assembly and call SNPs. (They also did some tricks to deal with regions without SNPs, separating those that were not variable between genomes and those that were deleted in one genome.) Because the gametes are haploid, they get the phase of the parent’s genotype.

3. Align the long reads again. Now, based on the phased genotype, the long reads can be assigned to the genome copy they belong to. So they can partition the reads into one bin per genome copy and chromosome.

4. Assemble those bins separately. They now get one assembly for each homologous chromosome.

They apply it to an apricot tree, which has a 250 Mbp genome. When they sequence the parents of the tree, it seems to separate the genomes well. The two genome copies have quite a bit of structural variation:

Despite high levels of synteny, the two assemblies revealed large-scale rearrangements (23 inversions, 1,132 translocation/transpositions and 2,477 distal duplications) between the haplotypes making up more than 15% of the assembled sequence (38.3 and 46.2 Mb in each of assemblies; Supplementary Table 1). /…/ Mirroring the huge differences in the sequences, we found the vast amount of 942 and 865 expressed, haplotype-specific genes in each of the haplotypes (Methods; Supplementary Tables 2-3).

They can then go back to the single cell data and look at the recombination landscape and at chromosomal arrangements during meiosis.

This is pretty elegant. I wonder how dependent it is on the level of variation within the individual, and how it compares in cost and finickiness to other assembly strategies.

Literature

Campoy, José A., et al. ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes.” BioRxiv (2020).