Significance+novelty

Published in english, the practice of science by mrtnj

I’ve had reasons to read and think more about research grant applications over the last years; I’ve written some, exchanged feedback with colleagues, and I was on one of the review panels for Vetenskapsrådet last year. As a general observation, it appears that I’m not the only one who struggles with explaining the ”Significance and novelty” of my work. That’s a pretty banal observation. But why is that?

It’s easy to imagine that this difficulty is just because of the curse of knowledge, that researchers are so deeply invested in our research topics that it is hard for us to imagine anyone not intuitively understanding what topic X is about, and how vital this is to humanity. Oh, those lofty scientists levitating in their mushroom towers! I am sure that is partially right; the curse of knowledge is a big problem when writing about your science, but there is a bigger problem.

If we look at statements of significance (for example in my own early drafts), it is pretty common to see significance and novelty established in a disembodied way:

This work is significant because topic X is a Big Problem.

This establishes that the sub-field encompassing the work is important in a general way.

This work is novel because, despite sustained research, no-one has yet done experiment Y in species Z with approach Å.

This establishes that there is a particular gap in the sub-sub-field where this research fits.

What these sentences fail to establish is the causal chain that the reader cares about: Will performing this research, at this time, make a worthwhile contribution to solving the Big Problem?

And there might be a simple explanation: The kind of reasoning required here is unique to the grant application. When writing papers, it is sufficient to establish that the area around the work is important and that the work ”… offers insights …” in some manner. After all, the insights are offered right there in the paper. The reader can look at them and figure out the value for themselves. The reader of a grant application can’t, because the insights have not materialised yet.

When planning new work and convincing your immediate collaborators that the work is worth pursuing, you also don’t have to employ these kinds of arguments. The colleagues are likely motivated by other factors, like the direct implications for their work (and cv), how fun the new project will be, or how much they’d like to work with you. Again, the reader of the grant application needs another kind of convincing.

Thankfully, the funders help out. Here are some of the questions the VR peer review handbook (pdf) lists, that pertain to significance and novelty:

To what extent does the proposed project define new, interesting scientific questions?

To what extent does the proposed project use new ways and methods to address important scientific questions?

When applicable, is the proposed development of methods or techniques of high scientific significance? Does the proposed development allow new scientific questions to be addressed?

Maybe that helps. See you around, I need to go practice explaining how my work leads to new scientific questions.

7Feb2021

The Fulsome Principle

Published in english, the practice of science by mrtnj

”If it be aught to the old tune, my lord,
It is as fat and fulsome to mine ear
As howling after music.”
(Shakespeare, The Twelfth Night)

There are problematic words that can mean opposite things, I presume either because two once different expressions meandered in the space of meaning until they were nigh indistinguishable, or because something that already had a literal meaning went and became ironic. We can think of our favourites, like a particular Swedish expression that either means that you will get paid or not, or ”fulsome”. Is ”fulsome praise” a good thing (Merriam Webster Usage Note)?

Better avoid the ambiguity. I think this is a fitting name for an observation about language use.

The Fulsome Principle: smart people will gladly ridicule others for breaking supposed rules that are in fact poorly justified.

That is, if we take any strong position about what ”fulsome” means that doesn’t acknowledge the ambiguity, we are following a rule that is poorly justified. If we make fun of anyone for getting the rule wrong, condemning them for misusing and degrading the English language, we are embarrassingly wrong. We are also in the company of many other smart people who snicker before checking the dictionary. It could also be called the Strunk & White principle.

This is related to:

The Them Principle: If you think something sounds like a novel misuse and degradation of language, chances are it’s in Shakespeare.

This has everything to do with language use in science. How many times have you heard geneticists or evolutionary biologists haranguing some outsider to their field, science writer or student for misusing ”gene”, ”fitness”, ”adaptation” or similar? I would suspect: Many. How many times was the usage, in fact, in line with how the word is used in an adjacent sub-subfield? I would suspect: Also, many.

In ”A stylistic note” at the beginning of his book The Limits of Kindness (Hare 2013), philosopher Caspar Hare writes:

I suspect that quite often, when professional philosophers use specialized terms, they have subtly different senses of those terms in mind.

One example, involving not-so-subtly-different senses of a not-very-specialized term: We talk a great deal about biting the bullet. For example, ”I confronted David Lewis with my damning objection to modal realism, and he bit the bullet.” I have asked a number of philosophers about what, precisely, this means, and received a startling range of replies.

Around 70% say that the metaphor has to do with surgery. … So, in philosophy, for you to bite the bullet is for you to grimly accept seemingly absurd consequences of the theory you endorse. This is, I think, the most widely understood sense of the term.
/…/

Some others say that the metaphor has to do with injury … So in philosophy, for you to acknowledge that you are biting the bullet is for you to acknowledge that an objection has gravely wounded your theory.

/…/

One philosopher said to me that the metaphor has to do with magic. To bite a bullet is to catch a bullet, Houdini-style, in your teeth. So, in philosophy, for you to bite the bullet is for you to elegantly intercept a seemingly lethal objection and render it benign.

/…/

I conclude from my highly unscientific survey that, more than 30 percent of the time, when a philosopher claims to be ”biting the bullet,” a small gap opens up between what he or she means and what his or her reader or listener understands him or her to mean.

And I guess these small gaps in understanding are more common than we normally think.

Please, don’t go back to my blog archive and look for cases of me railing against someone’s improper use of scientific language, because I’m sure I’ve done it too many times. Mea maxima culpa.

31Jan2021

The word ”genome”

Published in citerat, english, genetik by mrtnj

The sources I’ve seen attribute the coinage of ”genome” to botanist Hans Winkler (1920, p. 166).

The pertinent passage goes:

Ich schlage vor, für den haploiden Chromosomensatz, der im Verein mit dem zugehörigen Protoplasma die materielle Grundlage der systematischen Einheit darstellt den Ausdruck: das Genom zu verwenden …

I propose using the expression ”the genome” for the haploid set of chromosomes, which, together with the protoplasm it belongs to, makes up the material basis of the systematic unit …

That’s good, but why did Winkler need this term in the first place? In this chapter, he is dealing with the relationship between chromosome number and mode of reproduction. Of course, he’s going to talk about hybridization and ploidy, and he needs some terms to bring order to the mess. He goes on to coin a couple of other concepts that I had never heard of:

… und Kerne, Zellen und Organismen, in denen ein gleichartiges Genom mehr als einmal in jedem Kern vorhanden ist, homogenomatisch zu nennen, solche dagegen, die verschiedenartige Genome im Kern führen, heterogenomatisch.

So, a homogenomic organism has more than one copy of the same genome in its nuclei, while a heterogenomic organism has multiple genomes. He also suggests you could count the genomes, di-, tri- up to polygenomic organisms. He says that this is a different thing than polyploidy, which is when an organism has multiples of a haploid chromosome set. Winkler’s example: A hybrid between a diploid species with 10 chromosomes and another diploid species with 16 chromosomes might have 13 chromosomes and be polygenomic but not polyploid.

These terms don’t seem to have stuck as much, but I found them used here and there, for example in papers on bananas (Arvanitoyannis et al. 2008) and cotton (Brown & Menzel 1952); cooking bananas are heterogenomic.

This only really makes sense in cases with recent hybridisation, where you can trace different chromosomes to origins in different species. You need to be able to trace parts of the hybrid genome of the banana to genomes of other species. Otherwise, the genome of the banana is just the genome of the banana.

Analogously, we also find polygenomes in this cancer paper (Navin et al. 2010):

We applied our methods to 20 primary ductal breast carcinomas, which enable us to classify them according to whether they appear as either monogenomic (nine tumors) or polygenomic (11 tumors). We define ”monogenomic” tumors to be those consisting of an apparently homogeneous population of tumor cells with highly similar genome profiles throughout the tumor mass. We define ”polygenomic” tumors as those containing multiple tumor subpopulations that can be distinguished and grouped by similar genome structure.

This makes sense; if a tumour has clones of cells in it with a sufficiently rearranged genome, maybe it is fair to describe it as a tumour with different genomes. It raises the question of what counts as ”sufficiently” different for something to be a different genome.

How much difference can there be between sequences that are supposed to count as the same genome? In everything above, we have taken a kind of typological view: there is a genome of an individual, or a clone of cells, that can be thought of as one entity, despite the fact that every copy of it, in every different cell, is likely to have subtle differences. Philosopher John Dupré (2010), in ”The Polygenomic Organism”, questions what we mean by ”the genome” of an organism. How can we talk about an organism having one genome or another, when in fact, every cell in the body goes through mutation (actually, Dupré spends surprisingly little time on somatic mutation but more on epigenetics, but makes a similar point), sometimes chimerism, sometimes programmed genome rearrangements?

The genome is related to types of organism by attempts to find within it the essence of a species or other biological kind. This is a natural, if perhaps naïve, interpretation of the idea of the species ‘barcode’, the use of particular bits of DNA sequence to define or identify species membership. But in this paper I am interested rather in the relation sometimes thought to hold between genomes of a certain type and an individual organism. This need not be an explicitly essentialist thesis, merely the simple factual belief that the cells that make up an organism all, as a matter of fact, have in common the inclusion of a genome, and the genomes in these cells are, barring the odd collision with a cosmic ray or other unusual accident, identical.

Dupré’s answer is that there probably isn’t a universally correct way to divide living things into individuals, and what concept of individuality one should use really depends on what one wants to do with it. I take this to mean that it is perfectly fine to gloss over real biological details, but that we need to keep in mind that they might unexpectedly start to matter. For example, when tracing X chromosomes through pedigrees, it might be fine to ignore that X-inactivation makes female mammals functionally mosaic, until you start looking at the expression of X-linked traits.

Photo of calico cat in Amsterdam by SpanishSnake (CC0 1.0). See, I found a reason to put in a cat picture!

Finally, the genome exists not just in the organism, but also in the computer, as sequences, maps and obscure bioinformatics file formats. Arguably, keeping the discussion above in mind, the genome only exists in the computer, as a scientific model of a much messier biology. Szymanski, Vermeulen & Wong (2019) investigate what the genome is by looking at how researchers talk about it. ”The genome” turns out to be many things to researchers. Here they are writing about what happened when the yeast genetics community created a reference genome.

If the digital genome is not assumed to be solely a representation of a physical genome, we might instead see ”the genome” as a discursive entity moving from the cell to the database but without ever removing ”the genome” from the cell, aggregating rather than excluding. This move and its inherent multiplying has consequences for the shape of the community that continues to participate in constructing the genome as a digital text. It also has consequences for the work the genome can perform. As Chadarevian (2004) observes for the C. elegans genome sequence, moving the genome from cell to database enables it to become a new kind of mapping tool …

/…/

Consequently, the informational genome can be used to manufacture coherence across knowledge generated by disparate labs by making it possible to line up textual results – often quite literally, in the case of genome sequences as alphabetic texts — and read across them.

/…/

Prior to the availability of the reference genome, such coherence across the yeast community was generated by strain sharing practices and standard protocols and notation for documenting variation from the reference strain, S288C, authoritatively embodied in living cells housed at Mortimer’s stock center. After the sequencing project, part of that work was transferred to the informational, textual yeast genome, making the practice of lining up and making the same available to those who worked with the digital text as well as those who worked with the physical cell.

And that brings us back to Winkler: What does the genome have in common? That it makes up the basis for the systematic unit, that it belongs to organisms that we recognize as closely related enough to form a systematic unit.

Literature

Winkler H. (1920) Verbreitung und Ursache der Parthenogenesis im Pflanzen- und Tierreiche.

Arvanitoyannis, Ioannis S., et al. ”Banana: cultivars, biotechnological approaches and genetic transformation.” International journal of food science & technology 43.10 (2008): 1871-1879.

Navin, Nicholas, et al. ”Inferring tumor progression from genomic heterogeneity.” Genome research 20.1 (2010): 68-80.

Brown, Meta S., and Margaret Y. Menzel. ”Polygenomic hybrids in Gossypium. I. Cytology of hexaploids, pentaploids and hexaploid combinations.” Genetics 37.3 (1952): 242.

Dupré, John. ”The polygenomic organism.” The Sociological Review 58.1_suppl (2010): 19-31.

Szymanski, Erika, Niki Vermeulen, and Mark Wong. ”Yeast: one cell, one reference sequence, many genomes?.” New Genetics and Society 38.4 (2019): 430-450.

24Jan2021

A model of polygenic adaptation in an infinite population

Published in english, genetik by mrtnj

How do allele frequencies change in response to selection? Answers to that question include ”it depends”, ”we don’t know”, ”sometimes a lot, sometimes a little”, and ”according to a nonlinear differential equation that actually doesn’t look too horrendous if you squint a little”. Let’s look at a model of the polygenic adaptation of an infinitely large population under stabilising selection after a shift in optimum. This model has been developed by different researchers over the years (reviewed in Jain & Stephan 2017).

Here is the big equation for allele frequency change at one locus:

$\dot{p}_i = -s \gamma_i p_i q_i (c_1 - z') - \frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i) + \mu (q_i - p_i )$

That wasn’t so bad, was it? These are the symbols:

the subscript i indexes the loci,
$\dot{p}$ is the change in allele frequency per time,
$\gamma_i$ is the effect of the locus on the trait (twice the effect of the positive allele to be precise),
$p_i$ is the frequency of the positive allele,
$q_i$ the frequency of the negative allele,
$s$ is the strength of selection,
$c_1$ is the phenotypic mean of the population; it just depends on the effects and allele frequencies,
$z'$ is the new phenotypic optimum,
$\mu$ is the mutation rate.

This breaks down into three terms that we will look at in order.

The directional selection term

$-s \gamma_i p_i q_i (c_1 - z')$

is the term that describes change due to directional selection.

Apart from the allele frequencies, it depends on the strength of directional selection $s$ , the effect of the locus on the trait $\gamma_i$ and how far away the population is from the new optimum $(c_1 - z')$ . Stronger selection, larger effect or greater distance to the optimum means more allele frequency change.

It is negative because it describes the change in the allele with a positive effect on the trait, so if the mean phenotype is above the optimum, we would expect the allele frequency to decrease, and indeed: when

$(c_1 - z') > 0$

this term becomes negative.

If you neglect the other two terms and keep this one, you get Jain & Stephan's "directional selection model", which describes behaviour of allele frequencies in the early phase before the population has gotten close to the new optimum. This approximation does much of the heavy lifting in their analysis.

The stabilising selection term

$-\frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i)$

is the term that describes change due to stabilising selection. Apart from allele frequencies, it depends on the square of the effect of the locus on the trait. That means that, regardless of the sign of the effect, it penalises large changes. This appears to make sense, because stabilising selection strives to preserve traits at the optimum. The cubic influence of allele frequency is, frankly, not intuitive to me.
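One way to get a feel for the cubic is to write the term in $p_i$ alone and look at its sign. This is only algebra on the expression above, assuming $s > 0$; it is not taken from Jain & Stephan's paper:

$-\frac{s \gamma_i^2}{2} p_i q_i (q_i - p_i) = -\frac{s \gamma_i^2}{2} p_i (1 - p_i) (1 - 2 p_i)$

The factor $p_i (1 - p_i)$ is never negative, so the sign is set by $(1 - 2 p_i)$. When $p_i < 1/2$ the term is negative and pushes the frequency further down; when $p_i > 1/2$ it is positive and pushes the frequency further up; at $p_i = 1/2$ it vanishes. Read this way, the term moves each locus away from intermediate frequencies, and more strongly the larger $\gamma_i^2$ is.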

The mutation term

Finally,

$\mu (q_i - p_i )$

is the term that describes change due to new mutations. It depends on the allele frequencies, i.e. how many of the alleles there are around that can mutate into the other alleles, and the mutation rate. To me, this is the one term one could sit down and write down, without much head-scratching.

Walking in allele frequency space

Jain & Stephan (2017) show a couple of examples of allele frequency change after the optimum shift. Let us try to draw similar figures. (Jain & Stephan don’t give the exact parameters for their figures, they just show one case with effects below their threshold value and one with effects above.)

First, here is the above equation in R code:

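# Phenotypic mean c_1; note that 2p - 1 is the same as p - q
pheno_mean <- function(p, gamma) {
  sum(gamma * (2 * p - 1))
}

# Allele frequency change; p and gamma are vectors with one element per locus
allele_frequency_change <- function(s, gamma, p, z_prime, mu) {
  q <- 1 - p
  directional <- -s * gamma * p * q * (pheno_mean(p, gamma) - z_prime)
  stabilising <- -s * gamma^2 * 0.5 * p * q * (q - p)
  mutation <- mu * (q - p)
  directional + stabilising + mutation
}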

With this (and some extra packaging; code on Github), we can now plot allele frequency trajectories such as this one, which starts at some arbitrary point and approaches an optimum:

Animation of alleles at two loci approaching an equilibrium. Here, we have two loci with starting frequencies 0.2 and 0.1 and effect size 1 and 0.01, and the optimum is at 0. The mutation rate is 10^-4 and the strength of selection is 1. Animation made with gganimate.

Resting in allele frequency space

The model describes a shift from one optimum to another, so we want to start at equilibrium. Therefore, we need to know what the allele frequencies are at equilibrium, so we solve for 0 allele frequency change in the above equation. The first term will be zero, because

$(c_1 - z') = 0$

when the mean phenotype is at the optimum. So, we can throw away that term, and factor the rest of the equation into:

$(1 - 2p) (-\frac{s \gamma ^2}{2} p(1-p) + \mu) = 0$

Therefore, one root is $p = 1/2$ . Depending on your constitution, this may or may not be intuitive to you. Imagine that you have all the loci, each with a positive and negative allele with the same effect, balanced so that half the population has one and the other half has the other. Then, there is this quadratic equation that gives two other equilibria:

$\mu - \frac{s\gamma^2}{2}p(1-p) = 0$
$\implies p = \frac{1}{2} (1 \pm \sqrt{1 - 8 \frac{\mu}{s \gamma ^2}})$
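Spelling out the steps between those two lines (plain algebra, with the locus subscript dropped as above): move the mutation rate to the other side, divide by $\frac{s \gamma^2}{2}$, and apply the usual formula for the roots of a quadratic.

$\frac{s \gamma^2}{2} p (1 - p) = \mu$
$\implies p^2 - p + \frac{2 \mu}{s \gamma^2} = 0$
$\implies p = \frac{1 \pm \sqrt{1 - 4 \cdot \frac{2 \mu}{s \gamma^2}}}{2} = \frac{1}{2} (1 \pm \sqrt{1 - 8 \frac{\mu}{s \gamma^2}})$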

These points correspond to mutation–selection balance with one or the other allele closer to being lost. Jain & Stephan (2017) show a figure of the three equilibria that looks like a semicircle (from the quadratic equation, presumably) attached to a horizontal line at 0.5 (their Figure 1). Given this information, we can start our loci out at equilibrium frequencies. Before we set them off, we need to attend to the effect size.

How big is a big effect? Hur långt är ett snöre? (How long is a piece of string?)

In this model, there are big and small effects with qualitatively different behaviours. The cutoff is at:

$\hat{\gamma} = \sqrt{ \frac{8 \mu}{s}}$

If we look again at the roots to the quadratic equation above, they can only exist as real roots if

$\frac {8 \mu}{s \gamma^2} < 1$

because otherwise the expression inside the square root will be negative. This inequality can be rearranged into:

$\gamma^2 > \frac{8 \mu}{s}$

This means that if the effect of a locus is smaller than the threshold value, there is only one equilibrium point, and that is at 0.5. It also affects the way the allele frequency changes. Let us look at two two-locus cases, one where the effects are below this threshold and one where they are above it.

threshold <- function(mu, s) sqrt(8 * mu / s)

threshold(1e-4, 1)

[1] 0.02828427

With mutation rate of 10^-4 and strength of selection of 1, the cutoff is about 0.028. Let our ”big” loci have effect sizes of 0.05 and our small loci have effect sizes of 0.01, then. Now, we are ready to shift the optimum.

The small loci will start at an equilibrium frequency of 0.5. We start the large loci at two different equilibrium points, where one positive allele is frequent and the other positive allele is rare:

get_equilibrium_frequencies <- function(mu, s, gamma) {
  c(0.5,
    0.5 * (1 + sqrt(1 - 8 * mu / (s * gamma^2))),
    0.5 * (1 - sqrt(1 - 8 * mu / (s * gamma^2))))
}

(eq0.05 <- get_equilibrium_frequencies(1e-4, 1, 0.05))

[1] 0.50000000 0.91231056 0.08768944

get_equilibrium_frequencies(1e-4, 1, 0.01)

[1] 0.5 NaN NaN
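As a sanity check, the same numbers can be worked out by hand from the root formula, using the parameter values already given ($\mu = 10^{-4}$, $s = 1$). For the large effect, $\gamma = 0.05$:

$8 \frac{\mu}{s \gamma^2} = \frac{8 \times 10^{-4}}{0.0025} = 0.32$
$\implies p = \frac{1}{2} (1 \pm \sqrt{0.68}) \approx \frac{1}{2} (1 \pm 0.8246) \approx 0.912 \text{ or } 0.088$

For the small effect, $\gamma = 0.01$:

$8 \frac{\mu}{s \gamma^2} = \frac{8 \times 10^{-4}}{10^{-4}} = 8$

so the expression under the square root is $1 - 8 = -7$, which has no real square root. That is why R returns NaN for the two outer equilibria, and only 0.5 remains.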

Look at them go!

These animations show the same qualitative behaviour as Jain & Stephan illustrate in their Figure 2. With small effects, there is gradual allele frequency change at both loci:

However, with large effects, one of the loci (the one on the vertical axis) dramatically changes in allele frequency, that is, it is experiencing a selective sweep, while the other one barely changes at all. And the model will show similar behaviour when the trait is properly polygenic, with many loci, as long as effects are large compared to the (scaled) mutation rate.

Here, I ran 10,000 time steps; if we look at the phenotypic means, we can see that they still haven’t arrived at the optimum at the end of that time. The mean with large effects is at 0.089 (new optimum of 0.1), and the mean with small effects is 0.0063 (new optimum: 0.02).

Let’s end here for today. Maybe another time, we can return to how this model applies to actually polygenic architectures, that is, with more than two loci. The code for all the figures is on Github.

Literature

Jain, K., & Stephan, W. (2017). Modes of rapid polygenic adaptation. Molecular biology and evolution, 34(12), 3169-3175.

17Jan2021

The genomic scribe in hyperspace

Published in english, genetik by mrtnj

When I was in school (it must have been in gymnasiet, roughly corresponding to secondary school or high school), I remember giving a presentation on a group project about the human genome project, and using the illiterate copyist analogy. After sequencing the human genome, we are able to blindly copy the text of life; we still need to learn to read it. At this point, I had no clue whatsoever that I would be working in genetics in the future. I certainly felt very clever coming up with that image. I must have read it somewhere.

If it is true that the illiterate scribe is a myth, and they must have had at least some ability to read, that makes the analogy more apt: even in 2003, researchers actually had a fairly good idea of how to read certain aspects of genetics. The genetic code is from 1961, for crying out loud (Yanofsky 2007)!

My classroom moment must have been around 2003, which is the year the ENCODE project started, aiming to do just that: create an encyclopedia (or really, a critical apparatus) of the human genome. It’s still going: a drove of papers from its third phase came out last year, and apparently it’s now in the fourth phase. ENCODE can’t be a project in the usual sense of a planned undertaking with a defined goal, but rather a research programme in the general direction of ”a comprehensive parts list of functional elements in the human genome” (ENCODE FAQ). Along with the phase 3 empirical papers, they published a fun perspective article (The ENCODE Project Consortium et al. 2020).

ENCODE commenced as an ambitious effort to comprehensively annotate the elements in the human genome, such as genes, control elements, and transcript isoforms, and was later expanded to annotate the genomes of several model organisms. Mapping assays identified biochemical activities and thus candidate regulatory elements.

The age means that ENCODE has lived through generations of genomic technologies. Phase 1 was doing functional genomics with microarrays, which now sounds about as quaint as doing it with blots. Nowadays, they have CRISPR-based editing assays and sequencing methods for chromosome 3D structure that just seem to keep adding Cs to their acronyms.

Last time I blogged about the ENCODE project was in 2013 (in Swedish), in connection with the opprobrium about junk DNA. If you care about junk DNA, check out Sean Eddy’s FAQ (Eddy 2012). If you still want to be angry about what percentage of the genome has function, what gene concepts are useful and the relationship between quantitative genetics and genomics, check out this Nature Video. It’s funny, because the video pre-empts some of the conclusions of the perspective article.

The video says: to do many of the potentially useful things we want to do with genomes (like sock cancer in the face, presumably), we need to look at individual differences (”between you, and you, and you”) and how they relate to traits. And an encyclopedia, great as it may be, is not going to capture that.

The perspective says:

It is now apparent that elements that govern transcription, chromatin organization, splicing, and other key aspects of genome control and function are densely encoded in the human genome; however, despite the discovery of many new elements, the annotation of elements that are highly selective for particular cell types or states is lagging behind. For example, very few examples of condition-specific activation or repression of transcriptional control elements are currently annotated in ENCODE. Similarly, information from human fetal tissue, reproductive organs and primary cell types is limited. In addition, although many open chromatin regions have been mapped, the transcription factors that bind to these sequences are largely unknown, and little attention has been devoted to the analysis of repetitive sequences. Finally, although transcript heterogeneity and isoforms have been described in many cell types, full-length transcripts that represent the isoform structure of spliced exons and edits have been described for only a small number of cell types.

That is, the future of genomics is in variation. We want to know about: organismic/developmental background (cell lines vs primary vs induced vs tissue), environmental variation (condition-dependence), genetic variation (gene editing assays that change local genetic variants, the genetic background of different cell lines and human genomes), dynamics (time and induction). To put it in plain terms: We need to know how genome regulation differs between cells and between individuals, and what that does to them. To put it in fancy terms: we are moving towards cellular phenomics, quantitative genomics, and an ever-expanding hypercube of data.

Literature

Eddy, S. R. (2012). The C-value paradox, junk DNA and ENCODE. Current biology, 22(21), R898-R899.

ENCODE Project Consortium, Snyder, M. P., Gingeras, T. R., Moore, J. E., Weng, Z., Gerstein, M. B., Ren, B., … & Myers, R. M. (2020). Perspectives on ENCODE. Nature, 583(7818), 693-698.

Yanofsky, C. (2007). Establishing the triplet nature of the genetic code. Cell, 128(5), 815-818.

10Jan2021

Shell stuff I didn’t know

Published in computer stuff, english by mrtnj

I generally stay away from doing anything more complicated in a shell script than making a directory and running an R script or a single binary, and especially avoid awk and sed as much as possible. However, sometimes the shell actually does offer a certain elegance and convenience (and sometimes deceitful traps).

Here are three things I only learned recently:

Stripping directory and suffix from file names

Imagine we have a project where files are named with the sample ID followed by some extension, like so:

project/data/sample1.g.vcf
project/data/sample2.g.vcf
project/data/sample3.g.vcf

Quite often, we will want to grab all the files in a directory and extract the base name without extension and without the whole path leading up to the file. There is a shell command for this called basename:

basename -s .g.vcf project/data/sample*.g.vcf

sample1
sample2
sample3

The -s flag gives the suffix to remove.

This is much nicer than trying to regexp it, for example with R:

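library(stringr)

# full.names = TRUE keeps the directory part, which the regexp expects
files <- dir("project/data", full.names = TRUE)
# str_match returns a matrix; column 2 holds the first capture group
basename <- str_match(files, "^.*/(.+)\\.g\\.vcf")[, 2]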

Look at that second argument … ”^.*/(.+)\\.g\\.vcf” What is this?! And let me tell you, that was not my first attempt at writing that regexp either. Those of us who can interpret this gibberish must acknowledge that we have learned to do so only through years of suffering.

For that matter, it’s also nicer than the bash suffix and prefix deletion syntax, which is one of those things I think one has to google every time.

for string in project/data/*.g.vcf; do
    nosuffix=${string%.g.vcf}
    noprefix=${nosuffix#project/data/}
    echo "$noprefix"
done
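Since this syntax is so easy to forget, here is a small sketch of the four operators side by side. The path is a made-up example following the naming scheme above, and the variable names are only for illustration:

```shell
# Hypothetical path, following the naming scheme above
path="project/data/sample1.g.vcf"

# % removes the shortest matching suffix, %% the longest
short_suffix=${path%.*}     # project/data/sample1.g
long_suffix=${path%%.*}     # project/data/sample1

# # removes the shortest matching prefix, ## the longest
short_prefix=${path#*/}     # data/sample1.g.vcf
long_prefix=${path##*/}     # sample1.g.vcf

# Combining the two gives the same result as basename -s .g.vcf
no_dir=${path##*/}
sample_id=${no_dir%.g.vcf}  # sample1

echo "$short_suffix" "$long_suffix" "$short_prefix" "$long_prefix" "$sample_id"
```

Note that the longest-suffix form only behaves like this because none of the directory names contain a dot; with a path like my.project/data/sample1.g.vcf it would cut at the first dot of the whole string.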

Logging both standard out and standard error

When sending jobs off to a server to be run without you looking at them, it’s often convenient to save the output to a file. To redirect standard output to a file, use ”>”, like so:

./script_that_prints_output.sh > out_log.txt

However, there is also another output stream used to record (among other things) error messages (in some programs; this isn’t very consistent). Therefore, we should probably log the standard error stream too. To redirect standard error to a file:

./script_that_prints_output.sh 2> error_log.txt

And to redirect both to the same file:

./script_that_prints_output.sh > combined_log.txt 2>&1

The last bit is telling the shell to redirect the standard error stream to standard out, and then both of them get captured in the file. I didn’t know until recently that one could do this.
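One thing that can trip you up is that the order of the two redirections matters, because the shell processes them from left to right. Here is a minimal sketch; the function noisy is a made-up stand-in for a script that writes to both streams, and the log files go into a temporary directory:

```shell
# Hypothetical stand-in for a script that writes to both streams
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

log_dir=$(mktemp -d)

# Standard out is sent to the file first, then standard error is pointed
# at standard out: both lines end up in the file
noisy > "$log_dir/both.txt" 2>&1

# Reversed order: standard error is pointed at where standard out goes
# right now (the terminal), and only then is standard out sent to the file
noisy 2>&1 > "$log_dir/stdout_only.txt"

# Count the lines that were captured in each file
both_lines=$(grep -c "" "$log_dir/both.txt")
stdout_only_lines=$(grep -c "" "$log_dir/stdout_only.txt")
echo "$both_lines" "$stdout_only_lines"
```

In the second case, the error message still shows up on screen, but it never reaches the log file.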

The above code contained some dots, and speaking of that, here is a deceitful shell trap to trip up the novice:

The dot command (oh my, this is so bad)

When working on a certain computer system, there is a magic invocation that needs to be in the script to be able to use the module system. It should look like this:

. /etc/profile.d/modules.sh

That means ”source the script found at /etc/profile.d/modules.sh”, which will activate the module system for you.

It should not look like this:

./etc/profile.d/modules.sh

bash: ./etc/profile.d/modules.sh: No such file or directory

That means that bash tries to find a file called ”etc/profile.d/modules.sh” located in the current directory — which (probably) doesn’t exist.

If there is a space after the dot, it is a command that means the same as source, i.e. run a script from a file. If there is no space after the dot, it means a relative file path, which is also often used to run a script. I had never actually thought about it until someone took away the space after the dot, and got the above error message (plus something else more confusing, because a module was missing).
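
The difference also matters when the file does exist. Here is a small sketch with a made-up script, settings.sh, written to a temporary directory (the file name and the variable are assumptions for illustration). Sourcing runs the script in the current shell, so what it sets up stays around; running it as a program happens in a child process, and the settings vanish when it exits. That is presumably why the module setup needs to be sourced.

```shell
# Made-up script that sets a variable
demo_dir=$(mktemp -d)
cat > "$demo_dir/settings.sh" <<'EOF'
demo_setting="loaded"
EOF

unset demo_setting

# Running the script as a separate program (which is also what
# ./settings.sh would do if the file were executable): the variable
# is set in a child process and is gone when that process exits
sh "$demo_dir/settings.sh"
after_run="${demo_setting:-unset}"

# Sourcing the script with the dot command: it runs in the current
# shell, so the variable is still here afterwards
. "$demo_dir/settings.sh"
after_source="${demo_setting:-unset}"

echo "after running: $after_run"
echo "after sourcing: $after_source"
```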

3Jan2021

2020 blog recap

Published in dear diary, english by mrtnj

Dear diary,

During 2020, ”On unicorns and genes” published a total of 29 posts (not including this one, because it’s scheduled for 2021). This means that I kept on schedule for the beginning of the year, then had an extended blog vacation in the fall. I did write a little bit more in Swedish (about an attempt at Crispr debate, a course I took in university pedagogy, and some more about that course) which was one of the ambitions.

Let’s pick one post per month to represent the blogging year of 2020:

January: Things that really don’t matter: megabase or megabasepair. This post deals with a pet peeve of mine: should we write physical distances in genetics as base pairs (bp) or bases?

February: Using R: from plyr to purrr, part 0 out of however many. (Part one might appear at some point, I’m sure.) Finally, the purrr tidyverse package has found a place in my code. It’s still not the first tool I reach for when I need to apply a function, but it’s getting there.

March: Preprint: ”Genetics of recombination rate variation in the pig”. Preprint post about our work with genetic mapping of recombination rate in the pig.

April: Virtual animal breeding journal club: ”An eQTL in the cystathionine beta synthase gene is linked to osteoporosis in laying hens”. The virtual animal breeding journal club, organised by John Cole, was one of the good things that happened in 2020. I don’t know if it will live on in 2021, but if not, it was a treat as long as it lasted. This post contains my slides from when I presented a recent paper, from some colleagues, about the genetics of bone quality in chickens.

May: Robertson on genetic correlation and loss of variation. A post about a paper by Alan Robertson from 1959. This paper is reasonably often cited as a justification for 0.80 as some kind of cut-off for when a genetic correlation is sufficiently different from 1 to be important. That is not really what the paper says.

June: Journal club of one: ”Genomic predictions for crossbred dairy cattle”. My reading of a paper about genomic evaluation for crossbred cattle in the US.

July: Twin lambs with different fathers. An all too brief methods description prompted me to write some R code. This might be my personal favourite of the year.

August: Journal club of one: ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes”. Journal club post about a preprint with a neat-looking genome assembly strategy. This is where the posts start becoming sparse.

…

December: One notebook’s worth of work. Introspective post about my attempts to organise my work.

In other news, trips were cancelled, Zoom teaching happened, and I finally got the hang of working from home. We received funding for a brand new research project about genome dynamics during animal breeding. There will be lots of sequence data. There will be simulations. It starts next year, and I will write more about it later.

Also, Uppsala is sometimes quite beautiful:

27Dec2020

The next notebook of work

Published in dear diary, english, the practice of science by mrtnj

Dear diary,

The last post was about my attempt to use the Getting Things Done method to bring some more order to research, work, and everything. This post will contain some more details about my system, at a little less than a year into the process, on the off chance that anyone wants to know. This post will use some Getting Things Done jargon without explaining it. There are many useful guides online, plus of course the book itself.

Medium

Most of my system lives in paper notebooks. The main notebook contains my action list, projects list, waiting for list and agendas plus a section for notes. I quickly learned that the someday/maybe lists won’t fit, so I now have a separate (bigger) notebook for those. My calendar is digital. I also use a note taking app for project support material, and as an extra inbox for notes I jot down on my phone. Thus, I guess it’s a paper/digital hybrid.

Contexts

I have five contexts: email/messaging, work computer, writing, office and home. There were more in the beginning, but I gradually took out the ones I didn’t use. They need to be few enough and map cleanly to situations, so that I remember to look at them. I added the writing context because I tend to treat, and schedule, writing tasks separately from other work tasks. The writing context also includes writing-adjacent support tasks such as updating figures, going through reviewer comments or searching for references.

Inboxes

I have a total of nine inboxes, if you include all the email accounts and messenger services where people might contact me about things I need to do. That sounds excessive, but only three of those are where I put things for myself (physical inbox, notes section of notebook, and notes app), and so far they’re all getting checked regularly.

Capture

I do most of my capture in the notes app on my phone (when not at a desk) or on a piece of paper (when at my desk). When I get back to having in-person meetings, I assume more notes are going to end up in the physical notebook, because it’s nicer to take meeting notes on paper than on a phone.

Agendas

The biggest thing I changed in the new notebook was to dedicate much more space to agendas, but it’s already almost full! It turns out there are lots of things ”I should talk to X about the next time we’re speaking”, rather than send X an email immediately. Who knew?

Waiting for

This is probably my favourite. It is useful to have a list of who has said they will get back to me, when, and about what. That little date next to their name helps me not feel like a nag when I ask them again after a reasonable time, and makes me appreciate them more when they respond quickly.

Weekly review

I already had the habit of scheduling an appointment with myself on Fridays (or otherwise towards the end of the week) to go over some recurring items. I’ve expanded this appointment to do a weekly review of the notebook, calendar, someday/maybe list, and some other bespoke checklist items. I bribe myself with sweets to support this habit.

Things I’d like to improve

Here are some of the things I want to improve:

The project list. A project sensu Getting Things Done can be anything from purchasing new shoes to taking over the world. The project list is supposed to keep track of what you’ve undertaken to do, and make sure you have come up with actions that progress them. My project list isn’t very complete, and doesn’t spark new actions very often.
Project backlogs. On the other hand, I have some things on the project list that are projects in a greater sense, and will have literally thousands of actions, both from me and others. These obviously need planning ahead beyond the next thing to do. I haven’t yet figured out the best way to keep a backlog of future things to do in a project, potentially with dependencies, and feed them into my list of things to do when they become current.
Notes. I have a strong note taking habit, but a weak note reading habit. Essentially, many of my notes are write-only; this feels like a waste. I’ve started my attempts to improve the situation with meeting notes: trying to take five minutes right after a meeting (if possible) to go over the notes, extract any calendar items, actions and waiting-fors, and decide whether I need to save the note or if I can throw it away. What to do about research notes from reading and from seminars is another matter.

20Dec2020

One notebook’s worth of work

Published in dear diary, english, the practice of science by mrtnj

Dear diary,

”If I could just spend more time doing stuff instead of worrying about it …” (Me, at several points over the years.)

I started this notebook in spring last year and recently filled it up. It contains my first implementation of the system called ”Getting Things Done” (see the book by David Allen with the same name). Let me tell you a little about how it’s going.

The way I organised my work, with to-do lists, calendar, work journal, and routines for dealing with email had pretty much grown organically up until the beginning of this year. I’d gotten some advice, I’d read the odd blog post and column about email and calendar blocking, but beyond some courses in project management (which are a topic for another day), I’d gotten myself very little instruction on how to do any of this. How does one actually keep a good to-do list? Are there principles and best practices? I was aware that Getting Things Done was a thing, and last spring, a mention in passing on the Teaching in Higher Ed podcast prompted me to give it a try.

I read up a little. The book was right there in the university library, unsurprisingly. I also used a blog post by Alberto Taiuti about doing Getting Things Done in a notebook, and read some other writing by researchers about how they use the method (Robert Talbert and Veronika Cheplygina).

There is enough out there about this already that I won’t make my own attempt to explain the method in full, but here are some of the interesting particulars:

You are supposed to be careful about how you organise your to-do lists. You’re supposed to make sure everything on the list is a clear, unambiguous next action that you can start doing when you see it. Everything else that needs thinking, deciding, mulling over, reflecting etc, goes somewhere else, not on your list of things to do. This means that you can easily pick something off your list and start work on it.

You are supposed to be careful about your calendar. You’re supposed to only put things in there that have a fixed date and time attached, not random reminders or aspirational scheduling of things you would like to do. This means that you can easily look at your calendar and know what your day, week and month look like.

You are supposed to be careful to record everything you think about that matters. You’re supposed to take a note as soon as you have a potentially important thought and put it in a dedicated place that you will check and go through regularly. This means that you don’t have to keep things in your head.

This sounds pretty straightforward, doesn’t it? Well, despite having to-do lists, calendars and a habit of note-taking for years, I’ve not been very disciplined about any of this before. My to-do list items have often been vague, too big tasks that are hard to get started on. My calendar has often contained aspirational planning entries that didn’t survive contact with the realities of the workday. I often delude myself that I’ll remember an idea or a decision, only to have it quietly slip out of my mind.

Have I become more productive, or less stressed? The honest answer is that I don’t know. I don’t have a reliable way to track either productivity or stress levels, and even if I did: the last year has not really been comparable to the year before, for several reasons. However, I feel like thinking more about how I organise my work makes a difference, and I’ve felt a certain joy working on the process, as well as a certain dread when looking at it all organised in one place. Let’s keep going and see where this takes us.

13Dec2020

Against question and answer time

Published in english, the practice of science by mrtnj

Here is a semi-serious suggestion: Let’s do away with questions and answers after talks.

I’ll preface with two examples:

First, a scientist I respect highly had just given a talk. As we were chatting away afterwards, I referred to someone who had asked a question during the talk. The answer: ”I didn’t pay attention. I don’t listen when people talk at me like that.”

Second, Swedish author Göran Hägg had this little joke about question and answer time. I paraphrase from memory: Question time is useless because no reasonable person who has a useful contribution will be socially uninhibited enough to ask a question in a public forum (at least not in Sweden). To phrase it more nicely: Having a useful contribution and feeling comfortable to speak up might not be that well correlated.

I have two intuitions about this. On the one hand, there’s the idea that science thrives on vigorous criticism. I have been at talks where people bounce questions at the speaker, even during the talk and even with pretty serious criticisms, and it works just fine. I presume it has to do with respect, skill at asking and answering, and the power and knowledge differentials between interlocutors.

On the other hand, we would prefer to have a good conversation and productive arguments, and I’m sure everyone has been in seminar rooms where that wasn’t the case. It’s not a good conversation if, say, questions and answers turn into old established guys (sic) shouting down students. In some cases, it seems the asker is not after a productive argument, nor indeed any honest attempt to answer the question. (You might be able to tell by them barking a new question before the respondent has finished.)

Personally, I’ve turned to asking fewer questions. If it’s something I’ve misunderstood, it’s unlikely that I will get the explanation I need without conversation and interaction. If I have a criticism, it’s unlikely that I will get the best possible answer from the speaker on the spot. If I didn’t like the seminar, am upset with the speaker’s advisor, hate it when people mangle the definition of ”epigenetics” or when someone shows a cartoon of left-handed DNA, it’s my problem and not something I need to share with the audience.

I think questions and answers is one of the things that has actually benefitted from the move to digital seminars at a distance, where questions are often written in chat. This might be because of a difference in tone between writing a question down and asking it verbally, or thanks to the filtering capabilities of moderators.

On unicorns and genes

Martin Johnsson's blog about genetics and sundry things

Category archive: english
