The genomic scribe in hyperspace

When I was in school (it must have been in gymnasiet, roughly corresponding to secondary school or high school), I remember giving a presentation on a group project about the human genome project, and using the illiterate copyist analogy. After sequencing the human genome, we are able to blindly copy the text of life; we still need to learn to read it. At this point, I had no clue whatsoever that I would be working in genetics in the future. I certainly felt very clever coming up with that image. I must have read it somewhere.

If it is true that the illiterate scribe is a myth, and they must have had at least some ability to read, that makes the analogy more apt: even in 2003, researchers actually had a fairly good idea of how to read certain aspects of genetics. The genetic code is from 1961, for crying out loud (Yanofsky 2007)!

My classroom moment must have been around 2003, which is the year the ENCODE project started, aiming to do just that: create an encyclopedia (or really, a critical apparatus) of the human genome. It’s still going: a drove of papers from its third phase came out last year, and apparently it’s now in the fourth phase. ENCODE can’t really be called a project in the usual sense of a planned undertaking with a defined goal; it is more of a research programme in the general direction of ”a comprehensive parts list of functional elements in the human genome” (ENCODE FAQ). Along with the phase 3 empirical papers, they published a fun perspective article (The ENCODE Project Consortium et al. 2020).

ENCODE commenced as an ambitious effort to comprehensively annotate the elements in the human genome, such as genes, control elements, and transcript isoforms, and was later expanded to annotate the genomes of several model organisms. Mapping assays identified biochemical activities and thus candidate regulatory elements.

This long lifespan means that ENCODE has lived through several generations of genomic technology. Phase 1 was doing functional genomics with microarrays, which now sounds about as quaint as doing it with blots. Nowadays, they have CRISPR-based editing assays and sequencing methods for chromosome 3D structure that just seem to keep adding Cs to their acronyms.

Last time I blogged about the ENCODE project was in 2013 (in Swedish), in connection with the opprobrium about junk DNA. If you care about junk DNA, check out Sean Eddy’s FAQ (Eddy 2012). If you still want to be angry about what percentage of the genome has function, what gene concepts are useful and the relationship between quantitative genetics and genomics, check out this Nature Video. It’s funny, because the video pre-empts some of the conclusions of the perspective article.

The video says: to do many of the potentially useful things we want to do with genomes (like sock cancer in the face, presumably), we need to look at individual differences (”between you, and you, and you”) and how they relate to traits. And an encyclopedia, great as it may be, is not going to capture that.

The perspective says:

It is now apparent that elements that govern transcription, chromatin organization, splicing, and other key aspects of genome control and function are densely encoded in the human genome; however, despite the discovery of many new elements, the annotation of elements that are highly selective for particular cell types or states is lagging behind. For example, very few examples of condition-specific activation or repression of transcriptional control elements are currently annotated in ENCODE. Similarly, information from human fetal tissue, reproductive organs and primary cell types is limited. In addition, although many open chromatin regions have been mapped, the transcription factors that bind to these sequences are largely unknown, and little attention has been devoted to the analysis of repetitive sequences. Finally, although transcript heterogeneity and isoforms have been described in many cell types, full-length transcripts that represent the isoform structure of spliced exons and edits have been described for only a small number of cell types.

That is, the future of genomics is in variation. We want to know about: organismic/developmental background (cell lines vs primary vs induced vs tissue), environmental variation (condition-dependence), genetic variation (gene editing assays that change local genetic variants, the genetic background of different cell lines and human genomes), and dynamics (time and induction). To put it in plain terms: we need to know how the genome regulation of different cells and individuals differs, and what that does to them. To put it in fancy terms: we are moving towards cellular phenomics, quantitative genomics, and an ever-expanding hypercube of data.

Literature

Eddy, S. R. (2012). The C-value paradox, junk DNA and ENCODE. Current biology, 22(21), R898-R899.

ENCODE Project Consortium, Snyder, M. P., Gingeras, T. R., Moore, J. E., Weng, Z., Gerstein, M. B., Ren, B., … & Myers, R. M. (2020). Perspectives on ENCODE. Nature, 583(7818), 693-698.

Yanofsky, C. (2007). Establishing the triplet nature of the genetic code. Cell, 128(5), 815-818.

Shell stuff I didn’t know

I generally stay away from doing anything more complicated in a shell script than making a directory and running an R script or a single binary, and especially avoid awk and sed as much as possible. However, sometimes the shell actually does offer a certain elegance and convenience (and sometimes deceitful traps).

Here are three things I only learned recently:

Stripping directory and suffix from file names

Imagine we have a project where files are named with the sample ID followed by some extension, like so:

project/data/sample1.g.vcf
project/data/sample2.g.vcf
project/data/sample3.g.vcf

Quite often, we will want to grab all the files in a directory and extract the base name without extension and without the whole path leading up to the file. There is a shell command for this called basename:

basename -s .g.vcf project/data/sample*.g.vcf
sample1
sample2
sample3

The -s flag gives the suffix to remove.
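
In a script, the typical pattern is to run basename on one file at a time and capture the output, something like this (same made-up paths as above):

for file in project/data/*.g.vcf; do
    sample=$(basename -s .g.vcf "$file")
    echo "$sample"    # or do something more useful with the sample ID
done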

This is much nicer than trying to regexp it, for example with R:

library(stringr)

# full.names = TRUE keeps the leading path that the regexp expects
files <- dir("project/data", full.names = TRUE)
# the captured base name ends up in the second column of the match matrix
basename <- str_match(files, "^.*/(.+)\\.g\\.vcf")[, 2]

Look at that second argument … ”^.*/(.+)\\.g\\.vcf” What is this?! And let me tell you, that was not my first attempt at writing that regexp either. Those of us who can interpret this gibberish must acknowledge that we have learned to do so only through years of suffering.

For that matter, it’s also nicer than the bash suffix and prefix deletion syntax, which is one of those things I think one has to google every time.

for string in project/data/*.g.vcf; do
    nosuffix=${string%.g.vcf}
    noprefix=${nosuffix#project/data/}
    echo $noprefix
done

Logging both standard out and standard error

When sending jobs off to a server to be run without you looking at them, it’s often convenient to save the output to a file. To redirect standard output to a file, use ”>”, like so:

./script_that_prints_output.sh > out_log.txt

However, there is also a second output stream, standard error, which programs use to record (among other things) error messages, though how consistently they do so varies. Therefore, we should probably log the standard error stream too. To redirect standard error to a file:

./script_that_prints_output.sh 2> error_log.txt

And to redirect both to the same file:

./script_that_prints_output.sh > combined_log.txt 2>&1

The last bit is telling the shell to redirect the standard error stream to standard out, and then both of them get captured in the file. (The order matters: the 2>&1 part has to come after the redirection into the file; otherwise standard error goes wherever standard out pointed before, usually the terminal.) I didn’t know until recently that one could do this.
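
A related trick, in case you also want to see the output scroll by in the terminal while it gets logged, is to pipe both streams into tee:

./script_that_prints_output.sh 2>&1 | tee combined_log.txt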

The above code contained some dots, and speaking of that, here is a deceitful shell trap to trip up the novice:

The dot command (oh my, this is so bad)

When working on a certain computer system, there is a magic invocation that needs to be in the script to be able to use the module system. It should look like this:

. /etc/profile.d/modules.sh

That means ”source the script found at /etc/profile.d/modules.sh” — which will activate the module system for you.

It should not look like this:

./etc/profile.d/modules.sh
bash: ./etc/profile.d/modules.sh: No such file or directory

That means that bash tries to find a file called ”etc/profile.d/modules.sh” located in the current directory — which (probably) doesn’t exist.

If there is a space after the dot, it is a command that means the same as source, i.e. run a script from a file. If there is no space after the dot, it means a relative file path — also often used to run a script. I had never actually thought about it until someone took away the space after the dot, and got the above error message (plus something else more confusing, because a module was missing).
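
For what it’s worth, the difference matters beyond the error message: the dot/source command runs the script in your current shell, so anything it sets up sticks around, while running it as ./script.sh or bash script.sh starts a child process whose environment disappears when it finishes. That is why the module setup has to be sourced in the first place. A made-up example:

# settings.sh (made-up file) contains a single line: GREETING=hello

bash settings.sh    # runs in a child process; GREETING is not set in this shell
echo $GREETING      # prints an empty line

. settings.sh       # runs in the current shell; now GREETING is set
echo $GREETING      # prints "hello"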

2020 blog recap

Dear diary,

During 2020, ”On unicorns and genes” published a total of 29 posts (not including this one, because it’s scheduled for 2021). This means that I kept on schedule for the beginning of the year, then had an extended blog vacation in the fall. I did write a little bit more in Swedish (about an attempt at Crispr debate, a course I took in university pedagogy, and some more about that course) which was one of the ambitions.

Let’s pick one post per month to represent the blogging year of 2020:

January: Things that really don’t matter: megabase or megabasepair. This post deals with a pet peeve of mine: should we write physical distances in genetics as base pairs (bp) or bases?

February: Using R: from plyr to purrr, part 0 out of however many. (Part one might appear at some point, I’m sure.) Finally, the purrr tidyverse package has found a place in my code. It’s still not the first tool I reach for when I need to apply a function, but it’s getting there.

March: Preprint: ”Genetics of recombination rate variation in the pig”. Preprint post about our work with genetic mapping of recombination rate in the pig.

April: Virtual animal breeding journal club: ”An eQTL in the cystathionine beta synthase gene is linked to osteoporosis in laying hens”. The virtual animal breeding journal club, organised by John Cole, was one of the good things that happened in 2020. I don’t know if it will live on in 2021, but if not, it was a treat as long as it lasted. This post contains my slides from when I presented a recent paper, from some colleagues, about the genetics of bone quality in chickens.

May: Robertson on genetic correlation and loss of variation. A post about a paper by Alan Robertson from 1959. This paper is reasonably often cited as a justification for 0.80 as some kind of cut-off for when a genetic correlation is sufficiently different from 1 to be important. That is not really what the paper says.

June: Journal club of one: ”Genomic predictions for crossbred dairy cattle”. My reading on a paper about genomic evaluation for crossbred cattle in the US.

July: Twin lambs with different fathers. An all too brief methods description prompted me to write some R code. This might be my personal favourite of the year.

August: Journal club of one: ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes”. Journal club post about a preprint with a neat-looking genome assembly strategy. This is where the posts start becoming sparse.

December: One notebook’s worth of work. Introspective post about my attempts to organise my work.

In other news, trips were cancelled, Zoom teaching happened, and I finally got the hang of working from home. We received funding for a brand new research project about genome dynamics during animal breeding. There will be lots of sequence data. There will be simulations. It starts next year, and I will write more about it later.

Also, Uppsala is sometimes quite beautiful:

The next notebook of work

Dear diary,

The last post was about my attempt to use the Getting Things Done method to bring some more order to research, work, and everything. This post will contain some more details about my system, at a little less than a year into the process, on the off chance that anyone wants to know. This post will use some Getting Things Done jargon without explaining it. There are many useful guides online, plus of course the book itself.

Medium

Most of my system lives in paper notebooks. The main notebook contains my action list, projects list, waiting for list and agendas plus a section for notes. I quickly learned that the someday/maybe lists won’t fit, so I now have a separate (bigger) notebook for those. My calendar is digital. I also use a note taking app for project support material, and as an extra inbox for notes I jot down on my phone. Thus, I guess it’s a paper/digital hybrid.

Contexts

I have five contexts: email/messaging, work computer, writing, office and home. There were more in the beginning, but I gradually took out the ones I didn’t use. They need to be few enough and map cleanly to situations, so that I remember to look at them. I added the writing context because I tend to treat, and schedule, writing tasks separately from other work tasks. The writing context also includes writing-adjacent support tasks such as updating figures, going through reviewer comments or searching for references.

Inboxes

I have a total of nine inboxes, if you include all the email accounts and messenger services where people might contact me about things I need to do. That sounds excessive, but only three of those are where I put things for myself (physical inbox, notes section of notebook, and notes app), and so far they’re all getting checked regularly.

Capture

I do most of my capture in the notes app on my phone (when not at a desk) or on a piece of paper (when at my desk). When I get back to having in-person meetings, I assume more notes are going to end up in the physical notebook, because it’s nicer to take meeting notes on paper than on a phone.

Agendas

The biggest thing I changed in the new notebook was to dedicate much more space to agendas, but it’s already almost full! It turns out there are lots of things ”I should talk to X about the next time we’re speaking”, rather than sending X an email immediately. Who knew?

Waiting for

This is probably my favourite. It is useful to have a list of the people who have said they will get back to me, when, and about what. That little date next to their name helps me not feel like a nag when I ask them again after a reasonable time, and makes me appreciate them more when they respond quickly.

Weekly review

I already had the habit of scheduling an appointment with myself on Fridays (or otherwise towards the end of the week) to go over some recurring items. I’ve expanded this appointment to do a weekly review of the notebook, calendar, someday/maybe list, and some other bespoke checklist items. I bribe myself with sweets to support this habit.

Things I’d like to improve

Here are some of the things I want to improve:

  • The project list. A project sensu Getting Things Done can be anything from purchasing new shoes to taking over the world. The project list is supposed to keep track of what you’ve undertaken to do, and make sure you have come up with actions that move each of them forward. My project list isn’t very complete, and doesn’t spark new actions very often.
  • Project backlogs. On the other hand, I have some things on the project list that are projects in a greater sense, and will have literally thousands of actions, both from me and others. These obviously need planning ahead beyond the next thing to do. I haven’t yet figured out the best way to keep a backlog of future things to do in a project, potentially with dependencies, and feed them into my list of things to do when they become current.
  • Notes. I have a strong note taking habit, but a weak note reading habit. Essentially, many of my notes are write-only; this feels like a waste. I’ve started my attempts to improve the situation with meeting notes: trying to take five minutes right after a meeting (if possible) to go over the notes, extract any calendar items, actions and waiting-fors, and decide whether I need to save the note or if I can throw it away. What to do about research notes from reading and from seminars is another matter.

One notebook’s worth of work

Image: an Aviagen sponsored notebook from the 100 Years of Genetics meeting in Edinburgh, with post-its sticking out, next to a blue Ballograf pen

Dear diary,

”If I could just spend more time doing stuff instead of worrying about it …” (Me, at several points over the years.)

I started this notebook in spring last year and recently filled it up. It contains my first implementation of the system called ”Getting Things Done” (see the book by David Allen with the same name). Let me tell you a little about how it’s going.

The way I organised my work, with to-do lists, calendar, work journal, and routines for dealing with email, had pretty much grown organically up until the beginning of this year. I’d gotten some advice, I’d read the odd blog post and column about email and calendar blocking, but beyond some courses in project management (which are a topic for another day), I’d gotten myself very little instruction on how to do any of this. How does one actually keep a good to-do list? Are there principles and best practices? I was aware that Getting Things Done was a thing, and last spring, a mention in passing on the Teaching in Higher Ed podcast prompted me to give it a try.

I read up a little. The book was right there in the university library, unsurprisingly. I also used a blog post by Alberto Taiuti about doing Getting Things Done in a notebook, and read some other writing by researchers about how they use the method (Robert Talbert and Veronika Cheplygina).

There is enough out there about this already that I won’t make my own attempt to explain the method in full, but here are some of the interesting particulars:

You are supposed to be careful about how you organise your to-do lists. You’re supposed to make sure everything on the list is a clear, unambiguous next action that you can start doing when you see it. Everything else that needs thinking, deciding, mulling over, reflecting etc, goes somewhere else, not on your list of things to do. This means that you can easily pick something off your list and start work on it.

You are supposed to be careful about your calendar. You’re supposed to only put things in there that have a fixed date and time attached, not random reminders or aspirational scheduling of things you would like to do. This means that you can easily look at your calendar and know what your day, week and month look like.

You are supposed to be careful to record everything you think about that matters. You’re supposed to take a note as soon as you have a potentially important thought and put it in a dedicated place that you will check and go through regularly. This means that you don’t have to keep things in your head.

This sounds pretty straightforward, doesn’t it? Well, despite having to-do lists, calendars and a habit of note-taking for years, I’ve not been very disciplined about any of this before. My to-do list items have often been vague, too-big tasks that are hard to get started on. My calendar has often contained aspirational planning entries that didn’t survive contact with the realities of the workday. I often delude myself that I’ll remember an idea or a decision, only to have it quietly slip out of my mind.

Have I become more productive, or less stressed? The honest answer is that I don’t know. I don’t have a reliable way to track either productivity or stress levels, and even if I did: the last year has not really been comparable to the year before, for several reasons. However, I feel like thinking more about how I organise my work makes a difference, and I’ve felt a certain joy working on the process, as well as a certain dread when looking at it all organised in one place. Let’s keep going and see where this takes us.

Against question and answer time

Here is a semi-serious suggestion: Let’s do away with questions and answers after talks.

I’ll preface with two examples:

First, a scientist I respect highly had just given a talk. As we were chatting away afterwards, I referred to someone who had asked a question during the talk. The answer: ”I didn’t pay attention. I don’t listen when people talk at me like that.”

Second, Swedish author Göran Hägg had this little joke about question and answer time. I paraphrase from memory: Question time is useless because no reasonable person who has a useful contribution will be socially uninhibited enough to ask a question in a public forum (at least not in Sweden). To phrase it more nicely: Having a useful contribution and feeling comfortable to speak up might not be that well correlated.

I have two intuitions about this. On the one hand, there’s the idea that science thrives on vigorous criticism. I have been at talks where people bounce questions at the speaker, even during the talk and even with pretty serious criticisms, and it works just fine. I presume it has to do with respect, with skill at asking and answering, and with the power and knowledge differentials between the interlocutors.

On the other hand, we would prefer to have a good conversation and productive arguments, and I’m sure everyone has been in seminar rooms where that wasn’t the case. It’s not a good conversation if, say, questions and answers turn into old established guys (sic) shouting down students. In some cases, it seems the asker is not after a productive argument, nor indeed any honest attempt to answer the question. (You might be able to tell by them barking a new question before the respondent has finished.)

Personally, I’ve turned to asking fewer questions. If it’s something I’ve misunderstood, it’s unlikely that I will get the explanation I need without conversation and interaction. If I have a criticism, it’s unlikely that I will get the best possible answer from the speaker on the spot. If I didn’t like the seminar, am upset with the speaker’s advisor, hate it when people mangle the definition of ”epigenetics” or when someone shows a cartoon of left-handed DNA, it’s my problem and not something I need to share with the audience.

I think question and answer time is one of the things that have actually benefitted from the move to digital seminars at a distance, where questions are often written in chat. This might be because of a difference in tone between writing a question down and asking it verbally, or thanks to the filtering capabilities of moderators.

My talk at the ChickenStress Genomics and Bioinformatics Workshop

A few months ago I gave a talk at the ChickenStress Genomics and Bioinformatics Workshop about genetic mapping of traits and gene expression.

ChickenStress is a European training network of researchers who study stress in chickens, as you might expect. It brings together people who work with (according to the work package names) environmental factors, early life experiences and genetics. The network is centered on a group of projects by early stage researchers — by the way, I think that’s a really good way to describe the work of a PhD student — and organises activities like this workshop.

I was asked to talk about our work from my PhD on gene expression and behaviour in the chicken (Johnsson & al. 2018, Johnsson & al. 2016), concentrating on concepts and methods rather than results. If I have any recurring readers, they will already know that brief is exactly what I like to do. I talked about the basis of genetic mapping of traits and gene expression, what data one needs to do it, and gave a quick demo for a flavour of an analysis workflow (linear mixed model genome-wide association in GEMMA).
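
To give a rough idea of what the GEMMA part of that workflow looks like (this is not the actual demo, just a minimal sketch assuming genotypes and phenotypes in PLINK binary format with the made-up prefix ”data”):

# Estimate a centred relatedness matrix from the genotypes
gemma -bfile data -gk 1 -o relatedness

# Univariate linear mixed model association (Wald tests), using that matrix
gemma -bfile data -k output/relatedness.cXX.txt -lmm 1 -o gwas

# The association results end up in output/gwas.assoc.txt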

Here are slides, and the git repository of the demo:

Journal club of one: ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes”

Genome assembly researchers are still figuring out new wild ways of combining different kinds of data. For example, ”trio binning” took what used to be a problem — the genetic difference between the two genome copies that a diploid individual carries — and turned it into a feature: if you assemble a hybrid individual with genetically distant parents, you can separate the two copies and get two genomes in one. (I said that admixture was the future of every part of genetics, didn’t I?) This paper (Campoy et al. 2020) describes ”gamete binning”, which uses sequencing of gametes to perform a similar trick.

Expressed another way, gamete binning means building an individual-specific genetic map and then using it to order and phase the pieces of the assembly. This means taking two sequence datasets from the same individual — one single cell short read dataset from gametes (10X linked reads) and one long read dataset from the parent (PacBio) — and creatively re-using them in different ways.

This is what they do:

1. Assemble the long reads into a preliminary assembly, which will be a mosaic of the two genome copies (barring gross differences, ”haplotigs”, that can to some extent be removed).

2. Align the single cell short reads to the preliminary assembly and call SNPs. (They also did some tricks to deal with regions without SNPs, separating those that were not variable between genomes and those that were deleted in one genome.) Because the gametes are haploid, they get the phase of the parent’s genotype.

3. Align the long reads again. Now, based on the phased genotype, the long reads can be assigned to the genome copy they belong to. So they can partition the reads into one bin per genome copy and chromosome.

4. Assemble those bins separately. They now get one assembly for each homologous chromosome.
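
This is not the authors’ actual pipeline, but to make the binning part (steps 3 and 4) more concrete, here is a rough sketch with generic tools (minimap2, samtools, whatshap and Flye), assuming we already have the preliminary assembly, phased SNPs derived from the gamete data, and the parent’s long reads, and skipping the per-chromosome split:

# Made-up file names: asm.fa (preliminary assembly, step 1),
# phased.vcf.gz (SNPs phased from the gamete data, step 2),
# reads.fq.gz (the parent's PacBio reads)

# Align the long reads back to the preliminary assembly
minimap2 -ax map-pb asm.fa reads.fq.gz | samtools sort -o aln.bam
samtools index aln.bam

# Tag each read with the haplotype its alleles match, then split into two bins
whatshap haplotag --reference asm.fa -o tagged.bam \
    --output-haplotag-list read_bins.tsv phased.vcf.gz aln.bam
whatshap split --output-h1 reads_hap1.fq.gz --output-h2 reads_hap2.fq.gz \
    reads.fq.gz read_bins.tsv

# Assemble each bin separately (step 4)
flye --pacbio-raw reads_hap1.fq.gz --out-dir assembly_hap1 --threads 16
flye --pacbio-raw reads_hap2.fq.gz --out-dir assembly_hap2 --threads 16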

They apply it to an apricot tree, which has a 250 Mbp genome. Judging by a comparison to sequence data from the parents of the tree, the method seems to separate the two genome copies well. The two genome copies have quite a bit of structural variation:

Despite high levels of synteny, the two assemblies revealed large-scale rearrangements (23 inversions, 1,132 translocation/transpositions and 2,477 distal duplications) between the haplotypes making up more than 15% of the assembled sequence (38.3 and 46.2 Mb in each of assemblies; Supplementary Table 1). /…/ Mirroring the huge differences in the sequences, we found the vast amount of 942 and 865 expressed, haplotype-specific genes in each of the haplotypes (Methods; Supplementary Tables 2-3).

They can then go back to the single cell data and look at the recombination landscape and at chromosomal arrangements during meiosis.

This is pretty elegant. I wonder how dependent it is on the level of variation within the individual, and how it compares in cost and finickiness to other assembly strategies.

Literature

Campoy, José A., et al. ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes.” bioRxiv (2020).

What is a locus, anyway?

”Locus” is one of those confusing genetics terms (its meaning, not just its pronunciation). We can probably all agree with a dictionary and with Wikipedia that it means a place in the genome, but a place of what and in what sense? We also use place-related words like ”site” and ”region” that one might think were synonymous, but don’t seem to be.

For an example, we can look at this relatively recent preprint (Chebib & Guillaume 2020) about a model of the causes of genetic correlation. They have pairs of linked loci that each affect one trait (that’s the tight linkage condition), and also a set of loci that affect both traits (the pleiotropic condition), correlated Gaussian stabilising selection, and different levels of mutation, migration and recombination between the linked pairs. A mutation means adding a number to the effect of an allele.

This means that loci in this model can have a large number of alleles with quantitatively different effects. The alleles at a locus share a distribution of mutation effects, that can be either two-dimensional (with pleiotropy) or one-dimensional. They also share a recombination rate with all other loci, which is constant.

What kind of DNA sequences can have these properties? Single nucleotide sites are out of the question, as they can have four, or maybe five alleles if you count a deletion. Larger structural variants, such as inversions or allelic series of indels might work. A protein-coding gene taken as a unit could have a huge number of different alleles, but they would probably have different distributions of mutational effects in different sites, and (relatively small) differences in genetic distance to different sites.

It seems to me that we’re talking about an abstract group of potential alleles that have sufficiently similar effects and that are sufficiently closely linked. This is fine; I’m not saying this to criticise the model, but to explore how strange a locus really is.

They find that there is less genetic correlation with linkage than with pleiotropy, unless the mutation rate is high, which leads to a discussion about mutation rate. This reasoning about the mutation rate of a locus illustrates the issue:

A high rate of mutation (10−3) allows for multiple mutations in both loci in a tightly linked pair to accumulate and maintain levels of genetic covariance near to that of mutations in a single pleiotropic locus, but empirical estimations of mutation rates from varied species like bacteria and humans suggests that per-nucleotide mutation rates are in the order of 10−8 to 10−9 … If a polygenic locus consists of hundreds or thousands of nucleotides, as in the case of many quantitative trait loci (QTLs), then per-locus mutation rates may be as high as 10−5, but the larger the locus the higher the chance of recombination between within-locus variants that are contributing to genetic correlation. This leads us to believe that with empirically estimated levels of mutation and recombination, strong genetic correlation between traits are more likely to be maintained if there is an underlying pleiotropic architecture affecting them than will be maintained due to tight linkage.

I don’t know if it’s me or the authors who are conceptually confused here. If they are referring to QTL mapping, it is true that the quantitative trait loci that we detect in mapping studies often are huge. ”Thousands of nucleotides” is being generous to mapping studies: in many cases, we’re talking millions of them. But the size of a QTL region from a mapping experiment doesn’t tell us how many nucleotides in it matter to the trait. It reflects our poor resolution in delineating the causative variant or variants that give rise to the association signal. That being said, it might be possible to use tricks like saturation mutagenesis to figure out which mutations within a relevant region could affect a trait. Then, we could actually observe a locus in the above sense.

Another recent theoretical preprint (Chantepie & Chevin 2020) phrases it like this:

[N]ote that the nature of loci is not explicit in this model, but in any case these do not represent single nucleotides or even genes. Rather, they represent large stretches of effectively non-recombining portions of the genome, which may influence the traits by mutation. Since free recombination is also assumed across these loci (consistent with most previous studies), the latter can even be thought of as small chromosomes, for which mutation rates of the order to 10−2 seem reasonable.

Literature

Chebib and Guillaume. ”Pleiotropy or linkage? Their relative contributions to the genetic correlation of quantitative traits and detection by multi-trait GWA studies.” bioRxiv (2019): 656413.

Chantepie and Chevin. ”How does the strength of selection influence genetic correlations?” bioRxiv (2020).

Journal club of one: ”Versatile simulations of admixture and accurate local ancestry inference with mixnmatch and ancestryinfer”

Admixture is the future of every sub-field of genetics, just in case you didn’t know. In both wild and domestic animals, populations or even species sometimes cross. This causes different patterns of relatedness than in well-mixed populations. Often we want to estimate ”local ancestry”, that is: what source population a piece of chromosome in an individual originates from. It is one of those genetics problems that is made harder by the absence of any way to observe it directly.

This recent paper (Schumer et al 2020; preprint version, which I read, here) presents a method for simulating admixed sequence data, and a method for inferring local ancestry from it. It does something I like, namely to pair analysis with fake-data simulation to check methods.

The simulation method is built from four different simulators:

1. macs (Chen, Majoram & Wall 2009), which creates polymorphism data under neutral evolution from a given population history. They use macs to generate starting chromosomes from two ancestral populations.

2. Seq-Gen (Rambaut & Grassly 1997). Chromosomes from macs are strings of 0s and 1s representing the state at biallelic markers. If you want DNA-level realism, with base composition, nucleotide substitution models and so on, you need something else. I don’t really follow how they do this. You can tell from the source code that they use the local trees that macs spits out, which Seq-Gen can then simulate nucleotides from. As they put it, the resulting sequence ”lacks other complexities of real genome sequences such as repetitive elements and local variation in base composition”, but it is a step up from ”0000110100”.

3. SELAM (Corbett-Detig & Jones 2016), which simulates admixture between populations with population history and possibly selection. Here, SELAM’s role is to simulate the actual recombination and interbreeding to create the patterns of local ancestry, which they will then fill with the sequences they generated before.

4. wgsim, which simulates short reads from a sequence. At this point, mixnmatch has turned a set of population genetic parameters into fasta files. That is pretty cool.
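
To give a flavour of that last step, a wgsim call looks something like this (file names and parameter values made up for illustration):

# Simulate a million 150 bp read pairs with a 0.1% base error rate;
# -r 0 turns off wgsim's own mutation simulation, since the variation
# is already built into the simulated genomes
wgsim -e 0.001 -N 1000000 -1 150 -2 150 -r 0 admixed_genome.fa reads_1.fq reads_2.fq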

On the one hand, building on tried and true tools seems to be the right thing to do, less wheel-reinventing. It’s great that the phylogenetic simulator Seq-Gen from 1997 can be used in a paper published in 2020. On the other hand, looking at the dependencies for running mixnmatch made me a little pale: seven different bioinformatics or population genetics programs (not including the dependencies you need to compile them), plus R, Perl and Python with Biopython. Computational genetics is an adventure of software installation.

They use the simulator to test the performance of a hidden Markov model for inferring local ancestry (Corbett-Detig & Nielsen 2017) with different population histories and settings, and then apply it to swordtail fish data. In particular, one needs to set thresholds for picking ”ancestry informative” (i.e. sufficiently differentiated) markers between the ancestral populations, and that depends on population history and diversity.

In passing, they estimate the swordtail recombination landscape:

We used the locations of observed ancestry transitions in 139 F2 hybrids that we generated between X. birchmanni and X. malinche … to estimate the recombination rate in 5 Mb windows. … We compared inferred recombination rates in this F2 map to a linkage disequilibrium based recombination map for X. birchmanni that we had previously generated (Schumer et al., 2018). As expected, we observed a strong correlation in estimated recombination rate between the linkage disequilibrium based and crossover maps (R=0.82, Figure 4, Supporting Information 8). Simulations suggest that the observed correlation is consistent with the two recombination maps being indistinguishable, given the low resolution of the F2 map (Supporting Information 8).