genome assembly | On unicorns and genes

Genome assembly researchers are still figuring out wild new ways of combining different kinds of data. For example, “trio binning” took what used to be a problem (the genetic difference between the two genome copies that a diploid individual carries) and turned it into a feature: if you assemble a hybrid individual with genetically distant parents, you can separate the two copies and get two genomes in one. (I said that admixture was the future of every part of genetics, didn’t I?) This paper (Campoy et al. 2020) describes “gamete binning”, which uses sequencing of gametes to perform a similar trick.

Expressed another way, gamete binning means building an individual-specific genetic map and then using it to order and phase the pieces of the assembly. It takes two sequence datasets from the same individual, one single cell short read dataset from gametes (10X linked reads) and one long read dataset from the parent (PacBio), and creatively re-uses them in different ways.
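To make the genetic map idea concrete, here is a minimal sketch of the basic quantity behind any such map: the recombination fraction between two markers, estimated from haploid gamete genotypes. This is my own toy illustration, not the paper's code; the function name, the 0/1/None encoding and the example data are all assumptions.

```python
def recombination_fraction(marker_x, marker_y):
    """Estimate the recombination fraction between two markers.

    marker_x, marker_y: one entry per gamete, 0 or 1 for the allele
    observed, None where the gamete has no coverage (hypothetical encoding).
    Returns None if no gamete is covered at both markers.
    """
    pairs = [(x, y) for x, y in zip(marker_x, marker_y)
             if x is not None and y is not None]
    if not pairs:
        return None
    mismatch = sum(x != y for x, y in pairs) / len(pairs)
    # The phase between the markers is unknown, so the rarer pattern
    # is taken to be the recombinant one.
    return min(mismatch, 1 - mismatch)


# Made-up genotypes for eight gametes at two markers
marker_a = [0, 0, 1, 1, 0, 1, None, 1]
marker_b = [0, 0, 1, 1, 1, 1, 0, 1]
rf = recombination_fraction(marker_a, marker_b)  # one recombinant among seven informative gametes
```

Markers (or the contigs that carry them) with a recombination fraction near zero sit close together on the same chromosome, while values near 0.5 suggest that they are unlinked. That is the information one needs to group and order the pieces of an assembly.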

This is what they do:

1. Assemble the long reads into a preliminary assembly, which will be a mosaic of the two genome copies (barring regions so different that they assemble separately, “haplotigs”, which can to some extent be removed).

2. Align the single cell short reads to the preliminary assembly and call SNPs. (They also did some tricks to deal with regions without SNPs, separating those that were not variable between genomes and those that were deleted in one genome.) Because each gamete is haploid, the alleles it carries at nearby SNPs sit on the same parental chromosome copy, so the gametes reveal the phase of the parent’s genotype.

3. Align the long reads again. Now, based on the phased genotype, the long reads can be assigned to the genome copy they belong to. So they can partition the reads into one bin per genome copy and chromosome.

4. Assemble those bins separately. They now get one assembly for each homologous chromosome.
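Steps 2 and 3 can be sketched in a few lines of Python. This is a toy version under strong assumptions, not the authors' pipeline: SNPs are coded 0/1 with None for missing, phasing only compares neighbouring SNPs by majority vote (real single cell data is far too sparse for that to be enough), and the function names `phase_snps` and `assign_read` are made up for the example.

```python
from collections import Counter


def phase_snps(gametes):
    """Phase SNPs along one contig from haploid gamete genotypes.

    gametes: list of lists, one per gamete, holding 0, 1 or None
    (no coverage) for each SNP. Returns, for each SNP, the allele
    placed on haplotype "A"; haplotype "B" carries the other allele.
    """
    n_snps = len(gametes[0])
    hap_a = [0]  # arbitrary choice: allele 0 of the first SNP goes on haplotype A
    for j in range(1, n_snps):
        same = diff = 0
        for g in gametes:
            if g[j - 1] is None or g[j] is None:
                continue
            if g[j - 1] == g[j]:
                same += 1
            else:
                diff += 1
        # Crossovers between neighbouring SNPs are rare, so the majority
        # pattern across gametes is taken as the phase.
        prev = hap_a[-1]
        hap_a.append(prev if same >= diff else 1 - prev)
    return hap_a


def assign_read(read_alleles, hap_a):
    """Assign a long read to haplotype 'A' or 'B' by voting over its SNPs.

    read_alleles: dict mapping SNP index to the allele seen in the read.
    Returns None when the read is uninformative or the vote is tied.
    """
    votes = Counter()
    for j, allele in read_alleles.items():
        votes["A" if allele == hap_a[j] else "B"] += 1
    if votes["A"] == votes["B"]:
        return None
    return "A" if votes["A"] > votes["B"] else "B"


# Made-up data: four gametes, four SNPs
gametes = [
    [0, 1, 1, None],
    [1, 0, 0, 1],
    [0, 1, None, 0],
    [1, 0, 0, 1],
]
hap_a = phase_snps(gametes)
read_bin = assign_read({2: 0, 3: 1}, hap_a)
```

Reads that end up as None carry no phased SNPs or conflicting ones. A real pipeline has to decide what to do with them, for example in the regions without variation between the two genome copies mentioned in step 2.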

They apply it to an apricot tree, which has a 250 Mbp genome. When they check the result against sequence data from the tree’s parents, the method seems to separate the two genomes well. The two genome copies have quite a bit of structural variation:

Despite high levels of synteny, the two assemblies revealed large-scale rearrangements (23 inversions, 1,132 translocation/transpositions and 2,477 distal duplications) between the haplotypes making up more than 15% of the assembled sequence (38.3 and 46.2 Mb in each of assemblies; Supplementary Table 1). /…/ Mirroring the huge differences in the sequences, we found the vast amount of 942 and 865 expressed, haplotype-specific genes in each of the haplotypes (Methods; Supplementary Tables 2-3).

They can then go back to the single cell data and look at the recombination landscape and at chromosomal arrangements during meiosis.
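As a rough illustration of how the recombination landscape falls out of the same data: once the SNPs are phased, each gamete can be read as a string of parental haplotype labels, and a crossover shows up as a switch from one label to the other. The sketch below assumes error-free genotypes and uses made-up names and positions. Real data would need some smoothing so that single genotyping errors are not counted as double crossovers.

```python
def crossover_intervals(positions, gamete, hap_a):
    """Find intervals where a gamete switches parental haplotype.

    positions: SNP coordinates along one chromosome.
    gamete: alleles (0, 1 or None for no coverage) seen in one gamete.
    hap_a: the allele carried by haplotype "A" at each SNP.
    Returns a list of (left_position, right_position) tuples, each
    bracketing one inferred crossover.
    """
    observed = [
        (pos, "A" if allele == a_allele else "B")
        for pos, allele, a_allele in zip(positions, gamete, hap_a)
        if allele is not None
    ]
    return [
        (left_pos, right_pos)
        for (left_pos, left_hap), (right_pos, right_hap) in zip(observed, observed[1:])
        if left_hap != right_hap
    ]


# Made-up example: one gamete that starts on haplotype A and ends on B
positions = [100, 500, 900, 1500, 2200]
hap_a = [0, 1, 1, 0, 1]
gamete = [0, 1, None, 1, 0]
intervals = crossover_intervals(positions, gamete, hap_a)
```

Counting such intervals in windows along each chromosome, over all gametes, gives a crude recombination landscape.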

This is pretty elegant. I wonder how dependent it is on the level of variation within the individual, and how it compares in cost and finickiness to other assembly strategies.

Literature

Campoy, José A., et al. “Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes.” bioRxiv (2020).

Martin Johnsson's blog about genetics and sundry things

Journal club of one: “Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes”