Journal club of one: ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes”

Genome assembly researchers are still figuring out new wild ways of combining different kinds of data. For example, ”trio binning” took what used to be a problem, — the genetic difference between the two genome copies that a diploid individual carries — and turned it into a feature: if you assemble a hybrid individual with genetically distant parents, you can separate the two copies and get two genomes in one. (I said that admixture was the future of every part of genetics, didn’t I?) This paper (Campoy et al. 2020) describes ”gamete binning” which uses sequencing of gametes perform a similar trick.

Expressed another way, gamete binning means building an individual-specific genetic map and then using it to order and phase the pieces of the assembly. This means two sequence datasets from the same individual — one single cell short read dataset from gametes (10X linked reads) and one long read dataset from the parent (PacBio) — and creatively re-using them in different ways.

This is what they do:

1. Assemble the long reads into a preliminary assembly, which will be a mosaic of the two genome copies (barring gross differences, ”haplotigs”, that can to some extent be removed).

2. Align the single cell short reads to the preliminary assembly and call SNPs. (They also did some tricks to deal with regions without SNPs, separating those that were not variable between genomes and those that were deleted in one genome.) Because the gametes are haploid, they get the phase of the parent’s genotype.

3. Align the long reads again. Now, based on the phased genotype, the long reads can be assigned to the genome copy they belong to. So they can partition the reads into one bin per genome copy and chromosome.

4. Assemble those bins separately. They now get one assembly for each homologous chromosome.

They apply it to an apricot tree, which has a 250 Mbp genome. When they sequence the parents of the tree, it seems to separate the genomes well. The two genome copies have quite a bit of structural variation:

Despite high levels of synteny, the two assemblies revealed large-scale rearrangements (23 inversions, 1,132 translocation/transpositions and 2,477 distal duplications) between the haplotypes making up more than 15% of the assembled sequence (38.3 and 46.2 Mb in each of assemblies; Supplementary Table 1). /…/ Mirroring the huge differences in the sequences, we found the vast amount of 942 and 865 expressed, haplotype-specific genes in each of the haplotypes (Methods; Supplementary Tables 2-3).

They can then go back to the single cell data and look at the recombination landscape and at chromosomal arrangements during meiosis.

This is pretty elegant. I wonder how dependent it is on the level of variation within the individual, and how it compares in cost and finickiness to other assembly strategies.

Literature

Campoy, José A., et al. ”Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes.” BioRxiv (2020).

Sequencing-based methods called Dart

Some years ago James Hadfield at Enseqlopedia made a spreadsheet of acronyms for sequencing-based methods with some 50 rows. I can only imagine how long it would be today.

The overloading of acronyms is becoming a bit ridiculous. I recently saw a paper about DART-seq, a method for detecting N6-methyladenosine in RNA (Meyer 2019), and thought, ”wait a minute, isn’t DART-seq a reduced representation genotyping method?” It is, only stylised as DArTseq (seriously). Apparently, it’s also a droplet RNA-sequencing method (Saikia et al. 2018).

What are these methods doing?

  • DArT, diversity array technology, is a way to enrich for a part of a genome. It was originally developed with array technology in mind (Jaccoud et al. 2001). They take some DNA, cut it with restriction enzymes, add adapters and amplify regions close to the cut. Then they clone the resulting DNA, and then attach it to a slide, and that gives a custom microarray of anonymous fragments from the genome. For the Dart-seq version, it seems they make a sequencing library instead of going on to cloning (Ren et al. 2015). It falls in the same family as GBS and RAD-seq methods.
  • DART-seq, droplet-assisted RNA targeting, builds on Drop-seq, where they put single cells and beads that carry primers into the same oil droplet. As cells lyse, the RNA sticks to the primer. The beads also have a barcode so they can be identified in sequencing. Then they break the emulsion, reverse transcribe the RNA attached to beads, amplify and sequence. That is cool. However, because they capture the RNA with oligo-dT primers, they sequence from the 3′ end of the RNA. The Dart method adds primers to the beads, so they can target some specific RNAs and amplify more of them. It’s the super-high-tech version of gene-specific primers for reverse transcription..
  • DART-seq, deamination adjacent to RNA modification targets, uses a synthetic fusion protein that combines APOBEC1, which deaminates cytidines, with a protein domain from YTHDF2 which binds N6-methyladenosine. If an RNA has N6-methyladenosine, cytidines that are close to it, as is usually the case with this base modification, will be deaminated to uracil. After RNA-sequencing, this will look like Cs next to As turning into Ts. Neat! It’s a little bit like bisulfite sequencing of methylated DNA, but with RNA.

On the one hand: Don’t people search the internet before they name their methods, or do they not care? On the other hand, realistically, the genotyping method Dart and the single cell RNA-seq method Dart are unlikely to show up in the same work. If you can call your groups ”treatment” and ”control” for the purpose of a paper, maybe you can call your method ”Dart”, and no-one gets too confused.