5 Phasing - coopermkr/sdepressaAssembly GitHub Wiki

In the case of Salicornia depressa, the reference we are assembling is allotetraploid. The plant primarily reproduces by selfing, so most of the heterozygosity we observe in the species comes from the two diverged genomic lineages rather than from variation within a lineage. We decided to create a consensus sequence for each of the two lineages and present them as diverged chromosome pairs. We can do that with the program SubPhaser, which uses k-mer counting to detect repetitive elements, such as LINEs and SINEs, that are specific to each ancestral genome: https://github.com/zhangrengang/SubPhaser

First, SubPhaser needs a config file listing which chromosomes we think are paired together, so that it knows which chromosomes to compare when searching for differential k-mers. To work out the pairs, we can make synteny plots and visually inspect which chromosomes align to one another. Fortunately, our assembly is not too diverged from Salicornia ramosissima, which has a reference genome we can align against. We start by creating an alignment file of our draft scaffolds against the S. ramosissima reference:

minimap
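The alignment step can be sketched roughly as follows, assuming minimap2 is installed and on the PATH. The file names and thread count are placeholders rather than the paths used in this project, and the `asm10` preset is an assumption for a cross-species assembly-to-reference alignment; a different preset may suit the actual divergence better.

```shell
#!/bin/sh
# Hypothetical file names: substitute the real reference and draft scaffold paths.
REF="ramosissima.reference.fasta"
QUERY="draft.scaffolds.fasta"
OUT="scaffolds_vs_ramosissima.paf"
THREADS=8

align_to_reference() {
    # minimap2 writes PAF to stdout by default.
    # -x asm10 is an assumed preset for assembly-to-assembly alignment.
    minimap2 -x asm10 -t "$THREADS" "$REF" "$QUERY" > "$OUT"
}

# Only run when minimap2 and both inputs are actually present.
if command -v minimap2 >/dev/null 2>&1 && [ -f "$REF" ] && [ -f "$QUERY" ]; then
    align_to_reference
else
    echo "minimap2 or input files not found; skipping alignment"
fi
```

The resulting PAF file is the input for the dotplot step.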

Then we can create synteny dotplots using the R package pafr:

pafr
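As a text complement to the dotplots (not a replacement for pafr, and not necessarily part of the original workflow), the same PAF file can be summarised with awk: for each draft scaffold, total the aligned bases against each reference chromosome. Two scaffolds whose largest total falls on the same reference chromosome are candidates for a pair. The function name and the demo records below are made up for illustration.

```shell
#!/bin/sh
# Summarise a PAF file: total aligned bases per (query scaffold, reference chromosome).
# PAF columns used: 1 = query name, 6 = target name, 11 = alignment block length.
summarise_paf() {
    awk -F '\t' '
        { aligned[$1 "\t" $6] += $11 }
        END { for (pair in aligned) print pair "\t" aligned[pair] }
    ' "$1" | sort -t "$(printf '\t')" -k1,1 -k3,3nr
}

# Tiny made-up PAF records, used only to demonstrate the function.
DEMO_PAF=$(mktemp)
printf 'scaf1\t1000\t0\t900\t+\tchrA\t5000\t100\t1000\t850\t900\t60\n' > "$DEMO_PAF"
printf 'scaf1\t1000\t0\t100\t+\tchrB\t5000\t0\t100\t90\t100\t60\n' >> "$DEMO_PAF"
printf 'scaf2\t1000\t0\t800\t-\tchrA\t5000\t2000\t2800\t700\t800\t60\n' >> "$DEMO_PAF"

SUMMARY=$(summarise_paf "$DEMO_PAF")
printf '%s\n' "$SUMMARY"
rm -f "$DEMO_PAF"
```

In the demo, scaf1 and scaf2 both align mostly to chrA, which is the pattern to look for when choosing pairs. Short or repetitive alignments can inflate the totals, so the dotplots remain the main evidence.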

Now with the information from pafr, we can create our config file, which looks something like this:

config
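For illustration only, a config of this kind can be written from the shell as below. The layout assumed here is one suspected chromosome pair per line, with whitespace-separated sequence IDs that match the FASTA headers; check the SubPhaser README for the exact format. The scaffold names and the file name `example.config` are hypothetical and are not the pairs found for S. depressa.

```shell
#!/bin/sh
# Write a hypothetical SubPhaser-style config: one suspected chromosome pair per line.
# The scaffold IDs are placeholders and must be replaced with real FASTA header IDs.
CONFIG_DIR=$(mktemp -d)
CONFIG="$CONFIG_DIR/example.config"

cat > "$CONFIG" <<'EOF'
scaffold_1	scaffold_2
scaffold_3	scaffold_4
scaffold_5	scaffold_6
EOF

# Sanity check: every line should name exactly two sequences in a tetraploid.
BAD_LINES=$(awk 'NF != 2 { n++ } END { print n + 0 }' "$CONFIG")
N_PAIRS=$(awk 'END { print NR }' "$CONFIG")
echo "pairs: $N_PAIRS, malformed lines: $BAD_LINES"
rm -rf "$CONFIG_DIR"
```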

Then we can call SubPhaser to phase the ancestral genomes and create circos plots. In this case, I found that setting the k-mer size (-k) to 13 yields enough differential k-mers to distinguish the two genomes:

module load miniconda/22.11.1-1
conda activate SubPhaser

subphaser -k 13 -i ../../4.scaffolding/tetra/n18/04.build/tetra.scaff.top18.fasta -c final.config
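Before a long run, it can help to confirm that every ID in the config actually appears as a sequence header in the FASTA. The check below is a minimal sketch of that idea and is not part of SubPhaser; the function name and the demo files are made up for illustration.

```shell
#!/bin/sh
# check_config FASTA CONFIG: print config IDs that have no matching FASTA header.
check_config() {
    awk '
        NR == FNR {                       # first file: the FASTA
            if (/^>/) seen[substr($1, 2)] = 1   # header ID without the leading ">"
            next
        }
        { for (i = 1; i <= NF; i++) if (!($i in seen)) print $i }
    ' "$1" "$2"
}

# Tiny made-up inputs, used only to demonstrate the check.
DEMO_DIR=$(mktemp -d)
printf '>scaffold_1 demo\nACGT\n>scaffold_2\nACGT\n' > "$DEMO_DIR/demo.fasta"
printf 'scaffold_1\tscaffold_2\nscaffold_3\tscaffold_4\n' > "$DEMO_DIR/demo.config"

MISSING=$(check_config "$DEMO_DIR/demo.fasta" "$DEMO_DIR/demo.config")
printf 'IDs missing from FASTA:\n%s\n' "$MISSING"
rm -rf "$DEMO_DIR"
```

With the real files, the call would be `check_config ../../4.scaffolding/tetra/n18/04.build/tetra.scaff.top18.fasta final.config`; empty output means every configured ID was found.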