4_Scaffolding - bennestor/hakea_genome GitHub Wiki

Scaffold contigs into pseudomolecules

I wasn’t able to get Hi-C data for H. prostrata due to its high molecular weight. Instead, I used RagTag to scaffold the contigs against the chromosome sequences of Telopea speciosissima, a recently assembled member of the Proteaceae.

RagTag to scaffold H. prostrata contigs based on T. speciosissima

   #Get sequence lengths 
   cat ragtag.scaffold.fasta | bioawk -c fastx '{ print $name, length($seq) }' | sort -k 2,2nr | head -n 20

ragtag.scaffold.fasta:

num_seqs sum_len min_len avg_len max_len GC (%)
334 712,397,851 485 2,132,927.7 80,311,853 65,037,154 38.81

Scaffolding statistics:

placed_sequences placed_bp unplaced_sequences unplaced_bp
873 665843562 312 46469189
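
The two sets of numbers above can be cross-checked against each other. This is only a sketch, and it rests on two assumptions that are not stated on this page: every unplaced contig is written out as its own sequence, and every join inside a scaffold adds exactly one gap.

```shell
# Values copied from the two tables above
num_seqs=334
sum_len=712397851
placed_sequences=873
placed_bp=665843562
unplaced_sequences=312
unplaced_bp=46469189

# Assumption: each unplaced contig is one output sequence
scaffolds=$(( num_seqs - unplaced_sequences ))
# Assumption: n contigs joined into one scaffold give n-1 gaps
gaps=$(( placed_sequences - scaffolds ))
# Whatever is not contig sequence must be gap padding
gap_bp=$(( sum_len - placed_bp - unplaced_bp ))
bp_per_gap=$(( gap_bp / gaps ))

echo "scaffolds=$scaffolds gaps=$gaps gap_bp=$gap_bp bp_per_gap=$bp_per_gap"
```

With the numbers above this prints `scaffolds=22 gaps=851 gap_bp=85100 bp_per_gap=100`. A flat 100 bp per gap matches what I understand to be RagTag's default padding for gaps of unknown size, but check that against the RagTag version used.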

Align Hakea and Telopea pseudomolecules

This analysis was done using the genome_v1 sequence of Hakea prostrata.

Conda installations

   #minimap2 (v2.22-r1101)
   #LASTZ v1.02

Rename chromosomes in Hakea and Telopea

   #Rename telopea chromosomes to chr1-11 
   cat telopea_genome.fa | seqkit replace -p '(CM.+)' -r '{kv}' -k alias.txt -U > telopea_chr.fa 
   
   #Rename hakea chromosomes (ragtag.scaffold.fasta renamed to hakea_ragtag_v2.fa) 
   cat hakea_ragtag_v2.fa | seqkit replace -p '(CM.+)' -r '{kv}' -k ../alias.txt -U > hakea_chr_v2.fa
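
seqkit replace with -k looks up each captured name in a tab-delimited key/value file, and with -U names that have no match are left as they are, so a malformed alias.txt can go unnoticed. A small sanity check, as a sketch: `check_alias_file` is a hypothetical helper written for this page, not part of seqkit.

```shell
# Hypothetical helper: report lines of an alias file that are not
# exactly "key<TAB>value", and keys that appear more than once.
# Returns non-zero if any problem is found.
check_alias_file() {
    awk -F '\t' '
        NF != 2               { printf "line %d: expected 2 tab-separated fields, got %d\n", NR, NF; bad = 1 }
        NF == 2 && seen[$1]++ { printf "line %d: duplicate key %s\n", NR, $1; bad = 1 }
        END                   { exit bad }
    ' "$1"
}
```

Usage would be `check_alias_file alias.txt` before running the rename commands above.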

Extract pseudomolecules of Hakea and Telopea

   #extract telopea chromosomes 
   samtools faidx telopea_chr.fa chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 > telopea_justchr.fa

Split into separate fasta files

   #Split Telopea pseudomolecules
   seqkit split telopea_justchr.fa -i 
 
   #rename each fasta file to chr*.fa
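
The rename step can be scripted rather than done by hand. A minimal sketch, assuming seqkit split -i wrote files named like `telopea_justchr.id_chr1.fa` into `telopea_justchr.fa.split/`; the infix (`id_` here) may differ between seqkit versions, so it is an argument. `rename_split_files` is a hypothetical helper.

```shell
# Hypothetical helper: rename <prefix>.<infix>chrN.fa to chrN.fa
# $1 = directory holding the split files
# $2 = infix that precedes the sequence ID (assumed default: "id_")
rename_split_files() {
    dir="$1"
    infix="${2:-id_}"
    for f in "$dir"/*."$infix"chr*.fa; do
        [ -e "$f" ] || continue          # glob matched nothing
        base="${f##*."$infix"}"          # e.g. chr1.fa
        mv -- "$f" "$dir/$base"
    done
}
```

For example, `rename_split_files telopea_justchr.fa.split` for the Telopea files; list the directory first to confirm the actual file names.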

Run LASTZ with pairwise chromosome comparisons

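    #Run lastz on each Telopea (target) / Hakea (query) chromosome pair; optional arguments (e.g. --chain, --identity=90) go after the two fasta files
      lastz ../ncbi_genomes/telopea_justchr_split/chr1.fa ../1_ragtag_out/hakea_justchr_v2_split/chr1.fa > chr1.lav
      …
      lastz ../ncbi_genomes/telopea_justchr_split/chr11.fa ../1_ragtag_out/hakea_justchr_v2_split/chr11.fa > chr11.lav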
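
The pairwise runs follow one pattern, so they can be generated in a loop. This is a sketch rather than the exact commands that were run: `lastz_commands` is a hypothetical helper that only prints one command per chromosome pair (a dry run), assuming the chr1 to chr11 naming used above and lastz's `target query [options]` argument order.

```shell
# Hypothetical helper: print one lastz command per chromosome pair.
# $1 = directory with target chrN.fa files (Telopea)
# $2 = directory with query chrN.fa files (Hakea)
# remaining arguments = lastz options, passed through unchanged
lastz_commands() {
    target_dir="$1"
    query_dir="$2"
    shift 2
    opts="$*"
    i=1
    while [ "$i" -le 11 ]; do
        printf 'lastz %s/chr%d.fa %s/chr%d.fa%s > chr%d.lav\n' \
            "$target_dir" "$i" "$query_dir" "$i" "${opts:+ $opts}" "$i"
        i=$(( i + 1 ))
    done
}
```

Calling `lastz_commands ../ncbi_genomes/telopea_justchr_split ../1_ragtag_out/hakea_justchr_v2_split --chain` prints the eleven commands for inspection; piping that output to `sh` would run them.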

Visualise alignments using Laj plots

I downloaded Laj to my local computer. Run instructions are at https://globin.bx.psu.edu/dist/laj/

   #Running laj on linux
      java -Xmx10G -jar /home/ben/Desktop/IRDS/Genome/lav/laj/laj.jar 

Running LASTZ with --chain produces a weird X-shaped plot. Adding --identity=90 removes it, but also removes some segments that bridge gaps in the alignments, which makes insertions appear. Without --chain the plot is very messy, and --identity=90 is needed to see any of the main alignments.
