4_Scaffolding - bennestor/hakea

Scaffold contigs into pseudomolecules

I wasn’t able to get Hi-C data for H. prostrata due to its high molecular weight, so I used RagTag to scaffold the contigs against the chromosome sequences of Telopea speciosissima, a recently assembled member of the Proteaceae.

RagTag to scaffold H. prostrata contigs based on T. speciosissima
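The scaffolding command itself is not recorded on this page. Below is a minimal sketch of a typical RagTag call, not the command actually used. It assumes the reference is `telopea_genome.fa` (the name used in the renaming step), that the contigs sit in a hypothetical `hakea_contigs.fa`, and that the output goes to `1_ragtag_out` (the directory name that appears in the LASTZ paths). The thread count is arbitrary. The function only prints the command so it can be checked first.

```shell
# Hypothetical input names; adjust to the real files.
ref=telopea_genome.fa        # reference chromosomes (T. speciosissima)
query=hakea_contigs.fa       # assumed name for the H. prostrata contigs
outdir=1_ragtag_out          # assumed output directory

# Print the RagTag scaffold command rather than running it (dry run).
ragtag_cmd() {
    printf 'ragtag.py scaffold -t 8 -o %s %s %s\n' "$outdir" "$ref" "$query"
}

ragtag_cmd
# To execute: ragtag_cmd | sh
```

RagTag writes the scaffolds to `ragtag.scaffold.fasta` inside the output directory, which is the file summarised below.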

   #List the 20 longest sequences with their lengths
   bioawk -c fastx '{ print $name, length($seq) }' ragtag.scaffold.fasta | sort -k 2,2nr | head -n 20
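If bioawk is not installed, the same listing can be approximated with plain awk. This is a sketch under the assumption that the FASTA uses Unix line endings; the function name `fasta_lengths` is made up for illustration.

```shell
# Print "name<TAB>length" for each FASTA record, longest first.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len   # flush the previous record
               name = substr($1, 2); len = 0; next } # ID = first word without ">"
        { len += length($0) }                        # add up sequence lines
        END  { if (name != "") print name "\t" len }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr
}

# Usage: fasta_lengths ragtag.scaffold.fasta | head -n 20
```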

ragtag.scaffold.fasta:

num_seqs	sum_len	min_len	avg_len	max_len	(unlabelled, possibly N50)	GC (%)
334	712,397,851	485	2,132,927.7	80,311,853	65,037,154	38.81

Scaffolding statistics:

placed_sequences	placed_bp	unplaced_sequences	unplaced_bp
873	665843562	312	46469189
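The two tables can be cross-checked with a little arithmetic. The sketch below assumes that every unplaced contig is written out as its own sequence and that the difference between `sum_len` and the contig total is gap padding. The variable names are illustrative.

```shell
placed_seqs=873;   placed_bp=665843562
unplaced_seqs=312; unplaced_bp=46469189
num_seqs=334;      sum_len=712397851

contig_bp=$((placed_bp + unplaced_bp))     # bases contributed by the input contigs
gap_bp=$((sum_len - contig_bp))            # bases added as gap padding
scaffolds=$((num_seqs - unplaced_seqs))    # sequences built from placed contigs
joins=$((placed_seqs - scaffolds))         # contig-to-contig joins

echo "contig bp: $contig_bp"
echo "gap bp: $gap_bp over $joins joins in $scaffolds scaffolds"
awk -v p="$placed_bp" -v t="$contig_bp" 'BEGIN { printf "placed: %.1f%%\n", 100 * p / t }'
```

With these numbers the padding comes to 85,100 bp over 851 joins, i.e. 100 bp per join, and about 93.5% of the contig bases are placed.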

Align Hakea and Telopea pseudomolecules

This analysis was done using the genome_v1 sequence of Hakea prostrata.

Conda installations

   #minimap2 (v2.22-r1101)
   #LASTZ v1.02

Rename chromosomes in Hakea and Telopea

   #Rename telopea chromosomes to chr1-11 
   cat telopea_genome.fa | seqkit replace -p '(CM.+)' -r '{kv}' -k alias.txt -U > telopea_chr.fa 
   
   #Rename hakea chromosomes (ragtag.scaffold.fasta renamed to hakea_ragtag_v2.fa) 
   cat hakea_ragtag_v2.fa | seqkit replace -p '(CM.+)' -r '{kv}' -k ../alias.txt -U > hakea_chr_v2.fa
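The `-k` file used by `seqkit replace` is a two-column, tab-delimited table: the text captured by the pattern, then the new name. With `-U`, headers without a match are left unchanged. The contents of `alias.txt` are not shown here, so the sketch below uses made-up accessions and a plain-awk stand-in for the same lookup. It is simplified to match the whole header rather than a regex capture.

```shell
# Hypothetical alias file: <original header><TAB><new name>
printf 'CM000001.1\tchr1\nCM000002.1\tchr2\n' > alias_example.txt

# Rename headers found in the alias file; leave all others untouched.
rename_headers() {   # usage: rename_headers alias.txt in.fa > out.fa
    awk -F '\t' '
        NR == FNR { alias[$1] = $2; next }     # first file: load key -> value
        /^>/ { key = substr($0, 2)             # header text without ">"
               if (key in alias) { print ">" alias[key]; next } }
        { print }                              # sequence lines and unmatched headers
    ' "$1" "$2"
}
```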

Extract pseudomolecules of Hakea and Telopea

   #extract telopea chromosomes 
   samtools faidx telopea_chr.fa chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 > telopea_justchr.fa
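Only the Telopea command is shown above; the Hakea pseudomolecules were presumably extracted the same way. The sketch below assumes the Hakea output is called `hakea_justchr_v2.fa`, a name inferred from the `hakea_justchr_v2_split` directory used in the LASTZ step. The helper names are made up.

```shell
# Build the list "chr1 chr2 ... chrN".
chrom_list() {
    i=1; out=""
    while [ "$i" -le "$1" ]; do out="$out chr$i"; i=$((i + 1)); done
    echo "${out# }"
}

# Extract the first N pseudomolecules from a FASTA file (needs samtools).
extract_chroms() {   # usage: extract_chroms in.fa out.fa N
    # $(chrom_list ...) is unquoted on purpose so each name is a separate argument
    samtools faidx "$1" $(chrom_list "$3") > "$2"
}

# e.g. extract_chroms hakea_chr_v2.fa hakea_justchr_v2.fa 11
```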

Split into separate fasta files

   #Split Telopea pseudomolecules
   seqkit split telopea_justchr.fa -i 
 
   #rename each fasta file to chr*.fa
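The renaming can be scripted. `seqkit split -i` writes one file per sequence ID into a `.split` directory. The exact file prefix depends on the seqkit version (for example `telopea_justchr.part_chr1.fa`), so the sketch below keeps only the text after the last underscore. The function name is illustrative.

```shell
# Rename seqkit split output (e.g. NAME.part_chr1.fa) to chr1.fa inside a directory.
rename_split() {   # usage: rename_split telopea_justchr.fa.split
    for f in "$1"/*_chr*.fa; do
        [ -e "$f" ] || continue      # skip if the glob matched nothing
        mv "$f" "$1/${f##*_}"        # keep only the text after the last "_"
    done
}
```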

Run LASTZ with pairwise chromosome comparisons

    #Arguments for each pairwise LASTZ run (Telopea target, Hakea query, LAV output), one run per chromosome:
      ../ncbi_genomes/telopea_justchr_split/chr1.fa ../1_ragtag_out/hakea_justchr_v2_split/chr1.fa > chr1.lav
      …
      ../ncbi_genomes/telopea_justchr_split/chr11.fa ../1_ragtag_out/hakea_justchr_v2_split/chr11.fa > chr11.lav
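The per-chromosome runs can be generated with a loop. The sketch below prints one `lastz` command per chromosome pair using the paths listed above. The options passed in are the ones discussed in the note at the end of this page, not necessarily the final settings, and `lastz_cmds` is a made-up helper name.

```shell
tdir=../ncbi_genomes/telopea_justchr_split     # Telopea (target) chromosomes
qdir=../1_ragtag_out/hakea_justchr_v2_split    # Hakea (query) chromosomes

# Print one lastz command per chromosome pair; extra options are passed through.
lastz_cmds() {   # usage: lastz_cmds N [lastz options...]
    n=$1; shift
    opts=$*
    i=1
    while [ "$i" -le "$n" ]; do
        echo "lastz $tdir/chr$i.fa $qdir/chr$i.fa${opts:+ $opts} > chr$i.lav"
        i=$((i + 1))
    done
}

lastz_cmds 11 --chain --identity=90
# To execute: lastz_cmds 11 --chain --identity=90 | sh
```

LAV is LASTZ's default output format, so no `--format` option is needed for the Laj step.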

Visualise alignments using Laj plots

I downloaded Laj to my local computer. Run instructions: https://globin.bx.psu.edu/dist/laj/

   #Running laj on linux
      java -Xmx10G -jar /home/ben/Desktop/IRDS/Genome/lav/laj/laj.jar

Running LASTZ with --chain gives a weird X-shaped plot. Adding --identity=90 removes it, but also removes some segments that bridge gaps in the alignments, which makes insertions appear. Without --chain the plot is very messy, and --identity=90 is needed to see any of the main alignments.
