3_Polishing_v2 - bennestor/hakea_genome GitHub Wiki

Polishing genome_v2

Inaccuracies in the polishing of the Hakea genome made PHT1 gene prediction difficult. I decided to repolish the genome and redo annotation to make identification of nitrate transporters easier. The new polishing method was based on recent Proteaceae assembly papers for Telopea speciosissima (Chen et al. 2022 https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13574) and Knightia excelsa (McCartney et al. 2021 https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13406).

1. Polish necat.fasta (renamed polished_contigs.fasta) with Racon

   #Map nanopore reads to necat.fasta with Minimap2 (v2.24-r1122) 
   minimap2 -x map-ont -t 16 ../necat.fasta ../../0_raw_reads/nanopore/nanopore.fastq > 1_mapping.paf
   
   #Run Racon: 1_racon.sh on zeus highmemq took 16h
   #longest contig length = 11707852 and genome length = 768992702 so I used --split 11707852 and 100 coverage
   racon_wrapper -m 8 -x -6 -g -8 -w 500 -t 16 --split 11707852 ../../0_raw_reads/nanopore/nanopore.fastq ./1_mapping.paf ../necat.fasta > ./ 1_necat_racon_100.fa
    #I also tried running 4 iterations of Racon with --subsample 768992702 50 for 50% coverage and without –split, but the BUSCO complete score was only 81.7%

BUSCO Embryophyta score: C:98.0%[S:90.3%,D:7.7%],F:1.2%,M:0.8%,n:1614



2. Polish necat.fasta (renamed polished_contigs.fasta) with Medaka

   #Run medaka model r941_prom_high_g4011: 2_medaka.sh on zeus highmemq took ~14h 
   medaka_consensus -i ../0_raw_reads/nanopore/nanopore.fastq -d 1_racon_out/1_necat_racon_100.fa -o 2_medaka_out -t 16 -m r941_prom_high_g4011

consensus.fasta:

num_seqs sum_len min_len avg_len max_len
1,925 771,431,749 402 400,743.8 11,745,170

BUSCO Embryophyta score: C:98.7%[S:90.6%,D:8.1%],F:0.7%,M:0.6%,n:1614

  • Genome_v1 after Medaka was C:98.4%[S:91.1%,D:7.3%],F:0.9%,M:0.7%,n:1614


2. Polish with Pilon (v1.24)

   #Split necat.fasta into 20 folders and append ids to 3_pilon.conf (need to manually edit)
   seqkit split -p 20 necat.fasta 
   cat necat.part_$file.fasta | grep ">" | tr -d ">" | tr '\n' ',' >> ../3_pilon.conf 
   
   #Run Pilon correcting only SNPs/indels (same as Telopea paper): 3_pilon.conf on zeus workq took up to 3 hrs each
   #Main command
   pilon --genome ../2_medaka_out/consensus.fasta --frags consensus_illumina.bam --chunksize 50000 --fix snps,indels
   
   #Optional arguments
   --outdir part_001 --targets bctg00000015,bctg00000013…
   …
   --outdir part_020 --targets bctg00001830,bctg00001828,bctg00001858…

pilon.fa:

num_seqs sum_len min_len avg_len max_len
1,925 769,832,288 405 399,912.9 11,711,808

BUSCO Embryophyta score: C:98.9%[S:90.5%,D:8.4%],F:0.6%,M:0.5%,n:1614

  • Genome_v1 after NextPolish was C:98.7%[S:90.6%,D:8.1%],F:0.7%,M:0.6%,n:1614


3. Run PurgeHaplotigs https://bitbucket.org/mroachawri/purge_haplotigs/src/master/

   #Purge Haplotigs script below: 4_purgehap.sh on zeus took 3h 
   
   #Map Nanopore reads to genome assembly
   minimap2 -t 28 -ax map-ont ../3_pilon_out/pilon.fa ../../0_raw_reads/nanopore/nanopore.fastq --secondary=no | samtools sort -@ 28 -m 2G -o aligned.bam -T tmp.ali
   
   #Make coverage graph
   purge_haplotigs hist -b aligned.bam -g ../3_pilon_out/pilon.fa -threads 28
   
   #Generate dotplots
   purge_haplotigs purge -g ../3_pilon_out/pilon.fa -c coverage_stats.csv -t 28 -dotplots -b aligned.bam
   
   purge_haplotigs purge -g ../3_pilon_out/pilon.fa -o curated_90 -c coverage_stats.csv -t 28 -dotplots -align_cov 90 -b aligned.bam
   
   #Analyse contig coverages. Default -j and -s are 80 
   purge_haplotigs cov -i aligned.bam.gencov -l 15 -m 55 -h 115 -o coverage_stats.csv 
   
   #I also tried PurgeHaplotigs with align_cov 70% (default), 80%, 85%, 95%, but 90% had the best BUSCO score

PurgeHaplotigs historgram: https://github.com/bennestor/hakea_genome/assets/7893322/90ee5bff-eee8-48d7-aaaa-8eb1e34f2932

curated_90.fasta:

num_seqs sum_len avg_len max_len N50
1,185 712,312,751 485 601,107.8 11,711,808

BUSCO Embryophyta score: C:98.9%[S:92.0%,D:6.9%],F:0.6%,M:0.5%,n:1614 Genome_v2 after PurgeHaplotigs was C:96.4%[S:90.3%,D:6.1%],F:2.2%,M:1.4%,n:1614

⚠️ **GitHub.com Fallback** ⚠️