3_Polishing_v2 - bennestor/hakea_genome GitHub Wiki
Inaccuracies in the polishing of the Hakea genome made PHT1 gene prediction difficult. I decided to repolish the genome and redo annotation to make identification of nitrate transporters easier. The new polishing method was based on recent Proteaceae assembly papers for Telopea speciosissima (Chen et al. 2022 https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13574) and Knightia excelsa (McCartney et al. 2021 https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13406).
1. Polish necat.fasta (renamed polished_contigs.fasta) with Racon
#Map nanopore reads to necat.fasta with Minimap2 (v2.24-r1122) minimap2 -x map-ont -t 16 ../necat.fasta ../../0_raw_reads/nanopore/nanopore.fastq > 1_mapping.paf #Run Racon: 1_racon.sh on zeus highmemq took 16h #longest contig length = 11707852 and genome length = 768992702 so I used --split 11707852 and 100 coverage racon_wrapper -m 8 -x -6 -g -8 -w 500 -t 16 --split 11707852 ../../0_raw_reads/nanopore/nanopore.fastq ./1_mapping.paf ../necat.fasta > ./ 1_necat_racon_100.fa #I also tried running 4 iterations of Racon with --subsample 768992702 50 for 50% coverage and without –split, but the BUSCO complete score was only 81.7%
BUSCO Embryophyta score: C:98.0%[S:90.3%,D:7.7%],F:1.2%,M:0.8%,n:1614
2. Polish necat.fasta (renamed polished_contigs.fasta) with Medaka
#Run medaka model r941_prom_high_g4011: 2_medaka.sh on zeus highmemq took ~14h medaka_consensus -i ../0_raw_reads/nanopore/nanopore.fastq -d 1_racon_out/1_necat_racon_100.fa -o 2_medaka_out -t 16 -m r941_prom_high_g4011
consensus.fasta:
num_seqs | sum_len | min_len | avg_len | max_len |
---|---|---|---|---|
1,925 | 771,431,749 | 402 | 400,743.8 | 11,745,170 |
BUSCO Embryophyta score: C:98.7%[S:90.6%,D:8.1%],F:0.7%,M:0.6%,n:1614
- Genome_v1 after Medaka was C:98.4%[S:91.1%,D:7.3%],F:0.9%,M:0.7%,n:1614
2. Polish with Pilon (v1.24)
#Split necat.fasta into 20 folders and append ids to 3_pilon.conf (need to manually edit) seqkit split -p 20 necat.fasta cat necat.part_$file.fasta | grep ">" | tr -d ">" | tr '\n' ',' >> ../3_pilon.conf #Run Pilon correcting only SNPs/indels (same as Telopea paper): 3_pilon.conf on zeus workq took up to 3 hrs each #Main command pilon --genome ../2_medaka_out/consensus.fasta --frags consensus_illumina.bam --chunksize 50000 --fix snps,indels #Optional arguments --outdir part_001 --targets bctg00000015,bctg00000013… … --outdir part_020 --targets bctg00001830,bctg00001828,bctg00001858…
pilon.fa:
num_seqs | sum_len | min_len | avg_len | max_len |
---|---|---|---|---|
1,925 | 769,832,288 | 405 | 399,912.9 | 11,711,808 |
BUSCO Embryophyta score: C:98.9%[S:90.5%,D:8.4%],F:0.6%,M:0.5%,n:1614
- Genome_v1 after NextPolish was C:98.7%[S:90.6%,D:8.1%],F:0.7%,M:0.6%,n:1614
3. Run PurgeHaplotigs https://bitbucket.org/mroachawri/purge_haplotigs/src/master/
#Purge Haplotigs script below: 4_purgehap.sh on zeus took 3h #Map Nanopore reads to genome assembly minimap2 -t 28 -ax map-ont ../3_pilon_out/pilon.fa ../../0_raw_reads/nanopore/nanopore.fastq --secondary=no | samtools sort -@ 28 -m 2G -o aligned.bam -T tmp.ali #Make coverage graph purge_haplotigs hist -b aligned.bam -g ../3_pilon_out/pilon.fa -threads 28 #Generate dotplots purge_haplotigs purge -g ../3_pilon_out/pilon.fa -c coverage_stats.csv -t 28 -dotplots -b aligned.bam purge_haplotigs purge -g ../3_pilon_out/pilon.fa -o curated_90 -c coverage_stats.csv -t 28 -dotplots -align_cov 90 -b aligned.bam #Analyse contig coverages. Default -j and -s are 80 purge_haplotigs cov -i aligned.bam.gencov -l 15 -m 55 -h 115 -o coverage_stats.csv #I also tried PurgeHaplotigs with align_cov 70% (default), 80%, 85%, 95%, but 90% had the best BUSCO score
PurgeHaplotigs historgram: https://github.com/bennestor/hakea_genome/assets/7893322/90ee5bff-eee8-48d7-aaaa-8eb1e34f2932
curated_90.fasta:
num_seqs | sum_len | avg_len | max_len | N50 |
---|---|---|---|---|
1,185 | 712,312,751 | 485 | 601,107.8 | 11,711,808 |
BUSCO Embryophyta score: C:98.9%[S:92.0%,D:6.9%],F:0.6%,M:0.5%,n:1614 Genome_v2 after PurgeHaplotigs was C:96.4%[S:90.3%,D:6.1%],F:2.2%,M:1.4%,n:1614