2_Polishing_v1 - bennestor/hakea_genome GitHub Wiki
I originally polished the assembly following the pipeline for a cherry tree NECAT assembly by Wang 2020 https://www.nature.com/articles/s41438-020-00343-8. The methods for this are listed below. However I later went back and repolished the genome after discovering several regions were incorrectly polished and had resulted in poor gene prediction.
1. Polish necat.fasta (renamed polished_contigs.fasta) with medaka
1. Install Medaka v1.4.3 with conda
2. Run Medaka model r941_prom_high_g4011
#Run Medaka: 1_medaka.sh on zeus longq (28,120G,96h) took ~1 day medaka_consensus -i ../0_raw_reads/nanopore/nanopore.fastq -d necat_assembly/necat.fasta -o 1_medaka_out -t 28 -m r941_prom_high_g4011
consensus.fasta:
num_seqs | sum_len | min_len | avg_len | max_len | N50 |
---|---|---|---|---|---|
1,925 | 770,178,841 | 530 | 400,092.9 | 11,713,575 | 1,049,695 |
BUSCO Embryophyta score: C:98.4%[S:91.1%,D:7.3%],F:0.9%,M:0.7%,n:1614
2. Polish with NextPolish
1. Install NextPolish v1.3.1 as singularity image in 0_programs
#Pull docker image singularity pull docker://pvstodghill/nextpolish #Test NextPolish singularity run -B nextpolish_latest.sif nextPolish test_data/run.cfg
2. Run NextPolish
#Run Nextpolish with my own run.cfg: 2_nextpolish.sh on zeus longq (28,120G,96h) took ~1 day nextPolish run.cfg #https://nextpolish.readthedocs.io/en/latest/QSTART.html#quick-start
genome.nextpolish.fasta:
num_seqs | sum_len | min_len | avg_len | max_len | N50 |
---|---|---|---|---|---|
1,925 | 771,526,636 | 533 | 400,739.1 | 11,725,650 | 1,046,710 |
BUSCO Embryophyta score: C:98.7%[S:90.6%,D:8.1%],F:0.7%,M:0.6%,n:1614
3. Purgehaplotigs to remove heterozygous sequences (alleles)
1. Install PurgeHaplotigs https://bitbucket.org/mroachawri/purge_haplotigs/src/master/
2. Run purgehaplotigs
#Make coverage graph: 4_PurgeHap_cov.sh on magnus (24,2h) took 30min purge_haplotigs hist -b 3_PurgeHap_map_out/aligned.bam -g necat.fasta -threads 24
low = 10, midpoint = 62, high = 140
#Purge contigs: 6_PurgeHap_purge.sh on magnus (24,24h) took 35min purge_haplotigs purge -g ../necat.fasta -c ../5_PurgeHap_contig_out/coverage_stats.csv -t 24 -dotplots -b ../3_PurgeHap_map_out/aligned.bam #Nine suspect contigs in dotplots. I can't really tell if they are haplotigs or not so I will leave them.
curated.fasta:
num_seqs | sum_len | avg_len | max_len | N50 |
---|---|---|---|---|
1,011 | 710,589,062 | 773 | 702,857.6 | 1,180,705 |
BUSCO Embryophyta Score: C:96.4%[S:90.3%,D:6.1%],F:2.2%,M:1.4%,n:1614
4. Remove non-embryophyta sequences from necat_polished.fasta (renamed curated.fasta)
1. Separate fasta file into sub files with 100 sequences
seqkit split necat_polished.fasta -s 100
2. Install blast 2.12.0 into directory
3. Install NT database
#Download: 7a_downloadNT.sh on magnus (24,24h) took 5h ../0_programs/ncbi-blast-2.12.0+/bin/update_blastdb.pl --decompress nt
5. Install taxdump database (not sure if needed)
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
6. Get embryophyta taxids
#Get species tax ids for Embryophyta lineage get_species_taxids.sh -n Embryophyta #gives 3193 get_species_taxids.sh -t 3193 > embryophyta.taxids
7. BLASTn of contigs to NT with 1 max hits and 1 max hsp
#Cat all blast output together cat *.blast > necat_polished.blast #1003 sequences #Find sequences where top hit is Embryophyta cat necat_polished.blast | grep -Fw -f ../embryophyta.taxids | cut -f1 | sort -u > necat_polished_emb.blast #1003 sequences #Filter to Embryophyta sequences seqkit grep -i -f 7b_blastn_out/necat_polished_emb.blast necat_polished.fasta -o necat_polished_emb.fasta
num_seqs | sum_len | min_len | avg_len | max_len | N50 |
---|---|---|---|---|---|
1,003 | 710, 572, 321 | 817 | 708,447 | 11,707,852 | 1,180,705 |
BUSCO Embryophyta score: C:96.4%[S:90.3%,D:6.1%],F:2.2%,M:1.4%,n:1614