2_Polishing_v1 - bennestor/hakea_genome GitHub Wiki

Polishing genome_v1

I originally polished the assembly following the pipeline for a cherry tree NECAT assembly by Wang 2020 https://www.nature.com/articles/s41438-020-00343-8. The methods for this are listed below. However I later went back and repolished the genome after discovering several regions were incorrectly polished and had resulted in poor gene prediction.

1. Polish necat.fasta (renamed polished_contigs.fasta) with medaka

1. Install Medaka v1.4.3 with conda

2. Run Medaka model r941_prom_high_g4011

   #Run Medaka: 1_medaka.sh on zeus longq (28,120G,96h) took ~1 day
   medaka_consensus -i ../0_raw_reads/nanopore/nanopore.fastq -d necat_assembly/necat.fasta -o 1_medaka_out -t 28 -m r941_prom_high_g4011 

consensus.fasta:

num_seqs sum_len min_len avg_len max_len N50
1,925 770,178,841 530 400,092.9 11,713,575 1,049,695

BUSCO Embryophyta score: C:98.4%[S:91.1%,D:7.3%],F:0.9%,M:0.7%,n:1614

2. Polish with NextPolish

1. Install NextPolish v1.3.1 as singularity image in 0_programs

   #Pull docker image
   singularity pull docker://pvstodghill/nextpolish
   #Test NextPolish
   singularity run -B nextpolish_latest.sif nextPolish test_data/run.cfg

2. Run NextPolish

   #Run Nextpolish with my own run.cfg: 2_nextpolish.sh on zeus longq (28,120G,96h) took ~1 day
   nextPolish run.cfg
   #https://nextpolish.readthedocs.io/en/latest/QSTART.html#quick-start

genome.nextpolish.fasta:

num_seqs sum_len min_len avg_len max_len N50
1,925 771,526,636 533 400,739.1 11,725,650 1,046,710

BUSCO Embryophyta score: C:98.7%[S:90.6%,D:8.1%],F:0.7%,M:0.6%,n:1614

3. Purgehaplotigs to remove heterozygous sequences (alleles)

1. Install PurgeHaplotigs https://bitbucket.org/mroachawri/purge_haplotigs/src/master/

2. Run purgehaplotigs

   #Make coverage graph: 4_PurgeHap_cov.sh on magnus (24,2h) took 30min
   purge_haplotigs hist -b 3_PurgeHap_map_out/aligned.bam -g necat.fasta -threads 24
   

low = 10, midpoint = 62, high = 140

   #Purge contigs: 6_PurgeHap_purge.sh on magnus (24,24h) took 35min 
   purge_haplotigs purge -g ../necat.fasta -c ../5_PurgeHap_contig_out/coverage_stats.csv -t 24 -dotplots -b ../3_PurgeHap_map_out/aligned.bam
   #Nine suspect contigs in dotplots. I can't really tell if they are haplotigs or not so I will leave them.

curated.fasta:

num_seqs sum_len avg_len max_len N50
1,011 710,589,062 773 702,857.6 1,180,705

BUSCO Embryophyta Score: C:96.4%[S:90.3%,D:6.1%],F:2.2%,M:1.4%,n:1614

4. Remove non-embryophyta sequences from necat_polished.fasta (renamed curated.fasta)

1. Separate fasta file into sub files with 100 sequences

   seqkit split necat_polished.fasta -s 100 

2. Install blast 2.12.0 into directory

3. Install NT database

   #Download: 7a_downloadNT.sh on magnus (24,24h) took 5h
   ../0_programs/ncbi-blast-2.12.0+/bin/update_blastdb.pl --decompress nt 

5. Install taxdump database (not sure if needed)

   wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz 

6. Get embryophyta taxids

   #Get species tax ids for Embryophyta lineage
   get_species_taxids.sh -n Embryophyta #gives 3193 
   get_species_taxids.sh -t 3193 > embryophyta.taxids 

7. BLASTn of contigs to NT with 1 max hits and 1 max hsp

   #Cat all blast output together
   cat *.blast > necat_polished.blast #1003 sequences 
   
   #Find sequences where top hit is Embryophyta 
   cat necat_polished.blast | grep -Fw -f ../embryophyta.taxids | cut -f1 | sort -u > necat_polished_emb.blast #1003 sequences 
   
   #Filter to Embryophyta sequences
   seqkit grep -i -f 7b_blastn_out/necat_polished_emb.blast necat_polished.fasta -o necat_polished_emb.fasta 
num_seqs sum_len min_len avg_len max_len N50
1,003 710, 572, 321 817 708,447 11,707,852 1,180,705

BUSCO Embryophyta score: C:96.4%[S:90.3%,D:6.1%],F:2.2%,M:1.4%,n:1614

⚠️ **GitHub.com Fallback** ⚠️