5_Annotation - bennestor/hakea_genome GitHub Wiki

Table of Contents

Annotating Hakea genome

I used the repolished Hakea genome_v2 for repeat masking and annotation

Repeat masking

   #Install RepeatMasker v4.1.4 by conda
   
   #Download RepeatMasker v4.1.2-p1 from source [https://www.repeatmasker.org/RepeatMasker/]
   
   #Download Repbase 2018-1026 (Subscriber only at the moment but might be available as shared data on setonix. I got it off Shri)
   #Had to conda install h5py and changed bash shebang in famdb.py to /group/pawsey0149/bnestor/conda/anaconda3/envs/ltr_retriever/bin/python
   
   #perl ./configure in RepeatMasker folder
   #Choose rmblast as default. Path is conda path to bin.
   
   #Run RepeatModeler (repeats library) and RepeatMasker (mask repeats): 6_repeats.sh on zeus highmemq took 36h
      #Construct repeats library
      BuildDatabase -name hakea ../5_ragtag_out/telopea_asm20/ragtag.scaffold.fasta
      RepeatModeler -database hakea -pa 15
   
      #Run repeat masker
      ../0_programs/RepeatMasker/RepeatMasker -pa 4 -lib hakea-families.fa –xsmall ../5_ragtag_out/telopea_asm20/ragtag.scaffold.fasta

ragtag_mask.fa (renamed ragtag.scaffold.fasta.masked):

Sequences 334
Total length 712397851 bp
GC level 38.81 %
Bases masked 425851780 bp ( 59.78 %)

Gene prediction

1. Install GeMoMa v1.9 by conda and download jar file http://www.jstacs.de/index.php/GeMoMa

2. Download NCBI RefSeq "Arabidopsis thaliana", "Macadamia integrifolia", and "Nelumbo nucifera" GFFs, and Chen et al 2021 "Telopea speciosissima" GFF https://datadryad.org/stash/dataset/doi:10.5061/dryad.12jm63xzt

3. Run GeMoMa

      #GeMoMa command
      java -Xms10G -Xmx120G -jar $GM/0_programs/Gemoma/GeMoMa-1.9.jar CLI GeMoMaPipeline threads=16 GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=$GM/6_repeats_out/ragtag_mask.fa p=true pc=true pgr=true sc=false outdir=7_gemoma_out s=own i=Ts a=Tspe_v1/Tspe_v1_gemomaannotation.gff g=Tspe_v1/Tspe_v1.fa s=own i=Mi a=GCF_013358625.1_SCU_Mint_v3_genomic.gff g=GCF_013358625.1_SCU_Mint_v3_genomic.fna s=own i=Nn a=GCF_000365185.1_Chinese_Lotus_1.1_genomic.gff g=GCF_000365185.1_Chinese_Lotus_1.1_genomic.fna s=own i=At a=GCF_000001735.4_TAIR10.1_genomic.gff g=GCF_000001735.4_TAIR10.1_genomic.fna
   
   #Combine predictions from different species 
   java -Xms10G -Xmx120G -jar ../0_programs/Gemoma/GeMoMa-1.9.jar CLI GAF g=final_annotation.gff 
   
   #Extract protein sequences 
   java -Xms10G -Xmx120G -jar ../0_programs/Gemoma/GeMoMa-1.9.jar CLI Extractor a=filtered_predictions.gff g=../6_repeats_out/ragtag_mask.fa p=true c=true outdir=1_gaf_extractor_out 
   # -> 52,803 protein sequences (10 were discarded from GFF due to Ns in sequence)

proteins.fasta:

num_seqs sum_len min_len avg_len max_len
52,803 16,691,424 13 316.1 5,444

BUSCO Embryophyta score: C:97.5%[S:90.1%,D:7.4%],F:1.5%,M:1.0%,n:1614

  • Genes predicted from genome_v1 were C:93.7%[S:81.7%,D:12.0%],F:2.7%,M:3.6%,n:1614

Remove contaminate non-plant sequences

Installing databases

   #Get embryophyta taxids  
   get_species_taxids.sh -n Embryophyta #gives 3193 
   get_species_taxids.sh -t 3193 > embryophyta.taxids 
   
   #Split gemoma_gaf.fa (renamed 1_gaf_extractor_out/proteins.fasta) into 50 parts 
   seqkit split -p 50  gemoma_gaf.fa 
   
   #Download NR database files 
   wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip  
   wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz  
   #prot.accession2taxid.FULL took 8h

Run Diamond BLAST

   #Make Diamond BLAST database: 2_makedb.sh took 3h
   ../0_programs/diamond makedb --threads 64 --taxonmap prot.accession2taxid.FULL --taxonnodes nodes.dmp --taxonnames names.dmp --in /scratch/references/diamond/nr.gz --db nr
   #Run BLAST using array script (/software/projects/pawsey0149/groupEnv/ivec/job_templates/setonix_array.sh) : 8_nrblast.array.sh took ~6h
      #Array commands:
       #SBATCH --array=0-4
       #SBATCH --cpus-per-task=24
       #SBATCH --mem=50G   
       FILES=($(\ls *.fa))
       FILENAME=${FILES[$SLURM_ARRAY_TASK_ID
      #Job command:
       0_programs/diamond blastp --db nr_db/nr.dmnd --sensitive --max-hsps 1 --max-target-seqs 1 --outfmt 6 qseqid sseqid staxids evalue --threads 24 --evalue 0.001 -q 7_gemoma_out/gemoma_gaf.fa.split/${FILENAME} --out 8_nrblast_out/${FILENAME}.blast 

Filter predicted proteins

   #Filter predicted proteins for Embryophyta hits
   seqkit grep -i -f nr_top1Emb.blast ../7_gemoma_out/gemoma_gaf.fa -o gemoma_gaf_nr.fa 

gemoma_gaf_nr.fa

num_seqs sum_len min_len avg_len max_len
43,819 16,167,687 21 369 5,444

BUSCO Embryophyta score: C:97.5%[S:90.1%,D:7.4%],F:1.5%,M:1.0%,n:1614

  • Filtered predicted proteins from genome_v1 were C:93.7%[S:81.7%,D:12.0%],F:2.7%,M:3.6%,n:1614

Add functional annotations

BLASTP for functional annotations

   #Run BLASTP with SwissProt database: 2_swissprot.conf on zeus took ~2h
   BLASTDB="/group/data/blastdb_update/blast-2022-11-01/db"
   blastp -db swissprot -max_target_seqs 1 -max_hsps 1 -outfmt "6 qseqid evalue stitle" -num_threads 28 -query ../7_gemoma_out/gemoma_gaf.fa -out gemoma_gaf_swissprot.blast

Add annotations to gemoma_gaf_nr.fa

   #Filter E-values to less than 0.01 
   cat gemoma_gaf_swissprot.blast | awk '$2<0.01{print $0}' | cut -f 1,3  > gemoma_gaf_swissprot_0.01.blast 
   #35691 sequences annotated 
   
   #Filter protein ids to those annotated  
   grep -Fw -f gemoma_gaf.ids gemoma_gaf_swissprot_0.01.blast > gemoma_swissprot_filt.blast 
   
   #Add to fasta file 
   seqkit replace --keep-key -p "^(.+)" -r '$1 {kv}' -k gemoma_swissprot_filt.blast ../7_gemoma_out/gemoma_gaf.fa > gemoma_gaf_SWannot.fa  
   
   #Added to filtered fasta
   seqkit replace --keep-key -p "^(.+)" -r '$1 {kv}' -k gemoma_swissprot_filt.blast ../8_nrblast_out/gemoma_gaf_nr.fa > gemoma_gaf_nr_SWannot.fa 
   #35691 sequences annotated 
⚠️ **GitHub.com Fallback** ⚠️