Discula destructiva Genome Annotation

Background

  • Annotation will be run using the genome assembly that has the mitochondrial contig appended to the end of the nuclear assembly: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/13_final_assembly_stats_postgapclosing/dd_as111_100x_final_nu_final_mt_combined.fasta

Pipeline

01. RepeatModeler (version 2.0.5)

Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/01_repeatmodeler

Documentation: https://github.com/Dfam-consortium/RepeatModeler

  1. Conda install RepeatModeler
conda create -n repeatmodeler -c conda-forge -c bioconda repeatmodeler
conda activate repeatmodeler
  2. Link the assembly fasta
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/13_final_assembly_stats_postgapclosing/dd_as111_100x_final_nu_final_mt_combined.fasta .
  3. Create a database for RepeatModeler
nano 01_make_database.qsh
#!/bin/bash
#SBATCH --job-name=rm_make_database
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus
#SBATCH --qos=campus
#SBATCH --time=5:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"
conda activate repeatmodeler

BuildDatabase \
        -name discula \
        ../dd_as111_100x_final_nu_final_mt_combined.fasta
  • -name: whatever you want to name the database that you're creating
sbatch 01_make_database.qsh
  4. Run RepeatModeler
nano 02_run_repeatmodeler.qsh
#!/bin/bash
#SBATCH --job-name=run_repeatmodeler
#SBATCH --nodes=1
#SBATCH --ntasks=50
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"
conda activate repeatmodeler

RepeatModeler \
        -database discula \
        -threads 50 \
        -LTRStruct \
        >& run.out
sbatch 02_run_repeatmodeler.qsh
  • -LTRStruct: the LTR structural discovery pipeline (LTR_Harvest and LTR_retriever) gets combined with results from the RepeatScout/RECON pipeline. (This option is described in the RepeatModeler documentation, so I'm trying it.)
  • families.fa file: Consensus sequences for each family identified.
  • families.stk file: Seed alignments for each family identified.
  • rmod.log: Execution log. Useful for reproducing results.
  5. Get stats
nano 03_stats.qsh
# Step 1: Extract sequence names from the FASTA file
grep "^>" discula-families.fa | awk '{print substr($0, 2)}' > repeat_names.txt

# Step 2: Extract repeat IDs and count occurrences
awk 'BEGIN {FS = "#"} {print $2}' repeat_names.txt | awk '{a[$1]++} END {for(k in a) {print k, a[k]}}' > repeat_counts.txt
bash 03_stats.qsh
02. RepeatMasker (version 4.1.3)

Working Directory: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker

I tried to install RepeatMasker on Isaac and eventually gave up because of how long it was taking to troubleshoot. Also, a conda install is not recommended by the developers (things get left out of the install), and there are a bunch of dependencies, so I just used Sphinx. Below is everything I initially tried on Isaac before moving to Sphinx; I'm keeping it here in case it's needed in the future.

## Download and Unpack RepeatMasker
cd /lustre/isaac/proj/UTK0032/sniece/software
mkdir RepeatMasker
cd RepeatMasker
wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.6.tar.gz
tar -xzvf RepeatMasker-4.1.6.tar.gz

## Download and Install TRF
# do the following within the RepeatMasker directory
mkdir trf
cd trf
wget https://github.com/Benson-Genomics-Lab/TRF/archive/refs/tags/v4.09.1.tar.gz
tar -xzvf v4.09.1.tar.gz

## Download RMBlast search engine
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker
wget https://www.repeatmasker.org/rmblast/rmblast-2.14.1+-x64-linux.tar.gz
tar -xzvf rmblast-2.14.1+-x64-linux.tar.gz

## Download the complete DFAM database
# https://www.dfam.org/releases/Dfam_3.8/families/FamDB/README.txt has more info on which specific family db you should download based on your organism. I work with a fungus, so I'm downloading partition 0.
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries
wget https://www.dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.0.h5.gz
mv dfam38_full.0.h5.gz famdb/
gunzip famdb/dfam38_full.0.h5.gz

## Download the RepBase RepeatMasker Edition
# This seems to be a paid subscription, but we have the `RMRB.fasta` on pickett_shared, so I just secure copied that to my Isaac directory.
# In Sphinx: 
cd /pickett_shared/software/RepeatMasker/Libraries
scp *RMRB* '[email protected]:/lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries'

## Configure RepeatMasker
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker
perl ./configure
  1. Pull out all the fungi sequences from the dfam library. Per Trinity's documentation, Meg made a script to do this for White Oak on Sphinx. I'm just going to copy the entire RepeatMasker folder in pickett_shared to my own software directory in pickett_sphinx and run the command to get the fungi sequences instead of the eudicotyledons.
# In Sphinx
cd /pickett_sphinx/projects/lwy647/software
cp -r /pickett_shared/software/RepeatMasker .
cd RepeatMasker/Libraries
python3 ../famdb.py -i Dfam.h5 families --format fasta_name --include-class-in-name --ancestors --descendants 'fungi' > fungi-rm.fa
ls -lh #to make sure the file isn't empty
scp fungi-rm.fa '[email protected]:/lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries'
  2. Merge all the repeat libraries
cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
nano 01_merge_repeat_libs.sh
cat /pickett_sphinx/projects/lwy647/software/RepeatMasker/Libraries/fungi-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta ./discula-families.fa > discula_totalRepeatLib.fa
  3. Copy the genome assembly over from Isaac
## In Isaac
cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation
scp dd_as111_100x_final_nu_final_mt_combined.fasta '[email protected]:/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation'
  4. Run RepeatMasker to soft mask the genome
nano 02_run_repeatmasker.qsh
/pickett_shared/software/RepeatMasker/RepeatMasker \
        -lib discula_totalRepeatLib.fa \
        -e rmblast \
        -pa 4 \
        -nolow \
        -xsmall \
        -gff \
        ../dd_as111_100x_final_nu_final_mt_combined.fasta \
        >& dd_1.0.0_RMasker.out
screen -S RepeatMasker
bash 02_run_repeatmasker.qsh
  5. Get % of genome that was soft masked
cp /home/lwy647/scripts/calcPercentMasked_Chr-vs-Scaff.py /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
nano 03_calcPercentMasked_Chr-vs-Scaff.sh
python calcPercentMasked_Chr-vs-Scaff.py ../dd_as111_100x_final_nu_final_mt_combined.fasta.masked

kept getting "divide by zero" error but let's look at the dd_as111_100x_final_nu_final_mt_combined.fasta.tbl

==================================================
file name: dd_as111_100x_final_nu_final_mt_combined.fasta
sequences:             9
total length:   46887432 bp  (46887432 bp excl N/X-runs)
GC level:         49.89 %
bases masked:    7212922 bp ( 15.38 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements         4745      5821510 bp   12.42 %
   SINEs:               15          622 bp    0.00 %
   Penelope            125         6389 bp    0.01 %
   LINEs:             1376       360136 bp    0.77 %
    CRE/SLACS           12          622 bp    0.00 %
     L2/CR1/Rex         98         6357 bp    0.01 %
     R1/LOA/Jockey     290        17415 bp    0.04 %
     R2/R4/NeSL         42         3131 bp    0.01 %
     RTE/Bov-B          58         3663 bp    0.01 %
     L1/CIN4           114         6591 bp    0.01 %
   LTR elements:      3354      5460752 bp   11.65 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia        1385      2964934 bp    6.32 %
     Gypsy/DIRS1       999      2184402 bp    4.66 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       2611       530880 bp    1.13 %
   hobo-Activator      488        26366 bp    0.06 %
   Tc1-IS630-Pogo      591       422869 bp    0.90 %
   En-Spm                0            0 bp    0.00 %
   MULE-MuDR           213        11588 bp    0.02 %
   PiggyBac             19          895 bp    0.00 %
   Tourist/Harbinger    96         5129 bp    0.01 %
   Other (Mirage,       19         1073 bp    0.00 %
    P-element, Transib)

Rolling-circles        204        12753 bp    0.03 %

Unclassified:         4716       778781 bp    1.66 %

Total interspersed repeats:     7131171 bp   15.21 %


Small RNA:              83        61588 bp    0.13 %

Satellites:             88         5929 bp    0.01 %
Simple repeats:         25         1895 bp    0.00 %
Low complexity:          0            0 bp    0.00 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


RepeatMasker version 4.1.3-p1 , default mode

run with rmblastn version 2.14.0+
The query was compared to classified sequences in "discula_totalRepeatLib.fa"
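Since the Python script kept failing, here is a minimal awk sketch for computing the overall soft-masked fraction directly from the masked fasta (assuming lowercase a/c/g/t/n mark the soft-masked bases, which is what -xsmall produces). The number should land close to the "bases masked" percentage reported in the .tbl above.

awk '!/^>/ {total += length($0); masked += gsub(/[acgtn]/, "")} END {printf "%.2f%% of %d bp soft masked\n", 100 * masked / total, total}' ../dd_as111_100x_final_nu_final_mt_combined.fasta.masked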
03. STAR (version 2.7.11b)

Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR

  1. Download STAR
cd /lustre/isaac/proj/UTK0032/sniece/software
mkdir STAR
cd STAR
wget https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz
tar -xzf 2.7.11b.tar.gz
  • Path to STAR executable: /lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR
  2. Link all the RNAseq reads for AS111 to the working directory and cat them all into R1 & R2
cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/AS111_RNAseq_reads
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_RNAseq/analysis/01_read_renaming/*AS111*.fastq.gz .

Now merge all the R1 files together and do the same for the R2 files.

cat *R1.fastq.gz > AS111_merged_R1.fastq.gz
cat *R2.fastq.gz > AS111_merged_R2.fastq.gz

Now remove the symlinked fastq files since they're no longer needed.

rm *_combined_*
  3. Index the soft masked genome

First, make a directory to hold the indexed genome

cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR
mkdir dd_genome_index

Next, secure copy the masked genome assembly from Sphinx to the current directory in Isaac.

# In Sphinx:
cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
scp dd_as111_100x_final_nu_final_mt_combined.fasta.masked '[email protected]:/lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR'

Next, index the genome using STAR

nano 01_index_genome.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
        --runMode genomeGenerate \
        --genomeDir dd_genome_index \
        --genomeSAindexNbases 11 \
        --genomeFastaFiles dd_as111_100x_final_nu_final_mt_combined.fasta.masked \
        --runThreadN 40
  • --genomeSAindexNbases: min(14, log2(genome length) / 2 - 1) => log2(47,000,000) / 2 - 1 ≈ 11.7, so 11 (a quick check is sketched after the sbatch command below)
sbatch 01_index_genome.qsh
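To double-check --genomeSAindexNbases for a different assembly, the formula above can be evaluated with a one-liner (46,887,432 bp is the total length reported in the RepeatMasker .tbl above; swap in your own genome length):

awk 'BEGIN {v = log(46887432) / log(2) / 2 - 1; if (v > 14) v = 14; print int(v)}'
# prints 11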
  4. STAR Mapping RNAseq Data
nano 02_STAR_RNAseq_mapping.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=150G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
        --genomeDir dd_genome_index \
        --readFilesIn ./AS111_RNAseq_reads/AS111_merged_R1.fastq.gz ./AS111_RNAseq_reads/AS111_merged_R2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix dd_as111-rna \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --limitBAMsortRAM 107374182400 \
        --runThreadN 40 \
>& star_dd_as111.out

I had to increase the maximum number of open file descriptors. STAR required a higher limit than the current setting.

## Check the current limit
ulimit -n
# 1024

## Change limit
ulimit -n 4096
sbatch 02_STAR_RNAseq_mapping.qsh
  • Upon successful completion, STAR will generate a file named <prefix>Log.final.out with stats about mapping.
  • Here is my dd_as111-rnaLog.final.out:
                                 Started job on |	Jun 05 10:42:04
                             Started mapping on |	Jun 05 10:42:04
                                    Finished on |	Jun 05 11:46:07
       Mapping speed, Million of reads per hour |	183.49

                          Number of input reads |	195877513
                      Average input read length |	302
                                    UNIQUE READS:
                   Uniquely mapped reads number |	166801805
                        Uniquely mapped reads % |	85.16%
                          Average mapped length |	297.49
                       Number of splices: Total |	77102939
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	72226779
                       Number of splices: GC/AG |	4543990
                       Number of splices: AT/AC |	24329
               Number of splices: Non-canonical |	307841
                      Mismatch rate per base, % |	0.33%
                         Deletion rate per base |	0.01%
                        Deletion average length |	1.54
                        Insertion rate per base |	0.00%
                       Insertion average length |	1.37
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	2002023
             % of reads mapped to multiple loci |	1.02%
        Number of reads mapped to too many loci |	1006559
             % of reads mapped to too many loci |	0.51%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	26036620
                 % of reads unmapped: too short |	13.29%
                Number of reads unmapped: other |	30506
                     % of reads unmapped: other |	0.02%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

All of the above commands and outputs for STAR were placed into a directory named AS111_star_outputs at /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR. I eventually found out during BRAKER3 that the RNAseq reads from the AS111 isolate alone were not enough, so I concatenated the R1 reads from every isolate/treatment (and did the same for R2) to provide enough RNAseq data for BRAKER3; a sketch of the merge commands is below.
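The merge commands for the all-isolate files were not captured above; a minimal sketch of how they could be produced (the all_isolates_RNAseq_reads directory and the *R1/*R2 globs are assumptions based on the AS111 merge in step 2, and the output names match the STAR command in the next step, typo and all):

cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR
mkdir all_isolates_RNAseq_reads
cd all_isolates_RNAseq_reads
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_RNAseq/analysis/01_read_renaming/*.fastq.gz .
cat *R1.fastq.gz > all_isoaltes_all_treatments_merged_R1.fastq.gz
cat *R2.fastq.gz > all_isoaltes_all_treatments_merged_R2.fastq.gz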

  5. Re-run STAR with all RNAseq data
nano 02_STAR_RNAseq_mapping.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=150G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
        --genomeDir dd_genome_index \
        --readFilesIn /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/all_isolates_RNAseq_reads/all_isoaltes_all_treatments_merged_R1.fastq.gz /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/all_isolates_RNAseq_reads/all_isoaltes_all_treatments_merged_R2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix dd_as111-rna \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --limitBAMsortRAM 107374182400 \
        --runThreadN 40 \
>& star_dd_as111.out

I had to increase the maximum number of open file descriptors. STAR required a higher limit than the current setting.

## Check the current limit
ulimit -n
# 1024

## Change limit
ulimit -n 4096
sbatch 02_STAR_RNAseq_mapping.qsh
  • Upon successful completion, STAR will generate a file named <prefix>Log.final.out with stats about mapping.
  • Here is my dd_as111-rnaLog.final.out:
                                 Started job on |	Jun 06 17:41:53
                             Started mapping on |	Jun 06 17:41:54
                                    Finished on |	Jun 07 10:02:50
       Mapping speed, Million of reads per hour |	36.67

                          Number of input reads |	599453980
                      Average input read length |	302
                                    UNIQUE READS:
                   Uniquely mapped reads number |	311262036
                        Uniquely mapped reads % |	51.92%
                          Average mapped length |	296.45
                       Number of splices: Total |	132226573
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	123994077
                       Number of splices: GC/AG |	7580649
                       Number of splices: AT/AC |	62631
               Number of splices: Non-canonical |	589216
                      Mismatch rate per base, % |	0.37%
                         Deletion rate per base |	0.01%
                        Deletion average length |	1.81
                        Insertion rate per base |	0.01%
                       Insertion average length |	1.68
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	22043572
             % of reads mapped to multiple loci |	3.68%
        Number of reads mapped to too many loci |	18994290
             % of reads mapped to too many loci |	3.17%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	246789245
                 % of reads unmapped: too short |	41.17%
                Number of reads unmapped: other |	364837
                     % of reads unmapped: other |	0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%
04. BRAKER3 (v3.0.8)

Isaac Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/04_BRAKER

  • BRAKER3 will perform gene prediction (structural annotation) using the masked genome and the BAM file generated by STAR.

Input files for BRAKER3:

cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/dd_as111-rnaAligned.sortedByCoord.out.bam .
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/dd_as111_100x_final_nu_final_mt_combined.fasta.masked .
  • I renamed the mitochondrial scaffold in dd_as111_100x_final_nu_final_mt_combined.fasta.masked to just "scaffold_mt" because I eventually found out that GeneMark-ETP will throw an error saying it can't find the mitochondrial scaffold when the old naming convention is used (I had initially left it as whatever oatk listed it as). One way to do the rename is sketched below.
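A minimal sketch of the header rename with sed (OLD_MT_HEADER is a placeholder for whatever oatk originally named the contig; check it with grep first):

grep '>' dd_as111_100x_final_nu_final_mt_combined.fasta.masked
sed -i 's/^>OLD_MT_HEADER.*/>scaffold_mt/' dd_as111_100x_final_nu_final_mt_combined.fasta.masked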

Now moving over to Sphinx:

Sphinx Directory: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER

chmod 775 dd_as111-rnaAligned.sortedByCoord.out.bam
chmod 775 dd_as111_100x_final_nu_final_mt_combined.fasta.masked
  1. Download the OrthoDB protein database for fungi.
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Fungi.fa.gz

#gunzip
gunzip Fungi.fa.gz
  2. Set the path for BRAKER and AUGUSTUS config files
export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
  3. Set path for the AUGUSTUS config file in the singularity interactive shell
singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

#Exit the interactive shell
Ctrl + D
  4. Run BRAKER3
nano run_braker.sh
mkdir braker_outputs

singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=dd_as111_100x_final_nu_final_mt_combined.fasta.masked \
        --bam=dd_as111-rnaAligned.sortedByCoord.out.bam \
        --prot_seq=Fungi.fa \
        --workingdir=braker_outputs \
        --threads 20 \
        --fungus \
        --useexisting \
        --gff3 \
        --AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
        --species=Discula_destructiva
cpulimit -l 2000 -i bash run_braker.sh
  • I'm using cpulimit because BRAKER uses way more CPU than you specify. I installed cpulimit on pickett_shared, so you can copy the executable to your personal bin directory with: cp /pickett_shared/software/cpulimit/src/cpulimit /home/user/bin
  • BRAKER3 (for me, with a 44 GB BAM file and a 46 Mb soft-masked genome) took ~3 hrs to complete.
  • The main files will be braker.gff3, braker.aa, and braker.codingseq.

Rename the output files to be slightly more descriptive:

mv braker.aa dd_aa-proteins.fasta
mv braker.codingseq dd_genes.fasta
mv braker.gff3 dd.gff3
  5. Check the stats on the gff3 file
cat dd.gff3 | awk '{a[$3]++}END{for(k in a){print k,a[k]}}'
mRNA 11480
exon 29610
CDS 29610
intron 18130
gene 10373
start_codon 11473
stop_codon 11473
  6. Run BUSCO on the protein fasta file with the ascomycota and sordariomycetes databases.

The protein file is located here:

/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs/dd_aa-proteins.fasta

Run BUSCO on the protein fasta file for the ascomycota database:

cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs
singularity exec -B $PWD /sphinx_local/images/ezlabgva-busco-v5.6.1_cv1.img busco -i dd_aa-proteins.fasta -m proteins -l ascomycota -c 10 -o busco_ascomycota_results

---------------------------------------------------
|Results from dataset ascomycota_odb10             |
---------------------------------------------------
|C:97.4%[S:89.4%,D:8.0%],F:0.4%,M:2.2%,n:1706      |
|1663    Complete BUSCOs (C)                       |
|1526    Complete and single-copy BUSCOs (S)       |
|137    Complete and duplicated BUSCOs (D)         |
|7    Fragmented BUSCOs (F)                        |
|36    Missing BUSCOs (M)                          |
|1706    Total BUSCO groups searched               |
---------------------------------------------------

Run BUSCO on the protein fasta file for the sordariomycetes database:

cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs
singularity exec -B $PWD /sphinx_local/images/ezlabgva-busco-v5.6.1_cv1.img busco -i dd_aa-proteins.fasta -m proteins -l sordariomycetes -c 10 -o busco_sordariomycetes_results
---------------------------------------------------
|Results from dataset sordariomycetes_odb10        |
---------------------------------------------------
|C:95.5%[S:85.6%,D:9.9%],F:0.4%,M:4.1%,n:3817      |
|3647    Complete BUSCOs (C)                       |
|3268    Complete and single-copy BUSCOs (S)       |
|379    Complete and duplicated BUSCOs (D)         |
|16    Fragmented BUSCOs (F)                       |
|154    Missing BUSCOs (M)                         |
|3817    Total BUSCO groups searched               |
---------------------------------------------------
05. EnTAP (v1.0.0)

  • Documentation: https://github.com/harta55/EnTAP
  • Sphinx Directory: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP
  • EnTAP will perform functional annotation of the protein fasta we made using BRAKER3
  1. Copy the protein fasta file to the working directory
cp ../04_BRAKER/braker_outputs/dd_aa-proteins.fasta .
  2. Load required dependencies
spack load [email protected]%[email protected]
spack load rsem
spack load interproscan
spack load transdecoder
  3. Run EnTAP
nano run_entap.sh
/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i dd_aa-proteins.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 10
bash run_entap.sh

EnTAP will produce a number of final files in final_results

  • annotated_without_contam.faa: FASTA-formatted amino acid / protein file

  • annotated_without_contam_gene_ontology_terms.tsv: Tab-delimited file that can be used for Gene Enrichment. Columns are as follows: Sequence ID, Gene Ontology Term ID, Gene Ontology Term, Gene Ontology Category, and Effective Length.

  • annotated_without_contam.tsv

  4. Calculate the number of genes that were annotated using EnTAP
grep '>' annotated_without_contam.faa | wc -l
  • Results: 10505
  • Genes originally annotated using BRAKER3: 10373
  5. Calculate the percentage of genes retained from structural annotation (BRAKER3) to functional annotation (EnTAP):

(BRAKER3 gene count) / (EnTAP gene count) *100 => (10373 / 10505) *100 = 98.74%
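The same arithmetic as a one-liner, in case the counts change on a re-run:

awk 'BEGIN {printf "%.2f%%\n", 10373 / 10505 * 100}'
# 98.74%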

Some more important stats:

nano log_file_2024Y6M7D-16h38m4s.txt
------------------------------------------------------
Transcriptome Statistics
------------------------------------------------------
Protein sequences found
Total sequences: 11480
Total length of transcriptome(bp): 17202825
Average sequence length(bp): 1498.00
n50: 1770
n90: 852
Longest sequence(bp): 14823 (g1868.t1)
Shortest sequence(bp): 42 (g2578.t1)

------------------------------------------------------
Final Annotation Statistics
------------------------------------------------------
Total Input Sequences: 11480
Similarity Search
	Total unique sequences with an alignment: 6836 (59.55% of total input sequences)
		Total alignments flagged as a contaminant: 0 (0.00% of total unique alignments)
		Total alignments NOT flagged as a contaminant: 6836 (100.00% of total unique alignments)
	Total unique sequences without an alignment: 4644 (40.45% of total input sequences)
Gene Families
	Total unique sequences with family assignment: 10504 (91.50% of total input sequences)
	Total unique sequences without family assignment: 976 (8.50% of total input sequences)
	Total unique sequences with at least one GO term: 8330 (72.56% of total input sequences)
	Total unique sequences with at least one pathway (KEGG) assignment: 2725 (23.74% of total input sequences)
Totals
	Total retained sequences (after filtering and/or frame selection): 11480
	Total unique sequences annotated (similarity search alignments only): 1 (0.01% of total retained)
	Total unique sequences annotated (gene family assignment only): 3669 (31.96% of total retained)
	Total unique sequences annotated (gene family and/or similarity search): 10505 (91.51% of total retained)
		Total alignments flagged as a contaminant (gene family and/or similarity search): 0 (0.00% of total unique alignments)
		Total alignments NOT flagged as a contaminant (gene family and/or similarity search): 6836 (100.00% of total unique alignments)
	Total unique sequences unannotated (gene family and/or similarity search): 975 (8.49% of total retained)

Some other important files/paths:

  • final report from EnTAP: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/entap_results.tsv
  • final transcriptome assembly: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/transcriptomes/dd_aa-proteins_final.fasta
  • annotated sequences tsv file that were not flagged as a contaminant: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam.tsv
  • annotated sequences fasta file that were not flagged as a contaminant: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam.faa
  • annotated sequences tsv file that were not flagged as a contaminant and can be used for Gene Enrichment: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam_gene_ontology_terms.tsv
  • various gene ontology figures produced from EggNOG: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/ontology/EggNOG_DMND/figures
06. GO/KEGG Terms

  • Working directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/06_GO_terms
  • I wanted to create a spreadsheet for all of the GO terms and KEGG terms that were annotated from EnTAP.
  1. Copy over the annotated_without_contam_gene_ontology_terms.tsv & entap_results.tsv files.
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam_gene_ontology_terms.tsv .
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/entap_results.tsv .
  2. Create spreadsheets. I used awk and ChatGPT to help me pull out all of the gene IDs and GO/KEGG terms.
#GO file creation
awk 'BEGIN { FS = OFS = "\t" } { for (i = 1; i <= NF; i++) { printf "%s%s", $i, (i % 4 ? OFS : ORS) } }' annotated_without_contam_gene_ontology_terms.tsv > dd-GO.tsv

#KEGG file creation
awk -F'\t' 'BEGIN {OFS="\t"} {print $1, $32}' entap_results.tsv > dd-KEGG.tsv
  • this produced 2 files:
dd-GO.tsv
dd-KEGG.tsv
07. SignalP (v5)

  1. Install SignalP.
  • To install, you have to submit a download request on the SignalP website (the download link arrives by email). I just installed it in my software directory: /lustre/isaac/proj/UTK0032/sniece/software/signalp-5.0b
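A minimal sketch of the unpacking step, assuming the emailed link downloads a tarball named something like signalp-5.0b.Linux.tar.gz:

cd /lustre/isaac/proj/UTK0032/sniece/software
tar -xzvf signalp-5.0b.Linux.tar.gz
# the executable referenced below ends up at signalp-5.0b/bin/signalp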
  2. Run SignalP.
#!/bin/bash
#SBATCH --job-name=run_signalp
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=short
#SBATCH --qos=short
#SBATCH --time=03:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

#Link executable here because pathing directly to it doesn't work...
ln -s /lustre/isaac/proj/UTK0032/sniece/software/signalp-5.0b/bin/signalp .

# Run signalp
./signalp \
        -fasta dd_aa-proteins.fasta \
        -format long \
        -gff3 \
        -mature \
        -org euk \
        -plot png \
        -prefix dd_signalp


#  -batch int
#       Number of sequences that the tool will run simultaneously. Decrease or increase size depending on your system memory. (default 10000)
#  -fasta string
#       Input file in fasta format.
#  -format string
#       Output format. 'long' for generating the predictions with plots, 'short' for the predictions without plots. (default "short")
#  -gff3
#       Make gff3 file of processed sequences.
#  -mature
#       Make fasta file with mature sequence. (this will contain only those proteins that are part of the secretome)
#  -org string
#       Organism. Archaea: 'arch', Gram-positive: 'gram+', Gram-negative: 'gram-' or Eukarya: 'euk' (default "euk")
#  -plot string
#       Plots output format. When long output selected, choose between 'png', 'eps' or 'none' to get just a tabular file. (default "png")
#  -prefix string
#       Output files prefix. (default "Input file prefix")
#  -stdout
#       Write the prediction summary to the STDOUT.
#  -tmp string
#       Specify temporary file directory. (default "System default tmpdir")
#  -verbose
#       Verbose output. Specify '-verbose=false' to avoid printing. (default true)
#  -version
#       Prints version.
  • This will produce a few files with your predicted secretome and a spreadsheet of predicted signal peptides.
08. CAZyme Detection using dbCAN (version 3.0)

  1. Run dbCAN3 on the web server.
  • You will get an email with the results. Mine are saved as dbCAN3_output_CAZyme_prediction.xlsx.
  • They will give you CAZyme subfamily predictions along with how many lines of evidence each predicted CAZyme has. Per the authors' recommendation, only predicted CAZymes supported by at least 2 of the 3 tools should be considered (a quick filter is sketched after the example row below):
 Gene ID  |    EC#   |   HMMER   | dbCAN_sub | DIAMOND | Signalp | #ofTools
--------- | -------- | ------------- | ------- | ----- | ------- | --
g10011.t1 | 3.2.1.78 | GH5_7(76-362) | GH5_e89 | GH5_7 | Y(1-24) | 3
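If the spreadsheet is exported/saved as a TSV, a quick filter for the better-supported predictions might look like this (dbCAN3_overview.tsv is a hypothetical filename, and #ofTools is assumed to be the last column, as in the example row above):

awk -F'\t' 'NR == 1 || $NF >= 2' dbCAN3_overview.tsv > dbCAN3_CAZymes_2plus_tools.tsv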

09. EffectorP (version 3.0)

  • Working Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/09_effectorP/EffectorP-3.0
  1. Install EffectorP
mkdir 09_effectorP
cd 09_effectorP

git clone https://github.com/JanaSperschneider/EffectorP-3.0.git
cd EffectorP-3.0
unzip weka-3-8-4.zip
  2. Secure copy your protein fasta file from Sphinx (the one produced by BRAKER3 during genome annotation) to use as input for EffectorP
scp '[email protected]:/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs/dd_aa-proteins.fasta' .
  3. Run EffectorP. I think there was some miscommunication between SLURM and locating the java executable for one of the dependencies, so I just ran the script interactively instead of submitting it through SLURM, since it isn't computationally intensive and takes very little time (it finished in 2 minutes).
python EffectorP.py \
        -f \
        -E dd_effectors \
        -o dd_effectors_output \
        -i dd_aa-proteins.fasta
  • -f: run in "fungal" mode (EffectorP is also used for oomycetes, so that's why I ran in fungal mode)
  • -E: name of new fasta file that contains only effectors
  • -o: output file name
  • -i: input protein fasta to be used

Output:

EffectorP results were saved to output file: dd_effectors_output
-----------------
11480 proteins were provided as input in the following file: dd_aa-proteins.fasta
-----------------
Number of predicted effectors: 2763
Number of predicted cytoplasmic effectors: 2392
Number of predicted apoplastic effectors: 371
-----------------
24.1 percent are predicted effectors.
20.8 percent are predicted cytoplasmic effectors.
3.2 percent are predicted apoplastic effectors.
-----------------
NOTE: EffectorP was run in fungal mode.
-----------------
  • dd_effectors_output is the tsv file that contains all the information for each individual protein.