Discula destructiva Genome Annotation

Background

  • Annotation will be run using the genome assembly that has the mitochondrial contig appended to the end of the nuclear assembly: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/13_final_assembly_stats_postgapclosing/dd_as111_100x_final_nu_final_mt_combined.fasta

Pipeline

01. RepeatModeler (version 2.0.5)

Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/01_repeatmodeler

Documentation: https://github.com/Dfam-consortium/RepeatModeler

  1. Conda install RepeatModeler
conda create -n repeatmodeler -c conda-forge -c bioconda repeatmodeler
conda activate repeatmodeler
  2. Link the assembly fasta
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/13_final_assembly_stats_postgapclosing/dd_as111_100x_final_nu_final_mt_combined.fasta .
  3. Create a database for RepeatModeler
nano 01_make_database.qsh
#!/bin/bash
#SBATCH --job-name=rm_make_database
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus
#SBATCH --qos=campus
#SBATCH --time=5:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"
conda activate repeatmodeler

BuildDatabase \
        -name discula \
        ../dd_as111_100x_final_nu_final_mt_combined.fasta
  • -name: whatever you want to name the database that you're creating
sbatch 01_make_database.qsh
  4. Run RepeatModeler
nano 02_run_repeatmodeler.qsh
#!/bin/bash
#SBATCH --job-name=run_repeatmodeler
#SBATCH --nodes=1
#SBATCH --ntasks=50
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"
conda activate repeatmodeler

RepeatModeler \
        -database discula \
        -threads 50 \
        -LTRStruct \
        >& run.out
sbatch 02_run_repeatmodeler.qsh
  • -LTRStruct: the LTR structural discovery pipeline (LTR_Harvest and LTR_retriever) gets combined with results from the RepeatScout/RECON pipeline. (This option is described in the RepeatModeler documentation, so I'm trying it.)
  • families.fa file: Consensus sequences for each family identified.
  • families.stk file: Seed alignments for each family identified.
  • rmod.log: Execution log. Useful for reproducing results.
  5. Get stats
nano 03_stats.qsh
# Step 1: Extract sequence names from the FASTA file
grep "^>" discula-families.fa | awk '{print substr($0, 2)}' > repeat_names.txt

# Step 2: Extract repeat IDs and count occurrences
awk 'BEGIN {FS = "#"} {print $2}' repeat_names.txt | awk '{a[$1]++} END {for(k in a) {print k, a[k]}}' > repeat_counts.txt
bash 03_stats.qsh
02. RepeatMasker (version 4.1.3)

Working Directory: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker

I tried to install RepeatMasker on Isaac and eventually gave up because of how long it was taking to troubleshoot. Also, a conda install is not recommended by the developers (things get left out of the install), and there are a bunch of dependencies, so I just used Sphinx. Below is everything I initially tried on Isaac before moving to Sphinx; I'm keeping it here in case it's needed in the future.

## Download and Unpack RepeatMasker
cd /lustre/isaac/proj/UTK0032/sniece/software
mkdir RepeatMasker
cd RepeatMasker
wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.6.tar.gz
tar -xzvf RepeatMasker-4.1.6.tar.gz

## Download and Install TRF
# do the following within the RepeatMasker directory
mkdir trf
cd trf
wget https://github.com/Benson-Genomics-Lab/TRF/archive/refs/tags/v4.09.1.tar.gz
tar -xzvf v4.09.1.tar.gz

## Download RMBlast search engine
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker
wget https://www.repeatmasker.org/rmblast/rmblast-2.14.1+-x64-linux.tar.gz
tar -xzvf rmblast-2.14.1+-x64-linux.tar.gz

## Download the complete DFAM database
# https://www.dfam.org/releases/Dfam_3.8/families/FamDB/README.txt has more info on which specific family db you should download based on your organism. I work with a fungus, so I'm downloading partition 0.
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries
wget https://www.dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.0.h5.gz
mv dfam38_full.0.h5.gz famdb/
gunzip famdb/dfam38_full.0.h5.gz

## Download the RepBase RepeatMasker Edition
# This seems to be a paid subscription, but we have the `RMRB.fasta` on pickett_shared, so I just secure copied that to my Isaac directory.
# In Sphinx: 
cd /pickett_shared/software/RepeatMasker/Libraries
scp *RMRB* '[email protected]:/lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries'

## Configure RepeatMasker
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker
perl ./configure
  1. Pull out all the fungi sequences from the dfam library. Per Trinity's documentation, Meg made a script to do this for White Oak on Sphinx. I'm just going to copy the entire RepeatMasker folder in pickett_shared to my own software directory in pickett_sphinx and run the command to get the fungi sequences instead of the eudicotyledons.
# In Sphinx
cd /pickett_sphinx/projects/lwy647/software
cp -r /pickett_shared/software/RepeatMasker .
cd RepeatMasker/Libraries
python3 ../famdb.py -i Dfam.h5 families --format fasta_name --include-class-in-name --ancestors --descendants 'fungi' > fungi-rm.fa
ls -lh #to make sure the file isn't empty
scp fungi-rm.fa '[email protected]:/lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries'
  2. Merge all the repeat libraries
cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
nano 01_merge_repeat_libs.sh
cat /pickett_sphinx/projects/lwy647/software/RepeatMasker/Libraries/fungi-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta ./discula-families.fa > discula_totalRepeatLib.fa
  3. Copy the genome assembly over from Isaac
## In Isaac
cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation
scp dd_as111_100x_final_nu_final_mt_combined.fasta '[email protected]:/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation'
  4. Run RepeatMasker to soft mask the genome
nano 02_run_repeatmasker.qsh
/pickett_shared/software/RepeatMasker/RepeatMasker \
        -lib discula_totalRepeatLib.fa \
        -e rmblast \
        -pa 4 \
        -nolow \
        -xsmall \
        -gff \
        ../dd_as111_100x_final_nu_final_mt_combined.fasta \
        >& dd_1.0.0_RMasker.out
screen -S RepeatMasker
bash 02_run_repeatmasker.qsh
  5. Get % of genome that was soft masked
cp /home/lwy647/scripts/calcPercentMasked_Chr-vs-Scaff.py /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
nano 03_calcPercentMasked_Chr-vs-Scaff.sh
python calcPercentMasked_Chr-vs-Scaff.py ../dd_as111_100x_final_nu_final_mt_combined.fasta.masked

kept getting "divide by zero" error but let's look at the dd_as111_100x_final_nu_final_mt_combined.fasta.tbl

==================================================
file name: dd_as111_100x_final_nu_final_mt_combined.fasta
sequences:             9
total length:   46887432 bp  (46887432 bp excl N/X-runs)
GC level:         49.89 %
bases masked:    7212922 bp ( 15.38 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements         4745      5821510 bp   12.42 %
   SINEs:               15          622 bp    0.00 %
   Penelope            125         6389 bp    0.01 %
   LINEs:             1376       360136 bp    0.77 %
    CRE/SLACS           12          622 bp    0.00 %
     L2/CR1/Rex         98         6357 bp    0.01 %
     R1/LOA/Jockey     290        17415 bp    0.04 %
     R2/R4/NeSL         42         3131 bp    0.01 %
     RTE/Bov-B          58         3663 bp    0.01 %
     L1/CIN4           114         6591 bp    0.01 %
   LTR elements:      3354      5460752 bp   11.65 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia        1385      2964934 bp    6.32 %
     Gypsy/DIRS1       999      2184402 bp    4.66 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       2611       530880 bp    1.13 %
   hobo-Activator      488        26366 bp    0.06 %
   Tc1-IS630-Pogo      591       422869 bp    0.90 %
   En-Spm                0            0 bp    0.00 %
   MULE-MuDR           213        11588 bp    0.02 %
   PiggyBac             19          895 bp    0.00 %
   Tourist/Harbinger    96         5129 bp    0.01 %
   Other (Mirage,       19         1073 bp    0.00 %
    P-element, Transib)

Rolling-circles        204        12753 bp    0.03 %

Unclassified:         4716       778781 bp    1.66 %

Total interspersed repeats:     7131171 bp   15.21 %


Small RNA:              83        61588 bp    0.13 %

Satellites:             88         5929 bp    0.01 %
Simple repeats:         25         1895 bp    0.00 %
Low complexity:          0            0 bp    0.00 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


RepeatMasker version 4.1.3-p1 , default mode

run with rmblastn version 2.14.0+
The query was compared to classified sequences in "discula_totalRepeatLib.fa"
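Since the Python script kept failing, here is a minimal awk sketch for computing the overall soft-masked fraction directly from the masked fasta (assuming lowercase a/c/g/t/n mark the soft-masked bases, which is what -xsmall produces). The number should land close to the "bases masked" percentage reported in the .tbl above.

awk '!/^>/ {total += length($0); masked += gsub(/[acgtn]/, "")} END {printf "%.2f%% of %d bp soft masked\n", 100 * masked / total, total}' ../dd_as111_100x_final_nu_final_mt_combined.fasta.masked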
03. STAR (version 2.7.11b)

Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR

  1. Download STAR
cd /lustre/isaac/proj/UTK0032/sniece/software
mkdir STAR
cd STAR
wget https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz
tar -xzf 2.7.11b.tar.gz
  • Path to STAR executable: /lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR
  2. Link all the RNAseq reads for AS111 to the working directory and cat them all into R1 & R2
cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/AS111_RNAseq_reads
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_RNAseq/analysis/01_read_renaming/*AS111*.fastq.gz .

Now merge all the R1 files together and do the same for the R2 files.

cat *R1.fastq.gz > AS111_merged_R1.fastq.gz
cat *R2.fastq.gz > AS111_merged_R2.fastq.gz

Now remove the symlinked fastq files since they're no longer needed.

rm *_combined_*
  3. Index the soft masked genome

First, make a directory to hold the indexed genome

cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR
mkdir dd_genome_index

Next, secure copy the masked genome assembly from Sphinx to the current directory in Isaac.

# In Sphinx:
cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
scp dd_as111_100x_final_nu_final_mt_combined.fasta.masked '[email protected]:/lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR'

Next, index the genome using STAR

nano 01_index_genome.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
        --runMode genomeGenerate \
        --genomeDir dd_genome_index \
        --genomeSAindexNbases 11 \
        --genomeFastaFiles dd_as111_100x_final_nu_final_mt_combined.fasta.masked \
        --runThreadN 40
  • --genomeSAindexNbases: min(14, log2(genome length) / 2 - 1) => log2(47,000,000) / 2 - 1 ≈ 11.7, so 11 (a quick check is sketched after the sbatch command below)
sbatch 01_index_genome.qsh
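To double-check --genomeSAindexNbases for a different assembly, the formula above can be evaluated with a one-liner (46,887,432 bp is the total length reported in the RepeatMasker .tbl above; swap in your own genome length):

awk 'BEGIN {v = log(46887432) / log(2) / 2 - 1; if (v > 14) v = 14; print int(v)}'
# prints 11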
  4. STAR Mapping RNAseq Data
nano 02_STAR_RNAseq_mapping.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=150G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
        --genomeDir dd_genome_index \
        --readFilesIn ./AS111_RNAseq_reads/AS111_merged_R1.fastq.gz ./AS111_RNAseq_reads/AS111_merged_R2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix dd_as111-rna \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --limitBAMsortRAM 107374182400 \
        --runThreadN 40 \
>& star_dd_as111.out

I had to increase the maximum number of open file descriptors. STAR required a higher limit than the current setting.

## Check the current limit
ulimit -n
# 1024

## Change limit
ulimit -n 4096
sbatch 02_STAR_RNAseq_mapping.qsh
  • Upon successful completion, STAR will generate a file named <prefix>Log.final.out with stats about mapping.
  • Here is my dd_as111-rnaLog.final.out:
                                 Started job on |	Jun 05 10:42:04
                             Started mapping on |	Jun 05 10:42:04
                                    Finished on |	Jun 05 11:46:07
       Mapping speed, Million of reads per hour |	183.49

                          Number of input reads |	195877513
                      Average input read length |	302
                                    UNIQUE READS:
                   Uniquely mapped reads number |	166801805
                        Uniquely mapped reads % |	85.16%
                          Average mapped length |	297.49
                       Number of splices: Total |	77102939
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	72226779
                       Number of splices: GC/AG |	4543990
                       Number of splices: AT/AC |	24329
               Number of splices: Non-canonical |	307841
                      Mismatch rate per base, % |	0.33%
                         Deletion rate per base |	0.01%
                        Deletion average length |	1.54
                        Insertion rate per base |	0.00%
                       Insertion average length |	1.37
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	2002023
             % of reads mapped to multiple loci |	1.02%
        Number of reads mapped to too many loci |	1006559
             % of reads mapped to too many loci |	0.51%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	26036620
                 % of reads unmapped: too short |	13.29%
                Number of reads unmapped: other |	30506
                     % of reads unmapped: other |	0.02%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%

All of the above commands and outputs for STAR were placed into a directory named AS111_star_outputs at /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR. I eventually found out during BRAKER3 that the RNAseq reads from the AS111 isolate alone were not enough, so I concatenated the R1 reads from every isolate/treatment (and did the same for R2) to provide enough RNAseq data for BRAKER3; a sketch of the merge commands is below.
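The merge commands for the all-isolate files were not captured above; a minimal sketch of how they could be produced (the all_isolates_RNAseq_reads directory and the *R1/*R2 globs are assumptions based on the AS111 merge in step 2, and the output names match the STAR command in the next step, typo and all):

cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR
mkdir all_isolates_RNAseq_reads
cd all_isolates_RNAseq_reads
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_RNAseq/analysis/01_read_renaming/*.fastq.gz .
cat *R1.fastq.gz > all_isoaltes_all_treatments_merged_R1.fastq.gz
cat *R2.fastq.gz > all_isoaltes_all_treatments_merged_R2.fastq.gz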

  5. Re-run STAR with all RNAseq data
nano 02_STAR_RNAseq_mapping.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=150G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
        --genomeDir dd_genome_index \
        --readFilesIn /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/all_isolates_RNAseq_reads/all_isoaltes_all_treatments_merged_R1.fastq.gz /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/all_isolates_RNAseq_reads/all_isoaltes_all_treatments_merged_R2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix dd_as111-rna \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --limitBAMsortRAM 107374182400 \
        --runThreadN 40 \
>& star_dd_as111.out

I had to increase the maximum number of open file descriptors. STAR required a higher limit than the current setting.

## Check the current limit
ulimit -n
# 1024

## Change limit
ulimit -n 4096
sbatch 02_STAR_RNAseq_mapping.qsh
  • Upon successful completion, STAR will generate a file named <prefix>Log.final.out with stats about mapping.
  • Here is my dd_as111-rnaLog.final.out:
                                 Started job on |	Jun 06 17:41:53
                             Started mapping on |	Jun 06 17:41:54
                                    Finished on |	Jun 07 10:02:50
       Mapping speed, Million of reads per hour |	36.67

                          Number of input reads |	599453980
                      Average input read length |	302
                                    UNIQUE READS:
                   Uniquely mapped reads number |	311262036
                        Uniquely mapped reads % |	51.92%
                          Average mapped length |	296.45
                       Number of splices: Total |	132226573
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	123994077
                       Number of splices: GC/AG |	7580649
                       Number of splices: AT/AC |	62631
               Number of splices: Non-canonical |	589216
                      Mismatch rate per base, % |	0.37%
                         Deletion rate per base |	0.01%
                        Deletion average length |	1.81
                        Insertion rate per base |	0.01%
                       Insertion average length |	1.68
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	22043572
             % of reads mapped to multiple loci |	3.68%
        Number of reads mapped to too many loci |	18994290
             % of reads mapped to too many loci |	3.17%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |	0
       % of reads unmapped: too many mismatches |	0.00%
            Number of reads unmapped: too short |	246789245
                 % of reads unmapped: too short |	41.17%
                Number of reads unmapped: other |	364837
                     % of reads unmapped: other |	0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%
04. BRAKER3 (v3.0.8)

Isaac Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/04_BRAKER

  • BRAKER3 will perform gene prediction (structural annotation) using the masked genome and the BAM file generated by STAR.

Input files for BRAKER3:

cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/dd_as111-rnaAligned.sortedByCoord.out.bam .
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/dd_as111_100x_final_nu_final_mt_combined.fasta.masked .
  • I renamed the mitochondrial scaffold in dd_as111_100x_final_nu_final_mt_combined.fasta.masked to just "scaffold_mt" because I eventually found out that GeneMark-ETP will throw an error saying it can't find the mitochondrial scaffold when the old naming convention is used (I had initially left it as whatever oatk listed it as). One way to do the rename is sketched below.
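A minimal sketch of the header rename with sed (OLD_MT_HEADER is a placeholder for whatever oatk originally named the contig; check it with grep first):

grep '>' dd_as111_100x_final_nu_final_mt_combined.fasta.masked
sed -i 's/^>OLD_MT_HEADER.*/>scaffold_mt/' dd_as111_100x_final_nu_final_mt_combined.fasta.masked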

Now moving over to Sphinx:

Sphinx Directory: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER

chmod 775 dd_as111-rnaAligned.sortedByCoord.out.bam
chmod 775 dd_as111_100x_final_nu_final_mt_combined.fasta.masked
  1. Download the OrthoDB protein database for fungi.
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Fungi.fa.gz

#gunzip
gunzip Fungi.fa.gz
  2. Set the path for BRAKER and AUGUSTUS config files
export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
  3. Set path for the AUGUSTUS config file in the singularity interactive shell
singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

#Exit the interactive shell
Ctrl + D
  4. Run BRAKER3
nano run_braker.sh
mkdir braker_outputs

singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=dd_as111_100x_final_nu_final_mt_combined.fasta.masked \
        --bam=dd_as111-rnaAligned.sortedByCoord.out.bam \
        --prot_seq=Fungi.fa \
        --workingdir=braker_outputs \
        --threads 20 \
        --fungus \
        --useexisting \
        --gff3 \
        --AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
        --species=Discula_destructiva
cpulimit -l 2000 -i bash run_braker.sh
  • I'm using cpulimit because BRAKER uses way more CPU than you specify. I installed cpulimit on pickett_shared, so you can copy the executable to your personal bin directory with: cp /pickett_shared/software/cpulimit/src/cpulimit /home/user/bin
  • BRAKER3 (for me, with a 44 GB BAM file and a 46 Mb soft-masked genome) took ~3 hrs to complete.
  • The main files will be braker.gff3, braker.aa, and braker.codingseq.

Rename the output files to be slightly more descriptive:

mv braker.aa dd_aa-proteins.fasta
mv braker.codingseq dd_genes.fasta
mv braker.gff3 dd.gff3
  5. Check the stats on the gff3 file
cat dd.gff3 | awk '{a[$3]++}END{for(k in a){print k,a[k]}}'
mRNA 11480
exon 29610
CDS 29610
intron 18130
gene 10373
start_codon 11473
stop_codon 11473
  6. Run BUSCO on the protein fasta file with the ascomycota and sordariomycetes databases.

The protein file is located here:

/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs/dd_aa-proteins.fasta

Run BUSCO on the protein fasta file for the ascomycota database:

cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs
singularity exec -B $PWD /sphinx_local/images/ezlabgva-busco-v5.6.1_cv1.img busco -i dd_aa-proteins.fasta -m proteins -l ascomycota -c 10 -o busco_ascomycota_results

---------------------------------------------------
|Results from dataset ascomycota_odb10             |
---------------------------------------------------
|C:97.4%[S:89.4%,D:8.0%],F:0.4%,M:2.2%,n:1706      |
|1663    Complete BUSCOs (C)                       |
|1526    Complete and single-copy BUSCOs (S)       |
|137    Complete and duplicated BUSCOs (D)         |
|7    Fragmented BUSCOs (F)                        |
|36    Missing BUSCOs (M)                          |
|1706    Total BUSCO groups searched               |
---------------------------------------------------

Run BUSCO on the protein fasta file for the sordariomycetes database:

cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs
singularity exec -B $PWD /sphinx_local/images/ezlabgva-busco-v5.6.1_cv1.img busco -i dd_aa-proteins.fasta -m proteins -l sordariomycetes -c 10 -o busco_sordariomycetes_results
---------------------------------------------------
|Results from dataset sordariomycetes_odb10        |
---------------------------------------------------
|C:95.5%[S:85.6%,D:9.9%],F:0.4%,M:4.1%,n:3817      |
|3647    Complete BUSCOs (C)                       |
|3268    Complete and single-copy BUSCOs (S)       |
|379    Complete and duplicated BUSCOs (D)         |
|16    Fragmented BUSCOs (F)                       |
|154    Missing BUSCOs (M)                         |
|3817    Total BUSCO groups searched               |
---------------------------------------------------
05. EnTAP (v1.0.0)

  • Documentation: https://github.com/harta55/EnTAP
  • Sphinx Directory: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP
  • EnTAP will perform functional annotation of the protein fasta we made using BRAKER3
  1. Copy the protein fasta file to the working directory
cp ../04_BRAKER/braker_outputs/dd_aa-proteins.fasta .
  2. Load required dependencies
spack load [email protected]%[email protected]
spack load rsem
spack load interproscan
spack load transdecoder
  3. Run EnTAP
nano run_entap.sh
/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i dd_aa-proteins.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 10
bash run_entap.sh

EnTAP will produce a number of final files in final_results

  • annotated_without_contam.faa: FASTA-formatted amino acid / protein file

  • annotated_without_contam_gene_ontology_terms.tsv: Tab-delimited file that can be used for Gene Enrichment. Columns are as follows: Sequence ID, Gene Ontology Term ID, Gene Ontology Term, Gene Ontology Category, and Effective Length.

  • annotated_without_contam.tsv

  4. Calculate the number of genes that were annotated using EnTAP
grep '>' annotated_without_contam.faa | wc -l
  • Results: 10505
  • Genes originally annotated using BRAKER3: 10373
  5. Calculate the percentage of genes retained from structural annotation (BRAKER3) to functional annotation (EnTAP):

(BRAKER3 gene count) / (EnTAP gene count) *100 => (10373 / 10505) *100 = 98.74%
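The same arithmetic as a one-liner, in case the counts change on a re-run:

awk 'BEGIN {printf "%.2f%%\n", 10373 / 10505 * 100}'
# 98.74%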

Some more important stats:

nano log_file_2024Y6M7D-16h38m4s.txt
------------------------------------------------------
Transcriptome Statistics
------------------------------------------------------
Protein sequences found
Total sequences: 11480
Total length of transcriptome(bp): 17202825
Average sequence length(bp): 1498.00
n50: 1770
n90: 852
Longest sequence(bp): 14823 (g1868.t1)
Shortest sequence(bp): 42 (g2578.t1)

------------------------------------------------------
Final Annotation Statistics
------------------------------------------------------
Total Input Sequences: 11480
Similarity Search
	Total unique sequences with an alignment: 6836 (59.55% of total input sequences)
		Total alignments flagged as a contaminant: 0 (0.00% of total unique alignments)
		Total alignments NOT flagged as a contaminant: 6836 (100.00% of total unique alignments)
	Total unique sequences without an alignment: 4644 (40.45% of total input sequences)
Gene Families
	Total unique sequences with family assignment: 10504 (91.50% of total input sequences)
	Total unique sequences without family assignment: 976 (8.50% of total input sequences)
	Total unique sequences with at least one GO term: 8330 (72.56% of total input sequences)
	Total unique sequences with at least one pathway (KEGG) assignment: 2725 (23.74% of total input sequences)
Totals
	Total retained sequences (after filtering and/or frame selection): 11480
	Total unique sequences annotated (similarity search alignments only): 1 (0.01% of total retained)
	Total unique sequences annotated (gene family assignment only): 3669 (31.96% of total retained)
	Total unique sequences annotated (gene family and/or similarity search): 10505 (91.51% of total retained)
		Total alignments flagged as a contaminant (gene family and/or similarity search): 0 (0.00% of total unique alignments)
		Total alignments NOT flagged as a contaminant (gene family and/or similarity search): 6836 (100.00% of total unique alignments)
	Total unique sequences unannotated (gene family and/or similarity search): 975 (8.49% of total retained)

Some other important files/paths:

  • final report from EnTAP: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/entap_results.tsv
  • final transcriptome assembly: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/transcriptomes/dd_aa-proteins_final.fasta
  • annotated sequences tsv file that were not flagged as a contaminant: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam.tsv
  • annotated sequences fasta file that were not flagged as a contaminant: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam.faa
  • annotated sequences tsv file that were not flagged as a contaminant and can be used for Gene Enrichment: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam_gene_ontology_terms.tsv
  • various gene ontology figures produced from EggNOG: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/ontology/EggNOG_DMND/figures
06. GO/KEGG Terms

  • Working directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/06_GO_terms
  • I wanted to create a spreadsheet for all of the GO terms and KEGG terms that were annotated from EnTAP.
  1. Copy over the annotated_without_contam_gene_ontology_terms.tsv & entap_results.tsv files.
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam_gene_ontology_terms.tsv .
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/entap_results.tsv .
  2. Create spreadsheets. I used awk and ChatGPT to help me pull out all of the gene IDs and GO/KEGG terms.
#GO file creation
awk 'BEGIN { FS = OFS = "\t" } { for (i = 1; i <= NF; i++) { printf "%s%s", $i, (i % 4 ? OFS : ORS) } }' annotated_without_contam_gene_ontology_terms.tsv > dd-GO.tsv

#KEGG file creation
awk -F'\t' 'BEGIN {OFS="\t"} {print $1, $32}' entap_results.tsv > dd-KEGG.tsv
  • this produced 2 files:
dd-GO.tsv
dd-KEGG.tsv
07. SignalP (v5)

  1. Install SignalP.
  • To install, you have to submit a download request on the SignalP website (the download link arrives by email). I just installed it in my software directory: /lustre/isaac/proj/UTK0032/sniece/software/signalp-5.0b
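A minimal sketch of the unpacking step, assuming the emailed link downloads a tarball named something like signalp-5.0b.Linux.tar.gz:

cd /lustre/isaac/proj/UTK0032/sniece/software
tar -xzvf signalp-5.0b.Linux.tar.gz
# the executable referenced below ends up at signalp-5.0b/bin/signalp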
  2. Run SignalP.
#!/bin/bash
#SBATCH --job-name=run_signalp
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=short
#SBATCH --qos=short
#SBATCH --time=03:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

#Link executable here because pathing directly to it doesn't work...
ln -s /lustre/isaac/proj/UTK0032/sniece/software/signalp-5.0b/bin/signalp .

# Run signalp
./signalp \
        -fasta dd_aa-proteins.fasta \
        -format long \
        -gff3 \
        -mature \
        -org euk \
        -plot png \
        -prefix dd_signalp


#  -batch int
#       Number of sequences that the tool will run simultaneously. Decrease or increase size depending on your system memory. (default 10000)
#  -fasta string
#       Input file in fasta format.
#  -format string
#       Output format. 'long' for generating the predictions with plots, 'short' for the predictions without plots. (default "short")
#  -gff3
#       Make gff3 file of processed sequences.
#  -mature
#       Make fasta file with mature sequence. (this will contain only those proteins that are part of the secretome)
#  -org string
#       Organism. Archaea: 'arch', Gram-positive: 'gram+', Gram-negative: 'gram-' or Eukarya: 'euk' (default "euk")
#  -plot string
#       Plots output format. When long output selected, choose between 'png', 'eps' or 'none' to get just a tabular file. (default "png")
#  -prefix string
#       Output files prefix. (default "Input file prefix")
#  -stdout
#       Write the prediction summary to the STDOUT.
#  -tmp string
#       Specify temporary file directory. (default "System default tmpdir")
#  -verbose
#       Verbose output. Specify '-verbose=false' to avoid printing. (default true)
#  -version
#       Prints version.
  • This will produce a few files with your predicted secretome and a spreadsheet of predicted signal peptides.
08. CAZyme Detection using dbCAN (version 3.0)

  1. Run dbCAN3 on the web server.
  • You will get an email with the results. Mine are saved as dbCAN3_output_CAZyme_prediction.xlsx.
  • They will give you CAZyme subfamily predictions along with how many lines of evidence each predicted CAZyme has. Per the authors' recommendation, only predicted CAZymes supported by at least 2 of the 3 tools should be considered (a quick filter is sketched after the example row below):
 Gene ID  |    EC#   |   HMMER   | dbCAN_sub | DIAMOND | Signalp | #ofTools
--------- | -------- | ------------- | ------- | ----- | ------- | --
g10011.t1 | 3.2.1.78 | GH5_7(76-362) | GH5_e89 | GH5_7 | Y(1-24) | 3
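If the spreadsheet is exported/saved as a TSV, a quick filter for the better-supported predictions might look like this (dbCAN3_overview.tsv is a hypothetical filename, and #ofTools is assumed to be the last column, as in the example row above):

awk -F'\t' 'NR == 1 || $NF >= 2' dbCAN3_overview.tsv > dbCAN3_CAZymes_2plus_tools.tsv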

09. EffectorP (version 3.0)

  • Working Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/09_effectorP/EffectorP-3.0
  1. Install EffectorP
mkdir 09_effectorP
cd 09_effectorP

git clone https://github.com/JanaSperschneider/EffectorP-3.0.git
cd EffectorP-3.0
unzip weka-3-8-4.zip
  2. Secure copy your protein fasta file from Sphinx (the one produced by BRAKER3 during genome annotation) to use as input for EffectorP
scp '[email protected]:/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs/dd_aa-proteins.fasta' .
  3. Run EffectorP. I think there was some miscommunication between SLURM and locating the java executable for one of the dependencies, so I just ran the script interactively instead of submitting it through SLURM, since it isn't computationally intensive and takes very little time (it finished in 2 minutes).
python EffectorP.py \
        -f \
        -E dd_effectors \
        -o dd_effectors_output \
        -i dd_aa-proteins.fasta
  • -f: run in "fungal" mode (EffectorP is also used for oomycetes, so that's why I ran in fungal mode)
  • -E: name of new fasta file that contains only effectors
  • -o: output file name
  • -i: input protein fasta to be used

Output:

EffectorP results were saved to output file: dd_effectors_output
-----------------
11480 proteins were provided as input in the following file: dd_aa-proteins.fasta
-----------------
Number of predicted effectors: 2763
Number of predicted cytoplasmic effectors: 2392
Number of predicted apoplastic effectors: 371
-----------------
24.1 percent are predicted effectors.
20.8 percent are predicted cytoplasmic effectors.
3.2 percent are predicted apoplastic effectors.
-----------------
NOTE: EffectorP was run in fungal mode.
-----------------
  • dd_effectors_output is the tsv file that contains all the information for each individual protein.