Discula destructiva Genome Annotation - ShadeNiece/DisculaDestructiva_GenomeAssembly-Annotation GitHub Wiki
- Annotation will be run using the genome assembly that has the mitochondrial contig appended to the end of the nuclear assembly:
/lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/13_final_assembly_stats_postgapclosing/dd_as111_100x_final_nu_final_mt_combined.fasta
01. RepeatModeler (version 2.0.5)
Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/01_repeatmodeler
Documentation: https://github.com/Dfam-consortium/RepeatModeler
- Conda install RepeatModeler
conda create -n repeatmodeler -c conda-forge -c bioconda repeatmodeler
conda activate repeatmodeler
- Link the assembly fasta
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/13_final_assembly_stats_postgapclosing/dd_as111_100x_final_nu_final_mt_combined.fasta .
- Create a database for RepeatModeler
nano 01_make_database.qsh
#!/bin/bash
#SBATCH --job-name=rm_make_database
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus
#SBATCH --qos=campus
#SBATCH --time=5:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
eval "$(conda shell.bash hook)"
conda activate repeatmodeler
BuildDatabase \
-name discula \
../dd_as111_100x_final_nu_final_mt_combined.fasta
- `-name`: whatever you want to name the database that you're creating
sbatch 01_make_database.qsh
- Run RepeatModeler
nano 02_run_repeatmodeler.qsh
#!/bin/bash
#SBATCH --job-name=rm_run_repeatmodeler
#SBATCH --nodes=1
#SBATCH --ntasks=50
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
eval "$(conda shell.bash hook)"
conda activate repeatmodeler
RepeatModeler \
-database discula \
-threads 50 \
-LTRStruct \
>& run.out
sbatch 02_run_repeatmodeler.qsh
- `-LTRStruct`: runs the LTR structural discovery pipeline (LTR_Harvest and LTR_retriever) and combines its results with those from the RepeatScout/RECON pipeline. (This was in the RepeatModeler documentation, so I'm trying it.)
- `families.fa` file: Consensus sequences for each family identified.
- `families.stk` file: Seed alignments for each family identified.
- `rmod.log`: Execution log. Useful for reproducing results.
- Get stats
nano 03_stats.qsh
# Step 1: Extract sequence names from the FASTA file
grep "^>" discula-families.fa | awk '{print substr($0, 2)}' > repeat_names.txt
# Step 2: Extract repeat IDs and count occurrences
awk 'BEGIN {FS = "#"} {print $2}' repeat_names.txt | awk '{a[$1]++} END {for(k in a) {print k, a[k]}}' > repeat_counts.txt
bash 03_stats.qsh
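The same per-class tally can be done in Python, which makes it easier to sort or reuse the counts later. This is a minimal sketch, assuming RepeatModeler-style headers of the form `>name#Class/Family ...`; the example headers are made up, and grouping headers without a `#` under "Unknown" is my own choice.

```python
from collections import Counter


def count_repeat_classes(fasta_lines):
    """Count families per classification (the text after '#' in each header)."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        header = line[1:].strip()
        _, sep, rest = header.partition("#")
        tokens = rest.split()
        # Assumption: headers with no '#classification' are grouped as "Unknown"
        repeat_class = tokens[0] if sep and tokens else "Unknown"
        counts[repeat_class] += 1
    return counts


# Made-up headers in the style RepeatModeler writes
example = [
    ">rnd-1_family-0#LTR/Gypsy ( RepeatScout Family Size = 50 )",
    "ACGT",
    ">rnd-1_family-1#Unknown ( RECON Family Size = 12 )",
    "ACGT",
    ">ltr-1_family-2#LTR/Gypsy [ Type=LTR ]",
    "ACGT",
]
for repeat_class, n in count_repeat_classes(example).most_common():
    print(repeat_class, n)
```

For the real library, pass an open file handle instead of the list, e.g. `count_repeat_classes(open("discula-families.fa"))`.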
02. RepeatMasker (version 4.1.3)
Working Directory: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
I tried to install RepeatMasker on Isaac but eventually gave up because troubleshooting was taking too long. The developers also do not recommend installing with conda (things get left out of the install), and there are a lot of dependencies, so I just used Sphinx. Everything below is what I initially tried on Isaac before moving to Sphinx; I'm keeping it here in case it's needed in the future.
## Download and Unpack RepeatMasker
cd /lustre/isaac/proj/UTK0032/sniece/software
mkdir RepeatMasker
cd RepeatMasker
wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.6.tar.gz
tar -xzvf RepeatMasker-4.1.6.tar.gz
## Download and Install TRF
# do the following within the RepeatMasker directory
mkdir trf
cd trf
wget https://github.com/Benson-Genomics-Lab/TRF/archive/refs/tags/v4.09.1.tar.gz
tar -xzvf v4.09.1.tar.gz
## Download RMBlast search engine
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker
wget https://www.repeatmasker.org/rmblast/rmblast-2.14.1+-x64-linux.tar.gz
tar -xzvf rmblast-2.14.1+-x64-linux.tar.gz
## Download the complete DFAM database
# https://www.dfam.org/releases/Dfam_3.8/families/FamDB/README.txt has more info on which specific family db you should download based on your organism. I work with a fungus, so I'm downloading partition 0.
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries
wget https://www.dfam.org/releases/Dfam_3.8/families/FamDB/dfam38_full.0.h5.gz
mv dfam38_full.0.h5.gz famdb/
gunzip famdb/dfam38_full.0.h5.gz
## Download the RepBase RepeatMasker Edition
# This seems to be a paid subscription, but we have the `RMRB.fasta` on pickett_shared, so I just secure copied that to my Isaac directory.
# In Sphinx:
cd /pickett_shared/software/RepeatMasker/Libraries
scp *RMRB* '[email protected]:/lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries'
## Configure RepeatMasker
cd /lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker
perl ./configure
- Pull out all the fungi sequences from the dfam library. Per Trinity's documentation, Meg made a script to do this for White Oak on Sphinx. I'm just going to copy the entire RepeatMasker folder in pickett_shared to my own software directory in pickett_sphinx and run the command to get the fungi sequences instead of the eudicotyledons.
# In Sphinx
cd /pickett_sphinx/projects/lwy647/software
cp -r /pickett_shared/software/RepeatMasker .
cd RepeatMasker/Libraries
python3 ../famdb.py -i Dfam.h5 families --format fasta_name --include-class-in-name --ancestors --descendants 'fungi' > fungi-rm.fa
ls -lh #to make sure the file isn't empty
scp fungi-rm.fa '[email protected]:/lustre/isaac/proj/UTK0032/sniece/software/RepeatMasker/RepeatMasker/Libraries'
- Merge all the repeat libraries
cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
nano 01_merge_repeat_libs.sh
cat /pickett_sphinx/projects/lwy647/software/RepeatMasker/Libraries/fungi-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta ./discula-families.fa > discula_totalRepeatLib.fa
- Copy the genome assembly over from Isaac
## In Isaac
cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation
scp dd_as111_100x_final_nu_final_mt_combined.fasta '[email protected]:/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation'
- Run RepeatMasker to soft mask the genome
nano 02_run_repeatmasker.qsh
/pickett_shared/software/RepeatMasker/RepeatMasker \
-lib discula_totalRepeatLib.fa \
-e rmblast \
-pa 4 \
-nolow \
-xsmall \
-gff \
../dd_as111_100x_final_nu_final_mt_combined.fasta \
>& dd_1.0.0_RMasker.out
screen -S RepeatMasker
bash 02_run_repeatmasker.qsh
- Get % of genome that was soft masked
cp /home/lwy647/scripts/calcPercentMasked_Chr-vs-Scaff.py /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
nano 03_calcPercentMasked_Chr-vs-Scaff.sh
python calcPercentMasked_Chr-vs-Scaff.py ../dd_as111_100x_final_nu_final_mt_combined.fasta.masked
I kept getting a "divide by zero" error from this script, so let's look at dd_as111_100x_final_nu_final_mt_combined.fasta.tbl instead:
==================================================
file name: dd_as111_100x_final_nu_final_mt_combined.fasta
sequences:             9
total length:   46887432 bp  (46887432 bp excl N/X-runs)
GC level:         49.89 %
bases masked:    7212922 bp ( 15.38 %)
==================================================
                   number of      length   percentage
                   elements*    occupied  of sequence
--------------------------------------------------
Retroelements           4745     5821510 bp   12.42 %
   SINEs:                 15         622 bp    0.00 %
   Penelope              125        6389 bp    0.01 %
   LINEs:               1376      360136 bp    0.77 %
     CRE/SLACS            12         622 bp    0.00 %
     L2/CR1/Rex           98        6357 bp    0.01 %
     R1/LOA/Jockey       290       17415 bp    0.04 %
     R2/R4/NeSL           42        3131 bp    0.01 %
     RTE/Bov-B            58        3663 bp    0.01 %
     L1/CIN4             114        6591 bp    0.01 %
   LTR elements:        3354     5460752 bp   11.65 %
     BEL/Pao               0           0 bp    0.00 %
     Ty1/Copia          1385     2964934 bp    6.32 %
     Gypsy/DIRS1         999     2184402 bp    4.66 %
       Retroviral          0           0 bp    0.00 %
DNA transposons         2611      530880 bp    1.13 %
   hobo-Activator        488       26366 bp    0.06 %
   Tc1-IS630-Pogo        591      422869 bp    0.90 %
   En-Spm                  0           0 bp    0.00 %
   MULE-MuDR             213       11588 bp    0.02 %
   PiggyBac               19         895 bp    0.00 %
   Tourist/Harbinger      96        5129 bp    0.01 %
   Other (Mirage,         19        1073 bp    0.00 %
    P-element, Transib)
Rolling-circles          204       12753 bp    0.03 %
Unclassified:           4716      778781 bp    1.66 %
Total interspersed repeats:      7131171 bp   15.21 %
Small RNA:                83       61588 bp    0.13 %
Satellites:               88        5929 bp    0.01 %
Simple repeats:           25        1895 bp    0.00 %
Low complexity:            0           0 bp    0.00 %
==================================================
* most repeats fragmented by insertions or deletions
  have been counted as one element
RepeatMasker version 4.1.3-p1 , default mode
run with rmblastn version 2.14.0+
The query was compared to classified sequences in "discula_totalRepeatLib.fa"
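As an independent check on the "bases masked" line, the soft-masked fraction can be counted directly from the `.masked` FASTA, since `-xsmall` writes masked bases in lowercase. This is a minimal sketch and not the `calcPercentMasked_Chr-vs-Scaff.py` script (whose error I did not diagnose); the function names and the toy sequences are my own.

```python
def masked_fraction(fasta_lines):
    """Return (masked_bases, total_bases) for a soft-masked FASTA.

    Lowercase letters are counted as masked; header lines are skipped.
    """
    masked = total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked, total


def percent_masked(fasta_lines):
    masked, total = masked_fraction(fasta_lines)
    # Return 0.0 for empty input instead of dividing by zero
    return 100.0 * masked / total if total else 0.0


# Toy input: 9 of 20 bases are lowercase
example = [">scaffold_1", "ACGTacgtAC", ">scaffold_2", "nnnnnGGGGG"]
print(f"{percent_masked(example):.2f} % masked")
```

On the real file this would be called as `percent_masked(open("dd_as111_100x_final_nu_final_mt_combined.fasta.masked"))`; the result may differ slightly from the `.tbl` value if the input already contained lowercase bases.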
03. STAR (version 2.7.11b)
Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR
- Download STAR
cd /lustre/isaac/proj/UTK0032/sniece/software
mkdir STAR
cd STAR
wget https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz
tar -xzf 2.7.11b.tar.gz
- Path to STAR executable:
/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR
- Link all the RNAseq reads for AS111 to the working directory and cat them all into R1 & R2
cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/AS111_RNAseq_reads
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_RNAseq/analysis/01_read_renaming/*AS111*.fastq.gz .
Now merge all the R1 files together and do the same for the R2 files.
cat *R1.fastq.gz > AS111_merged_R1.fastq.gz
cat *R2.fastq.gz > AS111_merged_R2.fastq.gz
Now remove the symlinked fastq files since they're no longer needed
rm *_combined_*
- Index soft masked genome
First, make a directory to hold the indexed genome
cd /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR
mkdir dd_genome_index
Next, secure copy the masked genome assembly from Sphinx to the current directory in Isaac.
# In Sphinx:
cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/02_RepeatMasker
scp dd_as111_100x_final_nu_final_mt_combined.fasta.masked '[email protected]:/lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR'
Next, index the genome using STAR
nano 01_index_genome.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
--runMode genomeGenerate \
--genomeDir dd_genome_index \
--genomeSAindexNbases 11 \
--genomeFastaFiles dd_as111_100x_final_nu_final_mt_combined.fasta.masked \
--runThreadN 40
- `--genomeSAindexNbases`: for small genomes the STAR manual recommends min(14, log2(genome_length) / 2 - 1) => log2(47,000,000) / 2 - 1 = 11.74, rounded down to 11
sbatch 01_index_genome.qsh
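The `--genomeSAindexNbases` value can be recomputed for any assembly size with a few lines of Python. This sketch uses the STAR manual's formula for small genomes, min(14, log2(GenomeLength)/2 - 1), rounded down; the function name is my own, and the genome length is the total length from the RepeatMasker table.

```python
import math


def sa_index_nbases(genome_length):
    """STAR manual suggestion for small genomes, rounded down to an integer."""
    return min(14, int(math.log2(genome_length) / 2 - 1))


# Total assembly length reported in the RepeatMasker .tbl file
print(sa_index_nbases(46_887_432))  # prints 11
```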
- STAR Mapping RNAseq Data
nano 02_STAR_RNAseq_mapping.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=150G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
--genomeDir dd_genome_index \
--readFilesIn ./AS111_RNAseq_reads/AS111_merged_R1.fastq.gz ./AS111_RNAseq_reads/AS111_merged_R2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix dd_as111-rna \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--limitBAMsortRAM 107374182400 \
--runThreadN 40 \
>& star_dd_as111.out
I had to increase the maximum number of open file descriptors. STAR required a higher limit than the current setting.
## Check the current limit
ulimit -n
# 1024
## Change limit
ulimit -n 4096
sbatch 02_STAR_RNAseq_mapping.qsh
- Upon successful completion, STAR will generate a file named `<prefix>Log.final.out` with stats about mapping.
- Here is my `dd_as111-rnaLog.final.out`:
Started job on | Jun 05 10:42:04
Started mapping on | Jun 05 10:42:04
Finished on | Jun 05 11:46:07
Mapping speed, Million of reads per hour | 183.49
Number of input reads | 195877513
Average input read length | 302
UNIQUE READS:
Uniquely mapped reads number | 166801805
Uniquely mapped reads % | 85.16%
Average mapped length | 297.49
Number of splices: Total | 77102939
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 72226779
Number of splices: GC/AG | 4543990
Number of splices: AT/AC | 24329
Number of splices: Non-canonical | 307841
Mismatch rate per base, % | 0.33%
Deletion rate per base | 0.01%
Deletion average length | 1.54
Insertion rate per base | 0.00%
Insertion average length | 1.37
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 2002023
% of reads mapped to multiple loci | 1.02%
Number of reads mapped to too many loci | 1006559
% of reads mapped to too many loci | 0.51%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 26036620
% of reads unmapped: too short | 13.29%
Number of reads unmapped: other | 30506
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
All of the above commands and outputs for STAR were placed into a directory named AS111_star_outputs at /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR.
I eventually found out in BRAKER3 that the RNAseq reads from the AS111 isolate alone were not enough data, so I concatenated the R1 reads from every isolate/treatment, did the same for R2, and re-ran STAR to have enough RNAseq data for BRAKER3.
- Re-run STAR with all RNAseq data
nano 02_STAR_RNAseq_mapping.qsh
#!/bin/bash
#SBATCH --job-name=run_star
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=150G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus-bigmem
#SBATCH --qos=campus-bigmem
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
/lustre/isaac/proj/UTK0032/sniece/software/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
--genomeDir dd_genome_index \
--readFilesIn /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/all_isolates_RNAseq_reads/all_isoaltes_all_treatments_merged_R1.fastq.gz /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/all_isolates_RNAseq_reads/all_isoaltes_all_treatments_merged_R2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix dd_as111-rna \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--limitBAMsortRAM 107374182400 \
--runThreadN 40 \
>& star_dd_as111.out
I had to increase the maximum number of open file descriptors. STAR required a higher limit than the current setting.
## Check the current limit
ulimit -n
# 1024
## Change limit
ulimit -n 4096
sbatch 02_STAR_RNAseq_mapping.qsh
- Upon successful completion, STAR will generate a file named `<prefix>Log.final.out` with stats about mapping.
- Here is my `dd_as111-rnaLog.final.out`:
Started job on | Jun 06 17:41:53
Started mapping on | Jun 06 17:41:54
Finished on | Jun 07 10:02:50
Mapping speed, Million of reads per hour | 36.67
Number of input reads | 599453980
Average input read length | 302
UNIQUE READS:
Uniquely mapped reads number | 311262036
Uniquely mapped reads % | 51.92%
Average mapped length | 296.45
Number of splices: Total | 132226573
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 123994077
Number of splices: GC/AG | 7580649
Number of splices: AT/AC | 62631
Number of splices: Non-canonical | 589216
Mismatch rate per base, % | 0.37%
Deletion rate per base | 0.01%
Deletion average length | 1.81
Insertion rate per base | 0.01%
Insertion average length | 1.68
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 22043572
% of reads mapped to multiple loci | 3.68%
Number of reads mapped to too many loci | 18994290
% of reads mapped to too many loci | 3.17%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 246789245
% of reads unmapped: too short | 41.17%
Number of reads unmapped: other | 364837
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
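To compare the two STAR runs side by side, the `Log.final.out` files can be parsed into dictionaries. This is a minimal sketch that relies only on the `key | value` layout shown in the logs; the function names are my own, and the example lines are copied from the two logs above.

```python
def parse_star_log(lines):
    """Parse 'key | value' lines from a STAR Log.final.out into a dict."""
    stats = {}
    for line in lines:
        if "|" not in line:
            continue  # section headings such as 'UNIQUE READS:'
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '85.16%' to a float."""
    return float(stats[key].rstrip("%"))


as111 = parse_star_log([
    "Number of input reads | 195877513",
    "UNIQUE READS:",
    "Uniquely mapped reads % | 85.16%",
    "% of reads unmapped: too short | 13.29%",
])
all_isolates = parse_star_log([
    "Number of input reads | 599453980",
    "UNIQUE READS:",
    "Uniquely mapped reads % | 51.92%",
    "% of reads unmapped: too short | 41.17%",
])
for key in ("Uniquely mapped reads %", "% of reads unmapped: too short"):
    print(key, percent(as111, key), percent(all_isolates, key))
```

With real files, pass an open handle, e.g. `parse_star_log(open("dd_as111-rnaLog.final.out"))`.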
04. BRAKER3 (v3.0.8)
Isaac Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/04_BRAKER
- BRAKER3 performs gene prediction (structural annotation) using the masked genome and the bam file that was generated by STAR.
Input files for BRAKER3:
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/dd_as111-rnaAligned.sortedByCoord.out.bam .
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/03_STAR/dd_as111_100x_final_nu_final_mt_combined.fasta.masked .
- I renamed the mitochondrial scaffold in `dd_as111_100x_final_nu_final_mt_combined.fasta.masked` to just "scaffold_mt" because I eventually found out that GeneMark-ETP throws an error that it can't find "scaffold_mt" when the old mitochondrial scaffold name is used (I had initially left it as what OATK listed it as).
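The rename itself is a one-header change. Here is a minimal Python sketch of it; `old_mt_name` is a hypothetical placeholder because the original OATK header is not recorded here, and the function name is my own.

```python
def rename_sequence(fasta_lines, old_id, new_id):
    """Yield FASTA lines, replacing the header whose first word equals old_id."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields and fields[0] == old_id:
                yield f">{new_id}\n"  # description text after the ID is dropped
                continue
        yield line


# "old_mt_name" is a made-up stand-in for the header OATK produced
example = [">scaffold_1\n", "ACGT\n", ">old_mt_name some description\n", "acgt\n"]
renamed = list(rename_sequence(example, "old_mt_name", "scaffold_mt"))
print("".join(renamed), end="")
```

Matching on the whole first word (rather than a substring) avoids renaming a nuclear scaffold by accident.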
Now moving over to Sphinx:
Sphinx Directory: /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER
chmod 775 dd_as111-rnaAligned.sortedByCoord.out.bam
chmod 775 dd_as111_100x_final_nu_final_mt_combined.fasta.masked
- Download the orthoDB protein database for fungi.
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Fungi.fa.gz
#gunzip
gunzip Fungi.fa.gz
- Set the path for BRAKER and AUGUSTUS config files
export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
- Set path for AUGUSTUS config file in singularity interactive shell
singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
#Exit the interactive shell
Ctrl + D
- Run BRAKER3
nano run_braker.sh
mkdir braker_outputs
singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=dd_as111_100x_final_nu_final_mt_combined.fasta.masked \
--bam=dd_as111-rnaAligned.sortedByCoord.out.bam \
--prot_seq=Fungi.fa \
--workingdir=braker_outputs \
--threads 20 \
--fungus \
--useexisting \
--gff3 \
--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
--species=Discula_destructiva
cpulimit -l 2000 -i bash run_braker.sh
- I'm using `cpulimit` because BRAKER uses way more CPU than you specify. I installed cpulimit on pickett_shared, so you can copy the executable to your personal bin directory by doing: `cp /pickett_shared/software/cpulimit/src/cpulimit /home/user/bin`
- BRAKER3 (for me, with a 44 GB .bam file and a 46 Mb soft-masked genome) took ~3 hrs to complete.
- The main files will be `braker.gff3`, `braker.aa`, and `braker.codingseq`.
Rename the output files to be slightly more descriptive:
mv braker.aa dd_aa-proteins.fasta
mv braker.codingseq dd_genes.fasta
mv braker.gff3 dd.gff3
- Check the stats on gff3 file
cat dd.gff3 | awk '{a[$3]++}END{for(k in a){print k,a[k]}}'
mRNA 11480
exon 29610
CDS 29610
intron 18130
gene 10373
start_codon 11473
stop_codon 11473
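The same feature tally can be produced in Python, which also skips `#` comment lines that the awk one-liner would count as features. This is a minimal sketch; the function name and the toy GFF3 records are my own.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count values in column 3 (feature type) of a GFF3 file."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts


# Made-up records in GFF3 layout
example = [
    "##gff-version 3",
    "scaffold_1\tAUGUSTUS\tgene\t1\t900\t.\t+\t.\tID=g1;",
    "scaffold_1\tAUGUSTUS\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1;",
    "scaffold_1\tAUGUSTUS\tCDS\t1\t400\t.\t+\t0\tID=g1.t1.CDS1;Parent=g1.t1;",
    "scaffold_1\tAUGUSTUS\tCDS\t500\t900\t.\t+\t2\tID=g1.t1.CDS2;Parent=g1.t1;",
]
for feature, n in sorted(count_feature_types(example).items()):
    print(feature, n)
```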
- Run BUSCO on the protein fasta file with the ascomycota and sordariomycetes databases.
The protein file is located here:
/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs/dd_aa-proteins.fasta
Run BUSCO on the protein fasta file for the ascomycota database:
cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs
singularity exec -B $PWD /sphinx_local/images/ezlabgva-busco-v5.6.1_cv1.img busco -i dd_aa-proteins.fasta -m proteins -l ascomycota -c 10 -o busco_ascomycota_results
|Results from dataset ascomycota_odb10 |
---------------------------------------------------
|C:97.4%[S:89.4%,D:8.0%],F:0.4%,M:2.2%,n:1706 |
|1663 Complete BUSCOs (C) |
|1526 Complete and single-copy BUSCOs (S) |
|137 Complete and duplicated BUSCOs (D) |
|7 Fragmented BUSCOs (F) |
|36 Missing BUSCOs (M) |
|1706 Total BUSCO groups searched |
---------------------------------------------------
Run BUSCO on the protein fasta file for the sordariomycetes database:
cd /pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs
singularity exec -B $PWD /sphinx_local/images/ezlabgva-busco-v5.6.1_cv1.img busco -i dd_aa-proteins.fasta -m proteins -l sordariomycetes -c 10 -o busco_sordariomycetes_results
---------------------------------------------------
|Results from dataset sordariomycetes_odb10 |
---------------------------------------------------
|C:95.5%[S:85.6%,D:9.9%],F:0.4%,M:4.1%,n:3817 |
|3647 Complete BUSCOs (C) |
|3268 Complete and single-copy BUSCOs (S) |
|379 Complete and duplicated BUSCOs (D) |
|16 Fragmented BUSCOs (F) |
|154 Missing BUSCOs (M) |
|3817 Total BUSCO groups searched |
---------------------------------------------------
05. EnTAP (v1.0.0)
- Documentation: https://github.com/harta55/EnTAP
- Sphinx Directory:
/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP
- EnTAP will perform functional annotation of the protein fasta we made using BRAKER3
- Copy the protein fasta file to the working directory
cp ../04_BRAKER/braker_outputs/dd_aa-proteins.fasta .
- Load required dependencies
spack load [email protected]%[email protected]
spack load rsem
spack load interproscan
spack load transdecoder
- Run EnTAP
nano run_entap.sh
/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i dd_aa-proteins.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 10
bash run_entap.sh
EnTAP will produce a number of final files in final_results:
- `annotated_without_contam.faa`: FASTA-formatted amino acid / protein file
- `annotated_without_contam_gene_ontology_terms.tsv`: Tab-delimited file that can be used for Gene Enrichment. Columns are as follows: Sequence ID, Gene Ontology Term ID, Gene Ontology Term, Gene Ontology Category, and Effective Length.
- `annotated_without_contam.tsv`: Tab-delimited file of the annotated sequences that were not flagged as a contaminant
- Calculate the number of sequences that were annotated using EnTAP
grep '>' annotated_without_contam.faa | wc -l
- Results: 10505
- This counts protein sequences (one per transcript), so the matching BRAKER3 number is the 11480 proteins given to EnTAP (the mRNA count above), not the 10373 genes.
- Calculate the percentage of sequences retained from structural annotation (BRAKER3) to functional annotation (EnTAP):
(EnTAP annotated sequences) / (BRAKER3 protein count) * 100 => (10505 / 11480) * 100 = 91.51%
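The protein FASTA holds one entry per transcript (IDs such as `g1868.t1` in the EnTAP log), so a gene-level count requires collapsing transcript IDs to gene IDs first. A minimal Python sketch, assuming BRAKER-style IDs of the form `<gene>.t<number>`; the function name and toy records are my own.

```python
def genes_from_fasta_headers(fasta_lines):
    """Collapse transcript IDs such as 'g1868.t1' to gene IDs such as 'g1868'."""
    genes = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        transcript_id = fields[0]
        # Assumption: the gene ID is everything before the final '.t<number>'
        genes.add(transcript_id.rsplit(".t", 1)[0])
    return genes


# Toy input: three proteins from two genes
example = [">g1.t1", "MKT", ">g1.t2", "MKA", ">g2.t1", "MST"]
annotated_genes = genes_from_fasta_headers(example)
print(len(annotated_genes), "genes with at least one annotated protein")
```

Running it on `annotated_without_contam.faa` gives a gene count that can be compared directly with the 10373 genes in the BRAKER3 gff3.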
Some more important stats:
nano log_file_2024Y6M7D-16h38m4s.txt
------------------------------------------------------
Transcriptome Statistics
------------------------------------------------------
Protein sequences found
Total sequences: 11480
Total length of transcriptome(bp): 17202825
Average sequence length(bp): 1498.00
n50: 1770
n90: 852
Longest sequence(bp): 14823 (g1868.t1)
Shortest sequence(bp): 42 (g2578.t1)
------------------------------------------------------
Final Annotation Statistics
------------------------------------------------------
Total Input Sequences: 11480
Similarity Search
Total unique sequences with an alignment: 6836 (59.55% of total input sequences)
Total alignments flagged as a contaminant: 0 (0.00% of total unique alignments)
Total alignments NOT flagged as a contaminant: 6836 (100.00% of total unique alignments)
Total unique sequences without an alignment: 4644 (40.45% of total input sequences)
Gene Families
Total unique sequences with family assignment: 10504 (91.50% of total input sequences)
Total unique sequences without family assignment: 976 (8.50% of total input sequences)
Total unique sequences with at least one GO term: 8330 (72.56% of total input sequences)
Total unique sequences with at least one pathway (KEGG) assignment: 2725 (23.74% of total input sequences)
Totals
Total retained sequences (after filtering and/or frame selection): 11480
Total unique sequences annotated (similarity search alignments only): 1 (0.01% of total retained)
Total unique sequences annotated (gene family assignment only): 3669 (31.96% of total retained)
Total unique sequences annotated (gene family and/or similarity search): 10505 (91.51% of total retained)
Total alignments flagged as a contaminant (gene family and/or similarity search): 0 (0.00% of total unique alignments)
Total alignments NOT flagged as a contaminant (gene family and/or similarity search): 6836 (100.00% of total unique alignments)
Total unique sequences unannotated (gene family and/or similarity search): 975 (8.49% of total retained)
Some other important files/paths:
- final report from EnTAP:
/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/entap_results.tsv
- final transcriptome assembly:
/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/transcriptomes/dd_aa-proteins_final.fasta
- tsv file of the annotated sequences that were not flagged as a contaminant:
/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam.tsv
- fasta file of the annotated sequences that were not flagged as a contaminant:
/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam.faa
- tsv file of the annotated sequences that were not flagged as a contaminant, which can be used for Gene Enrichment:
/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam_gene_ontology_terms.tsv
- various gene ontology figures produced from EggNOG:
/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/ontology/EggNOG_DMND/figures
06. GO/KEGG Terms
- Working directory:
/lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/06_GO_terms
- I wanted to create a spreadsheet for all of the GO terms and KEGG terms that were annotated from EnTAP.
- Copy over the `annotated_without_contam_gene_ontology_terms.tsv` & `entap_results.tsv` files.
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/annotated_without_contam_gene_ontology_terms.tsv .
cp /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/05_EnTAP/entap_outfiles/final_results/entap_results.tsv .
- Create spreadsheets. I used awk and ChatGPT to help me pull out all of the gene IDs and GO/KEGG terms.
#GO file creation
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3, $4 }' annotated_without_contam_gene_ontology_terms.tsv > dd-GO.tsv
#KEGG file creation
awk -F'\t' 'BEGIN { OFS = "\t" } { print $1, $32 }' entap_results.tsv > dd-KEGG.tsv
- this produced 2 files:
dd-GO.tsv
dd-KEGG.tsv
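The column extraction can also be written in Python, which handles rows with missing trailing fields explicitly. This is a minimal sketch; the function name is my own, and the three-column example (including its header names) is made up. For `entap_results.tsv` the columns would be `(1, 32)`, as in the awk command.

```python
def extract_columns(lines, column_numbers):
    """Yield the selected 1-based columns from tab-separated lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Rows that are too short get an empty string for the missing column
        yield [fields[n - 1] if n <= len(fields) else "" for n in column_numbers]


# Made-up three-column table; real EnTAP output has many more columns
example = [
    "Query Sequence\tDescription\tKEGG Terms",
    "g1.t1\tprotein A\tko00010",
    "g2.t1\tprotein B",
]
rows = list(extract_columns(example, (1, 3)))
for row in rows:
    print("\t".join(row))
```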
07. SignalP (v5)
- Working directory:
/lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/07_signalp_secretome
- SignalP Documentation: https://services.healthtech.dtu.dk/services/SignalP-5.0/
- Install SignalP.
- To download SignalP, you have to submit a download request on their website. I installed it in my software directory:
/lustre/isaac/proj/UTK0032/sniece/software/signalp-5.0b
- Run SignalP.
#!/bin/bash
#SBATCH --job-name=run_signalp
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --mem=100G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=short
#SBATCH --qos=short
#SBATCH --time=03:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
#Link executable here because pathing directly to it doesn't work...
ln -s /lustre/isaac/proj/UTK0032/sniece/software/signalp-5.0b/bin/signalp .
# Run signalp
./signalp \
-fasta dd_aa-proteins.fasta \
-format long \
-gff3 \
-mature \
-org euk \
-plot png \
-prefix dd_signalp
# -batch int
# Number of sequences that the tool will run simultaneously. Decrease or increase size depending on your system memory. (default 10000)
# -fasta string
# Input file in fasta format.
# -format string
# Output format. 'long' for generating the predictions with plots, 'short' for the predictions without plots. (default "short")
# -gff3
# Make gff3 file of processed sequences.
# -mature
# Make fasta file with mature sequence. (this will contain only those proteins that are part of the secretome)
# -org string
# Organism. Archaea: 'arch', Gram-positive: 'gram+', Gram-negative: 'gram-' or Eukarya: 'euk' (default "euk")
# -plot string
# Plots output format. When long output selected, choose between 'png', 'eps' or 'none' to get just a tabular file. (default "png")
# -prefix string
# Output files prefix. (default "Input file prefix")
# -stdout
# Write the prediction summary to the STDOUT.
# -tmp string
# Specify temporary file directory. (default "System default tmpdir")
# -verbose
# Verbose output. Specify '-verbose=false' to avoid printing. (default true)
# -version
# Prints version.
- This will produce a few files with your predicted secretome and a spreadsheet of predicted signal peptides.
08. CAZyme Detection using dbCAN (version 3.0)
- Working Directory:
/lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/08_CAZyme_dbCAN
- dbCAN documentation: https://dbcan.readthedocs.io/en/latest/installation.html
- I ended up running dbCAN3 via the web server because I was having issues running it with conda. You can also annotate CAZymes using your proteome here: https://bcb.unl.edu/dbCAN2/blast.php
- Run dbCAN3 on the web server.
- You will get an email with the results. Mine are saved as `dbCAN3_output_CAZyme_prediction.xlsx`.
- They will give you CAZyme subfamily predictions along with how many lines of evidence each predicted CAZyme has. Per the authors' recommendation, predicted CAZymes supported by at least 2 tools should be considered:
Gene ID | EC# | HMMER | dbCAN_sub | DIAMOND | Signalp | #ofTools
--------- | -------- | ------------- | ------- | ----- | ------- | --
g10011.t1 | 3.2.1.78 | GH5_7(76-362) | GH5_e89 | GH5_7 | Y(1-24) | 3
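Filtering on the `#ofTools` column is easy to script once the results are exported as text. This is a minimal sketch that parses the pipe-delimited layout shown above (not the `.xlsx` file directly); the function names are my own, and the second data row is made up to show a prediction that gets filtered out.

```python
def parse_pipe_table(lines):
    """Parse pipe-delimited rows (header first) into a list of dicts."""
    rows = [[cell.strip() for cell in line.split("|")] for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    # Drop separator rows whose cells contain only dashes
    body = [row for row in body if not all(set(cell) <= {"-"} for cell in row)]
    return [dict(zip(header, row)) for row in body]


def well_supported(rows, min_tools=2):
    """Keep predictions supported by at least min_tools tools."""
    return [row for row in rows if int(row["#ofTools"]) >= min_tools]


table = [
    "Gene ID | EC# | HMMER | dbCAN_sub | DIAMOND | Signalp | #ofTools",
    "--------- | -------- | ------------- | ------- | ----- | ------- | --",
    "g10011.t1 | 3.2.1.78 | GH5_7(76-362) | GH5_e89 | GH5_7 | Y(1-24) | 3",
    "g99999.t1 | - | GH16(10-200) | - | - | N | 1",  # made-up row
]
kept = well_supported(parse_pipe_table(table))
print([row["Gene ID"] for row in kept])
```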
09. EffectorP (version 3.0)
- Working Directory:
/lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_Annotation/09_effectorP/EffectorP-3.0
- Install EffectorP
mkdir 09_effectorP
cd 09_effectorP
git clone https://github.com/JanaSperschneider/EffectorP-3.0.git
cd EffectorP-3.0
unzip weka-3-8-4.zip
- Secure copy the protein fasta file from the genome annotation on Sphinx so EffectorP can use it as input
scp '[email protected]:/pickett_sphinx/projects/lwy647/DisculaDestructiva_Annotation/04_BRAKER/braker_outputs/dd_aa-proteins.fasta' .
- Run EffectorP. I think there was some miscommunication between SLURM and finding the java file for one of the dependencies, so I just ran it directly from the command line instead of submitting it through SLURM, since it isn't computationally intensive and takes very little time (it finished in 2 minutes).
python EffectorP.py \
-f \
-E dd_effectors \
-o dd_effectors_output \
-i dd_aa-proteins.fasta
- `-f`: run in "fungal" mode (EffectorP is also used for oomycetes, so that's why I ran in fungal mode)
- `-E`: name of new fasta file that contains only effectors
- `-o`: output file name
- `-i`: input protein fasta to be used
Output:
EffectorP results were saved to output file: dd_effectors_output
-----------------
11480 proteins were provided as input in the following file: dd_aa-proteins.fasta
-----------------
Number of predicted effectors: 2763
Number of predicted cytoplasmic effectors: 2392
Number of predicted apoplastic effectors: 371
-----------------
24.1 percent are predicted effectors.
20.8 percent are predicted cytoplasmic effectors.
3.2 percent are predicted apoplastic effectors.
-----------------
NOTE: EffectorP was run in fungal mode.
-----------------
- `dd_effectors_output` is the tsv file that contains all the information for each individual protein.