Predicting secreted proteins from genome - Environmentalpublichealth/Fungal_PFASremediation GitHub Wiki

Secretome annotation follows Pellegrin et al. (2015).

All analyses are stored at $SCRATCH/PFAS/genomes/Secretome_prediction_Feb2023 on Grace.

Predict proteins containing signal peptides

We use SignalP 6.0 in CPU mode. It takes a protein FASTA file as input. I ran it with the default settings.

ln -s ../annotationS12_Nov2022/gFACs/braker_filtered_function_filtered.genes.faa S12_proteins.fa # symlink the protein file into the working directory

# load SignalP
module load GCC/11.2.0  OpenMPI/4.1.1
module load SignalP/6.0g.fast

signalp6 -h # get help page

export FAST_MODEL_PATH=/scratch/user/jialiyu/signalp6_fast/signalp-6-package/models/distilled_model_signalp6.pt

signalp6 --fastafile S12_proteins.fa \
--organism eukarya \
--output_dir SignalP_results \
--format txt \
--write_procs 8

The run takes ~2 hours and predicted that 704 proteins have signal peptides. Scores are written to prediction_results.txt, and the sequences of the positive proteins are written to processed_entries.fasta.
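The count of positive proteins can be double-checked from the score table. The sketch below is a minimal example, not part of the original workflow. It assumes prediction_results.txt is tab-separated, that header lines start with `#`, and that the second column holds the prediction label with `OTHER` meaning no signal peptide. The function name is hypothetical.

```python
import sys


def ids_with_signal_peptide(lines):
    """Return IDs of rows whose prediction label is not OTHER.

    Assumes a tab-separated table: ID in column 1, label in column 2,
    and '#' at the start of header/comment lines.
    """
    ids = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows
        if fields[1] != "OTHER":
            # keep only the first word of the ID column
            ids.append(fields[0].split()[0])
    return ids


# Runs only when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        print(len(ids_with_signal_peptide(handle)))
```

Run it as, for example, `python count_sp.py SignalP_results/prediction_results.txt` (the script name is arbitrary); the printed number should match the number of records in processed_entries.fasta.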

Next, we use TMHMM to predict transmembrane (TM) regions in these 704 proteins and filter out likely TM proteins.

TMHMM

module load GCC/11.2.0  OpenMPI/4.1.1
module load TMHMM/2.0c
mkdir TMHMM
cd TMHMM
tmhmm ../SignalP_results/processed_entries.fasta > tmhmm_outputs.txt

The run finished in a few minutes. The output text reports whether transmembrane domains were detected in each protein. I used a Python script to pull out the proteins without a TM domain.

This left 570 proteins without TM domains.
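The original filtering script is not shown, so the following is only a sketch of how such a filter might look. It assumes the default TMHMM long output, in which each protein has a summary line of the form `# <ID> Number of predicted TMHs:  <n>`, and it keeps the IDs where that number is 0. The function name and the command-line handling are hypothetical.

```python
import re
import sys

# Summary line in TMHMM long output, e.g. "# g1.t1 Number of predicted TMHs:  0"
TMH_LINE = re.compile(r"^#\s+(\S+)\s+Number of predicted TMHs:\s+(\d+)")


def ids_without_tm(lines):
    """Return IDs whose predicted number of TM helices is zero."""
    ids = []
    for line in lines:
        match = TMH_LINE.match(line)
        if match and int(match.group(2)) == 0:
            ids.append(match.group(1))
    return ids


# Usage: python filter_noTM.py tmhmm_outputs.txt out.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as handle:
        keep = ids_without_tm(handle)
    with open(sys.argv[2], "w") as out:
        for protein_id in keep:
            out.write(protein_id + "\n")  # one ID per line
```

Writing one ID per line gives a list that `seqtk subseq` can read directly.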

Extract the 570 protein sequences by ID, using the ID list from the previous step (out.txt, one ID per line).

module purge
module load GCC/10.2.0
module load seqtk
seqtk subseq ../SignalP_results/processed_entries.fasta out.txt > noTM_proteins.fasta
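`seqtk subseq` silently skips IDs it cannot find, so it can be worth confirming that every requested ID ended up in the output. The sketch below is an optional sanity check, not part of the original workflow. It assumes the ID is the first word after `>` in each FASTA header, and the function names are hypothetical.

```python
import sys


def fasta_ids(lines):
    """Return the first word of every FASTA header line."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    ]


def missing_ids(wanted_lines, fasta_lines):
    """Return requested IDs that do not appear in the FASTA headers."""
    found = set(fasta_ids(fasta_lines))
    wanted = [w.strip() for w in wanted_lines if w.strip()]
    return [w for w in wanted if w not in found]


# Usage: python check_ids.py out.txt noTM_proteins.fasta
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as id_handle, open(sys.argv[2]) as fa_handle:
        missing = missing_ids(id_handle, fa_handle)
    print(f"{len(missing)} requested IDs missing from the FASTA file")
```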

Then we predict the subcellular location of these proteins to filter them further.

TargetP and WolfPsort

I could not find a TargetP module on the HPC, so I used the web tool: https://services.healthtech.dtu.dk/service.php?TargetP-2.0

7 proteins were predicted to carry a mitochondrial transit peptide (mTP).

Run WolfPsort

module load WoLFPSort/0.2
wolfPredict -h

This does not work, and I could not find any manual for the standalone program. Instead, I used a newer program, DeepLoc 2.0 (https://academic.oup.com/bioinformatics/article/33/21/3387/3931857), on its web server. The server accepts at most 500 proteins per submission, so I split the 570 proteins into two files and submitted each to the web tool.

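# head/tail count lines, not records; seqtk writes each record as 2 lines (header + unwrapped sequence), so 500 proteins = 1000 lines
head -n 1000 noTM_proteins.fasta > noTM_proteins1-500.fasta
tail -n +1001 noTM_proteins.fasta > noTM_proteins500-end.fasta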
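Because `head` and `tail` count lines rather than sequences, a record-aware split is a safer way to respect the 500-protein limit, whatever the line wrapping of the FASTA file. The following is a minimal sketch of that idea, not the method used above. The function names and the `chunk_<n>.fasta` output names are hypothetical.

```python
import sys


def read_fasta(lines):
    """Yield (header, sequence) tuples, joining wrapped sequence lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def chunk_records(records, size=500):
    """Group records into lists of at most `size` records each."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Usage: python split_fasta.py noTM_proteins.fasta
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        chunks = chunk_records(read_fasta(handle), size=500)
        for i, chunk in enumerate(chunks, start=1):
            with open(f"chunk_{i}.fasta", "w") as out:
                for header, seq in chunk:
                    out.write(f"{header}\n{seq}\n")
```

For 570 proteins this would write two files, one with 500 records and one with 70.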