4 Documentation - Bio2Byte/simsapiper GitHub Wiki
Extensive representation of pipeline workflow
Input files
Magic_align.sh
Launch file: - Creates log file
*.nflog
with working directories for all subjobs, error messages, and execution hash for resuming - Add more flags to standardize runs, alternatively create profiles in
nextflow.config
file
#!/bin/bash
#Adapt this line:
house=full/path/to/your/data/directory
#Adapt this line:
data=data_folder_containing_seqs_and_structures
now=`date +"%Y_%m_%d_%H_%M_%S"`
#Adapt this line:
output_name=${data}_${now}_description
output_folder=$house/results/$output_name
mkdir -p $house/results/
mkdir -p $house/results/$output
#Adapt here:
/path/to/software/nextflow run /path/to/simsapiper.nf \
-profile select_infrastructure,select_container \
--data $house/$data \
--any_flag value_followed_by_\
--magic \
--outName magicMsa \
--outFolder $output_folder \
|& tee $output_folder/run_report_$output_name.nflog
sessionName=$(sed -n '2s/.*\[\(.*\)\].*/\1/p' $output_folder/run_report_$output_name.nflog)
#Adapt this line:
/path/to/software/nextflow log | grep $sessionName >> $output_folder/run_report_$output_name.nflog
#remove all comment lines before launching
Nextflow.config
Config file: - Holds all variables needed to adapt SIMSApiper to your system
- Local, server and HPC-SLURM execution profiles are prepared
- Edit to change standard parameters (as listed under available flags)
- Learn how to make Nextflow profiles here
Sequence Input (--seqs)
Input: data/seqs/
- Sequence file(s):
- One file with all sequences
- Multiple sequence files (can be subsets for T-Coffee with --use_subsets)
- Sequence format: If not FASTA formatted, provide file format with --seqFormat according to biopython nomenclature
- Sequence IDs: If --retrieve: use Uniprot ID as sequence ID to enable retrieval from AlphaFold Database
Notes: No spaces in the filename(s) allowed, use '_' instead
Structural input (--structures)
Input: data/structures/
- Structural input is optional
- Must be in .pdb format
- Sequence labels and the structure filenames must match exactly!
>P25106 will only match P25106.pdb - Can be experimentally generated structures, from the PDB or modeled structures
Good to know:
- Multiple chains are not allowed by T-Coffee. Extract the chain of interest using: pdb-tools (also make sure you extract your chain of interest in the SEQRES section)
- Mutations in 3D structure will not impact MSA quality IF mutations do not impact the overall organisation of the protein (pdb_min_sim parameter of T-Coffee set to 1 by default)
- Step squeeze will fail if more than 5% of your protein structure is mutated compared to its sequence.
- The sequence to align can be a section of the sequence of the structure (pdb_min_cov parameter of T-Coffee set to 1 by default)
- Modeled structure information (even when of lower quality) has been shown to improve the quality of MSAs
- Omit possible issues by computing/retrieving the predicted model of your sequences:
- Using SIMSApiper:
- Retrieve AlphaFold2 models from AlphaFoldDB (only if protein in AlphaFoldDB).
- Generate ESMFold models using ESMFold API (only for proteins smaller than 400 residues).
- Generate ESMFold models locally (GPUs required).
- Other options:
- ColabFold (very user friendly)
- AlphaFold2 software
- Using SIMSApiper:
Output directory (--outFolder)
- Nextflow creates directory
results
in Nextflow directory - Every run creates a unique subdirectory
- Set name using --outFolder
Output file name (--outName)
- Select name for the generated MSA using --outName
1. Preprocessing
1.1 Convert Sequence Files
Output: results/outFolder/seqs/
- Parse sequence files using biopython to provide sequences in fasta-2-line format
Stop alignment of identical sequences (--stopHyperconserved)
Why?
- When aligning automatically annotated sets of ortholog proteins, occasionally all residues are conserved in every sequence. No need to align!
How?
- If CD-Hit clustering using 100% sequence indentity cutoff returns only one cluster, the SIMSApiper run is stopped.
1.2 Reduce dataset size ( --dropSimilar)
Output: results/outFolder/seqs/CD-HIT/
Why?
- Calculation time scales exponentially with number of input sequences
- Set to:
- 90% to remove redundant sequences
- 70% to have a more general overview of the protein family and address sampling bias
How?
- Use CD-Hit with high sequence identity cutoff
- Keep cluster representatives
- Proteins inside clusters will be excluded from the downstream processes
- Add important proteins to the representatives with --favoriteSeqs
- Find in which cluster a specific protein is in
results/outFolder/seqs/CD-HIT/*.clstr
- Find excluded similar sequences in
results/outFolder/seqs/CD-HIT/removed_*_perc_similar.fasta
1.3 Quality control for input sequences (--seqQC)
Output: results/outFolder/seqs/too_many_unknown_characters.fasta
Why?
- Too many non-standard amino acids can result in a poor alignment
How?
- Remove all sequences containing more than --seqQC % non-standard amino acids
Sequence length filter (--seqLen)
Output: results/outFolder/seqs/too_short_seqs.fasta
Why?
- Short sequences cover only a part of the protein, thereby providing limited information and adding many gaps to the alignment
How?
- Remove all sequences shorter than --seqLen (standard: 50) amino acids
2. Match sequence with structure information
2.1 Identify missing models
- Match sequence ID and structure model filenames
Common Issues:
- Models not matched to a sequence present in input file can be caused by spelling or exclusion by --dropSimilar or --seqQC
2.2 Retrieve models from AlphaFold Protein Structure Database (--retrieve)
Output: data/structures/uniprot_id.pdb
Why?
- AlphaFold2 models are considered the most accurate protein structure predictions
How?
- Only possible if sequence ID starts with the Uniprot ID
>UniprotID_more_information
Common Issues:
- Entries in Uniprot that are not on the AlphaFold Database: generate model yourself and add manually to
data/structures/
2.3 Identify missing models
(same as step 2.1)
2.4 Model Sequences with ESMFold (--model)
Output: data/structures/sequence_id.pdb
Why?
- Very fast prediction of novel protein structure models from sequence
- No need to know the Uniprot ID
How?
- Submit sequence to ESMFold API for modelling, retrieve model on completion
Common Issues:
- Sequences are longer then 400 residues
- Sequence contains any characters not representing an amino acid
- API rate limit causes jobs to randomly fail: rerun SIMSApiper a few times to ensure maximal amount of models retrieved OR generate model yourself and add manually to
data/structures/
2.4 Alternative: Model Sequences with ESMFold (--localModel 1)
Output: data/structures/sequence_id.pdb
, data/structures/sequence_id.pae
, data/structures/esm_fold_statistics.csv
Why?
- Very fast prediction of novel protein structure models from sequence
- No need to know the Uniprot ID
- Automatically generate structure quality metrics on a per-residue level (*.pae files) and globally (esm_fold_statistics.csv)
How?
- Run structure predictions with ESMfold on local GPUs
Common Issues:
- --localModel needs to be set to a number (e.g. 0.5, 1, 23) which corresponds to the amount of resources (walltime in h on a HPC) allocated to the task. --localModel true returns an error.
- Model calculation time increases exponentially with sequence length
- Requires graphics card with at least 16GB GPU memory (! not CPU memory)
- for proteins with >900 AA we needed even more GPU memory
- 8GB GPU memory is enough to predict proteins <250 AA
2.5 Identify missing models
(same as step 2.1)
2.6 Assess number of structure models and sequences to be aligned (--strucQC)
Output: results/outFolder/seqs/seqs_to_align.fasta
, results/outFolder/seqs/structureless_seqs.fasta
Why?
- Alignment quality decreases if too many sequences are not matched to a model
How?
- Job fails if more than --strucQC % of the sequences are structureless before starting the alignment step
- Sequences matched to a model:
seqs_to_align.fasta
- Sequences not matched to a model:
structureless_seqs.fasta
3. Subsetting - Division of the dataset in different subsets
Why?
- Calculation time and space requirements of T-Coffee alignments scale exponentially with number of sequences
How?
- Sequences shorter than 400 residues: maximally 100 sequences per subset
- Sequences longer than 400 residues: maximally 50 sequences per subset
- Sequences that do not fit a subset: label "orphan", not aligned with T-Coffee but with MAFFT based on sequence
3.1 User-provided subsets (--useSubsets)
Input: multiple sequence files in data/seqs/
Output: results/outFolder/seqs/CD-HIT_for_t-coffee/*.fasta
- Use phylogenetic relationships or sequence identity to manually create subsets
- Sequences that do not fit in a subset can be collected in one file with “orphan” in filename
- Input for T-Coffee will be filtered for sequences matched to a protein structure model
3.2 Automatically generated subsets (--createSubsets)
Output: results/outFolder/seqs/CD-HIT_for_tcoffee/subset*.fasta
- Automatically cluster sequences with PSI-CD-HIT --createSubsets % sequence identity into a subset
- Set appropriate maximum number of members per cluster based on sequence length --maxSubsetSize
- Singleton clusters will be labeled with “orphan“
Common Issues:
- Subsets are too large:
- Increase initial clustering threshold --createSubsets
- Clusters get split randomly to attain evenly distributed clusters smaller then --maxSubsetSize
- More then 5% of the clusters are singletons:
- Threshold will interactively decrease by 5% until --minSubsetIdentity
- Set --minSubsetIdentity "min" to reduce overall number of subsets by collating small clusters and singletons until --maxSubsetSize reached
4. Run T-Coffee (--tcoffeeParams)
Output: results/outFolder/msas/t-coffee/aligned_subset_*.aln
Why?
- Shown to be the state-of-the-art structure-informed alignment method
- Combination of algorithms TMalign and SAP have shown to provide the best results
How?
- Each subset is aligned individually by T-Coffee
- This is the step that takes the longest (likely hours) but can run in parallel depending on available resources
- In -profile hpc, each subset is submitted up to 3 times with dynamically increasing resource requests to minimize queue time
- Subsets labelled “orphan” will not be aligned with T-Coffee but added to the structure informed MSA with MAFFT based on their sequence
- Temporary files will be created in
work/.t_coffee/tmp/tco/
- T-Coffee parameters in this pipeline:
t_coffee -in=subset_0.fasta -method TMalign_pair
-evaluate_mode=t_coffee_slow -mode=3dcoffee -pdb_min_sim=1 -pdb_min_cov=1 -thread 0
-outfile="output/t-coffee/subset_0_aligned.aln"
- Mode 3DCoffee uses method sap_pair by default
- Append any other T-Coffee flag with --tcoffeeParams “-quiet”
5. Run Mafft (--mafftParams)
Output: results/outFolder/msas/merged_finalmsa_alignment.fasta
Why?
- Combine T-Coffee subalignments
- Add orphan and/or structureless sequences
How?
- Mafft merge mode is conserving the subalignment structure, mainly adds gaps
- Mafft parameters in this pipeline:
mafft --merge tableMAFFT.txt input > prefinalMSA.fasta
- Append any other mafft flag or alignment modes with --mafftParams “--localpair --maxiterate 100”
- Find appropriate flags here
6. Run DSSP (--dssp)
Output: results/data/dssp/model_name.dssp
- Translate structure models into 2D secondary structure nomenclature using the DSSP codes
7. Improve MSA
7.1 Map DSSP to MSA
Output: results/outFolder/msas/dssp_merged_finalmsa_alignment.fasta
- Map DSSP codes on sequences of the MSA, conserving the gaps
Common Issues:
- Sequence of the model and the sequence to align have:
- Same length but more than 5% different (point mutations)
- Different lengths (deletions/insertions) in the middle of one of the sequences (only allowed at extremeties)
- Your structure file contains more than one chain: extract a specific chain using pdb-tools or use predicted models
- Find these troublesome sequences in
results/outFolder/msas/unmappable_dssp.fasta
and unconverted in the mapped alignment
7.2 Squeeze MSA towards conserved secondary structure elements (--squeeze)
Output: results/outFolder/msas/squeezed_dssp_merged_finalmsa_alignment.fasta
Why?
- MSA of unstructured or less conserved regions (e.g. loops) are usually very disperse
How?
- Identify conserved secondary structure categories such as helices and β-sheets with --squeeze “H,E”, and squeeze MSA towards these regions
- DSSP codes representing helices, i.e, H, G and I, are considered the same by SIMSApiper
- Select threshold for region to be considered 'conserved' with --squeezePerc
- Must have at least 3 consecutive conserved columns to be considered a conserved region
7.3 Map DSSP to squeezed MSA
Output: results/outFolder/msas/dssp_squeezed_dssp_merged_finalmsa_alignment.fasta
- Map DSSP codes on sequences of the squeezed MSA, conserving the gaps.
Common Issues: see above
8. Reorder MSA (--reorder)
Output: results/outFolder/reordered_*_merged_finalmsa_alignment.fasta
- Order MSA according to the order of sequences in the input files. If more then one sequence input file is provided, order MSA based on the order given in the files organized alphabetically. Instead of alphabetically, can also select explicitly how to organize the files with --reorder "gamma.fasta,delta.fasta”
9. Convert MSA
Output: results/outFolder/seqs/converted_*_merged_finalmsa_alignment.fasta
- Convert MSA using biopython from fasta-2-line format
10. Calculate phylogenetic tree
Output: results/outFolder/tree/*.nw
- Calculate phylogenetic tree using IQ-TREE2 automatic model finder and ultrafast bootstrap
Log files
Summary
Output: results/outFolder/simsapiper_summary.md
- Reports settings of all parameters
- Reports results of intermediate steps, such as number of sequences
- in the input files
- excluded because invalid
- matched to a protein structure model
- not matched to a protein structure model
- in the final aligned file
- Paths to all intermediate output files
- MSA statistics
- Gap reduction by secondary structure squeezing
- Conserved, gapless and total positions in the MSA
- Average pairwise sequence identity (Gaps were treated as missmatch, therefore generally underestimating the SI compared to other tools)
- Plots:
- Pairwise sequence identity histogram (all-vs-all)
Sequence_identity_histogram.pdf
,msa_stats/average_identity_per_sequence.csv
- Average sequence identity histogram per input sequence
Average_sequence_identity_histogram.pdf
,msa_stats/pairwise_identity_matrix.csv
- Occupation and sequence conservation (based on Shannon Entropy) per MSA column
ShannonEntropyConservation_occupency.pdf
,msa_stats/conservation_occupancy_scores.csv
- Cartoon representation of consensus secondary structure elements in the MSA
ConsensusSecondaryStructure_alignment.pdf
,msa_stats/SecondaryStructure_rawdata.csv
- Occurrence of secondary structure elements (Helix, Sheet, Loop) per MSA column
DSSPcodes_aligned.pdf
,msa_stats/DSSPcodes_frequencies.csv
- Pairwise sequence identity histogram (all-vs-all)
Resource log
Output: results/outFolder/nextflow_report_outName.html
, resources_outName.txt
- Log file created by Nextflow pipeline
- Shows execution times and resource usage per job
Execution log
Output: results/outFolder/run_id_time.nflog
- Created ONLY when using the launch file
- Reports all --flag settings
- Captures terminal output of Nextflow pipeline for error tracing
- Captures unique execution hash and flags for resuming the specific job