5 Tips, tricks and common issues - Bio2Byte/simsapiper GitHub Wiki
Data things
- Sequence labels and the structure filenames must match exactly!
- Modeling with ESMFold has a low yield:
  - Sequences longer than 400 residues cannot be modeled: try ColabFold to generate your own models
  - ESM Atlas was asked to model too many sequences at once: resume the job
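
A mismatch between sequence labels and structure filenames can be caught before launching. The sketch below lists labels that have no structure file. It assumes the label is the FASTA header text up to the first whitespace and that models are stored as `<label>.pdb`; both are illustrative assumptions, and `missing_structures` is a hypothetical helper, not part of SIMSApiper.

```shell
# List FASTA labels that have no matching structure file.
# Assumptions (not from the SIMSApiper docs): the label is the header text
# after '>' up to the first whitespace, and models are named <label>.pdb.
missing_structures() {
    # $1 = directory with *.fasta files, $2 = directory with structure files
    cat "$1"/*.fasta | grep '^>' | sed 's/^>//; s/[[:space:]].*$//' | sort -u |
    while read -r label; do
        # Report every label without an exactly matching file name.
        [ -e "$2/$label.pdb" ] || echo "$label"
    done
}

# Example call (run from the directory that contains data/):
# missing_structures data/seqs data/structures
```

An empty output means every label found a file under these assumptions. Adjust the extension if your models use a different format.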
Pipeline things
- Cancel a running Nextflow job: Ctrl + C
- Pipeline failed to complete:
  - To rerun the last job: append -resume to the launch command
  - To rerun a specific job: check the last line of the *.nflog file to get the unique hash, then pass it to -resume:
    nextflow run simsapiper.nf -resume 9ae6b81a-47ba-4a37-a746-cdb3500bee0f
    Attention: the last state will be permanently overwritten
- All intermediate results are stored in unique subdirectories of the work directory. Find the directory hash for each step in the *.nflog file.
- Run in the background: launch SIMSApiper in a screen
  screen -S nextflowalign bash -c ./magic_align.sh
  Hit Ctrl + A followed by Ctrl + D to put it in the background
- Launch file does not work:
  - Try: chmod +x magic_align.sh and rerun.
  - Check for spaces after \ in the launch file; there cannot be any.
  - On macOS, you may have to replace |& tee with >>
- SIMSApiper crashes at CD-Hit stage:
  - Try: chmod +x bin/psi-cd-hit.pl and chmod +x bin/psi-cd-hit-local.pl, then rerun.
  - Shorten the sequence IDs to <= 30 characters
  - Check whether you are trying to align fewer than 10 sequences. If so, do not use CD-Hit subsetting; choose --useSubsets instead.
- SIMSApiper crashes at T-Coffee stage:
  - Error contains "sap_pair error" and --model true: possibly an ESMFold prediction error
    - Remove the ESMFold model of the protein named in the error message from data/structures, then choose one of:
      - Set --model false to add the protein to the MSA based on sequence
      - Use --retrieve to fetch the AF model of the protein from AFDB
      - Generate a model with ColabFold and add it manually
      - Remove the protein from the data/seqs/*.fasta file to remove it from the dataset entirely
  - Error contains "proba_pair": T-Coffee could not find the structures and fell back to a sequence-based alignment method
    - Use the complete path to the data/structures directory
    - Check the protein model files
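
The advice to check the protein model files can be partly automated. The sketch below flags models that are empty or contain no ATOM records, which are two common signs of a failed prediction or download. It assumes PDB-format files ending in `.pdb`; `check_models` is a hypothetical helper, not part of SIMSApiper, and a clean result does not prove that a model is usable.

```shell
# Flag structure files that are empty or contain no ATOM records.
# Assumption (not from the SIMSApiper docs): models are PDB-format
# files with a .pdb extension.
check_models() {
    # $1 = directory with structure files
    for f in "$1"/*.pdb; do
        [ -e "$f" ] || continue            # the glob matched nothing
        if [ ! -s "$f" ]; then
            echo "EMPTY ${f##*/}"          # zero-byte file
        elif ! grep -q '^ATOM' "$f"; then
            echo "NO_ATOMS ${f##*/}"       # no coordinate records found
        fi
    done
}

# Example call with a complete path, as recommended above:
# check_models "$(pwd)/data/structures"
```

Files reported here are candidates for removal from data/structures or for remodeling.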
HPC things
- Install SIMSApiper in a shared location / HPC
  - Ensure that Nextflow and Apptainer are available and loaded
  - We observed that simsapiper and the data directory need to share a common root folder, e.g.:
    Does not work:
    nextflow run /opt/software/simsapiper/simsapiper.nf -profile server,withsingularity --data /home/user/simsatest/toy_example/data --magic
    Works:
    nextflow run /home/user/simsapiper/simsapiper.nf -profile server,withsingularity --data /home/user/simsatest/toy_example/data --magic
  - Provide a shared Apptainer cache location (--apptainerPath) to avoid keeping many copies of the GB-sized Apptainer images
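
The common-root observation can be turned into a quick pre-launch check. The sketch below compares the first path component of two absolute paths. Reading "common root folder" as "same top-level directory" is an assumption made for illustration; the wiki does not define it more precisely, and `same_top_dir` is a hypothetical helper.

```shell
# Succeed if two absolute paths start with the same top-level directory.
# Assumption: "common root folder" is read here as the first path
# component (e.g. /home); this interpretation is not confirmed upstream.
same_top_dir() {
    a=${1#/}; b=${2#/}             # drop the leading slash
    [ "${a%%/*}" = "${b%%/*}" ]    # compare the first component only
}

# Example with the paths from the working launch command:
# same_top_dir /home/user/simsapiper/simsapiper.nf \
#              /home/user/simsatest/toy_example/data && echo "same root"
```

With the paths from the failing command (/opt/... and /home/...) the function returns a non-zero status, matching the behavior reported above.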
General advice
- Learn how to adapt this pipeline using Nextflow here