5 Tips, tricks and common issues - Bio2Byte/simsapiper GitHub Wiki

Data things

Sequence labels and the structure filenames must match exactly!
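A quick way to spot mismatches before launching is to compare the labels against the structure directory. This is a minimal sketch, not part of SIMSApiper: the function name `missing_structures` is hypothetical, and it assumes that a label is the first whitespace-delimited token of each FASTA header and that models are named `<label>.pdb`.

```shell
# Sketch: list sequence labels that have no matching structure file.
# Assumptions (not from the wiki): the label is the first token of the
# FASTA header, and structure files are named <label>.pdb.
missing_structures() {
    fasta="$1"
    structures_dir="$2"
    # Take each header line, strip the leading ">", keep the first token.
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | while read -r label; do
        if [ ! -f "$structures_dir/$label.pdb" ]; then
            echo "$label"
        fi
    done
}
```

For example, `missing_structures data/seqs/my_proteins.fasta data/structures` (the FASTA filename is a placeholder) prints one label per line for every sequence without a model, and prints nothing if all labels match.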
Modeling with ESMFold has low yield:
- Sequences longer than 400 residues cannot be modeled: try ColabFold to generate your own models
- ESM Atlas was asked to model too many sequences at once: resume the job (append -resume to the launch command)
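To find out which sequences exceed the length limit before modeling, something like the following could be used. It is a sketch under assumptions: the function name `long_sequences` is hypothetical, the default cutoff of 400 is taken from the note above, and the label is assumed to be the first token of the FASTA header.

```shell
# Sketch: print label and length (tab-separated) of every sequence
# longer than a cutoff (default 400 residues).
long_sequences() {
    fasta="$1"
    max_len="${2:-400}"
    awk -v max="$max_len" '
        # Report the previous record if it exceeded the cutoff.
        function report() { if (label != "" && len > max) print label "\t" len }
        /^>/ { report(); label = substr($1, 2); len = 0; next }
        # Sequence line: drop whitespace and carriage returns, then count.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { report() }
    ' "$fasta"
}
```

Sequences listed by `long_sequences data/seqs/my_proteins.fasta` (placeholder filename) are candidates for modeling with ColabFold instead.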

Pipeline things

Cancel a running Nextflow job: Ctrl + C
Pipeline failed to complete:
- to rerun the last job: append -resume to the launch command
- to rerun a specific job: check the last line of the *.nflog file to get its unique hash
```
nextflow run simsapiper.nf -resume 9ae6b81a-47ba-4a37-a746-cdb3500bee0f 
```
Attention: the last state will be permanently overwritten
All intermediate results are stored in unique subdirectories of the work directory
Find the directory hash for each step in the *.nflog file
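Picking the hash out of a log by eye is error-prone, so a small helper may be useful. This sketch assumes the hash appears in the log in the same UUID-like form as in the example command; the function name `last_session_id` is hypothetical.

```shell
# Sketch: print the last UUID-shaped string found in a log file.
# Assumption: the run hash is written to the *.nflog file in the form
# xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (lowercase hexadecimal).
last_session_id() {
    grep -Eo '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' "$1" | tail -n 1
}
```

The result can then be passed to `-resume`, for example `nextflow run simsapiper.nf -resume "$(last_session_id my_run.nflog)"`, where `my_run.nflog` is a placeholder and the remaining launch options are omitted.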
Run in the background: launch SIMSApiper in a screen
```
screen -S nextflowalign bash -c ./magic_align.sh
```
Press Ctrl + A, then Ctrl + D, to detach the screen and leave it running in the background
Launch file does not work:
- try:
```
chmod +x magic_align.sh
```
  and rerun.
- check for trailing spaces after \ in the launch file: there must not be any.
- On macOS, you may have to replace |& tee with >>
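Trailing whitespace after a line-continuation backslash is invisible in most editors. A minimal sketch for locating it (the function name is hypothetical):

```shell
# Sketch: print every line (with its line number) where a backslash is
# followed by whitespace at the end of the line. [[:space:]] also matches
# a carriage return, so Windows line endings are flagged as well.
find_trailing_space_after_backslash() {
    grep -nE '\\[[:space:]]+$' "$1"
}
```

Running `find_trailing_space_after_backslash magic_align.sh` prints nothing when the launch file is clean.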
SIMSApiper crashes at CD-Hit stage:
- try:
```
chmod +x bin/psi-cd-hit.pl
chmod +x bin/psi-cd-hit-local.pl
```
  and rerun.
- Shorten the sequence IDs to <= 30 characters
- Check whether you are trying to align fewer than 10 sequences. If so, do not use CD-Hit subsetting; choose --useSubsets instead.
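Both CD-Hit checks above can be done from the command line. This is a sketch with hypothetical function names; it assumes the sequence ID is the first whitespace-delimited token of the FASTA header and takes the 30-character limit from the note above.

```shell
# Sketch: print sequence IDs longer than a limit (default 30 characters).
long_ids() {
    fasta="$1"
    max_chars="${2:-30}"
    awk -v max="$max_chars" '/^>/ { id = substr($1, 2); if (length(id) > max) print id }' "$fasta"
}

# Sketch: count the sequences in a FASTA file (one header line each).
count_sequences() {
    grep -c '^>' "$1"
}
```

If `count_sequences` reports fewer than 10, the advice on subsetting above applies; any ID printed by `long_ids` needs shortening in both the FASTA file and the matching structure filename.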
SIMSApiper crashes at T-Coffee stage:
- Error contains "sap_pair error" and --model true is set: possibly an ESMFold prediction error
  - Remove the ESMFold model of the protein named in the error message from data/structures
    - Set --model false to add protein to the MSA based on sequence
    - Use --retrieve to fetch the AlphaFold model of the protein from AFDB
    - Generate model with ColabFold and add manually
  - Remove the protein from the data/seqs/*.fasta file to remove it from the dataset entirely
- Error contains "proba_pair": T-Coffee could not find the structures and uses a sequence-based alignment method
  - Use the complete path to the data/structures directory
  - Check protein model files
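One simple check of the model files is to look for files that contain no atom records at all. This sketch is only a first filter, not a structure validator; it assumes models are PDB-format files with a `.pdb` extension, and the function name is hypothetical.

```shell
# Sketch: list .pdb files in a directory that contain no ATOM records
# (for example empty or truncated downloads).
# Assumption: models are stored in PDB format with a .pdb extension.
suspicious_models() {
    structures_dir="$1"
    for model in "$structures_dir"/*.pdb; do
        [ -e "$model" ] || continue   # directory holds no .pdb files
        if ! grep -q '^ATOM' "$model"; then
            echo "$model"
        fi
    done
}
```

Files listed by `suspicious_models data/structures` are worth inspecting or regenerating before rerunning the alignment.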

HPC things

Install SIMSApiper in a shared location / HPC
- ensure that Nextflow and Apptainer are available and loaded
- we observed that the simsapiper directory and the data directory need to share a common root folder, e.g.:
  - nextflow run /opt/software/simsapiper/simsapiper.nf -profile server,withsingularity --data /home/user/simsatest/toy_example/data --magic does not work
  - nextflow run /home/user/simsapiper/simsapiper.nf -profile server,withsingularity --data /home/user/simsatest/toy_example/data --magic works
- provide a shared Apptainer cache location (--apptainerPath) to avoid keeping many copies of the GB-sized apptainer images

General advice

Learn how to adapt this pipeline using Nextflow here