5 Tips, tricks and common issues - Bio2Byte/simsapiper GitHub Wiki

Data things

  • Sequence labels and the structure filenames must match exactly!
  • Modeling with ESMFold has low yield:
    • Sequences longer than 400 residues cannot be modeled: try ColabFold to generate your own models
    • ESM Atlas was asked to model too many sequences at once: resume the job
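
The two data checks above can be sketched as small shell helpers. This is a minimal sketch, not part of SIMSApiper: the function names are hypothetical, and it assumes that a sequence label is the first whitespace-delimited token of a FASTA header and that models are stored as <label>.pdb in the structure directory.

```shell
# Hypothetical helpers, not shipped with SIMSApiper.
# Assumption: label = first token after '>' in the FASTA header,
# and each structure file is named <label>.pdb.

# Print every sequence label that has no identically named structure file.
missing_structures() {
    local fasta="$1" struct_dir="$2"
    grep '^>' "$fasta" | sed -e 's/^>//' -e 's/[[:space:]].*$//' |
    while IFS= read -r label; do
        [ -f "$struct_dir/$label.pdb" ] || printf '%s\n' "$label"
    done
}

# Print "label length" for every sequence longer than max_len residues
# (default 400, the ESMFold limit mentioned above).
long_sequences() {
    local fasta="$1" max_len="${2:-400}"
    awk -v max="$max_len" '
        /^>/ { if (name != "" && len > max) print name, len
               name = substr($1, 2); len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { if (name != "" && len > max) print name, len }
    ' "$fasta"
}

# Example calls (my_proteins.fasta is a placeholder file name):
# missing_structures data/seqs/my_proteins.fasta data/structures
# long_sequences data/seqs/my_proteins.fasta
```

Labels reported by missing_structures need a renamed or newly generated model; sequences reported by long_sequences are candidates for ColabFold.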

Pipeline things

  • Cancel a running Nextflow job: Ctrl + C

  • Pipeline failed to complete:

    • to rerun the last job: append -resume to the launch command
    • to rerun a specific job: check the last line of the *.nflog file to get its unique hash
      nextflow run simsapiper.nf -resume 9ae6b81a-47ba-4a37-a746-cdb3500bee0f 
      

    Attention: the last state will be permanently overwritten

  • All intermediate results are stored in unique subdirectories of the work directory
    Find the directory hash for each step in *.nflog

  • Run in the background: launch SIMSApiper in a screen

    screen -S nextflowalign bash -c ./magic_align.sh
    

    Hit Ctrl + A and Ctrl + D to put it in the background

  • Launch file does not work:

    • try:
      chmod +x magic_align.sh
      
      and rerun.
    • check for spaces after \ in the launch file; there cannot be any.
    • On macOS, you may have to replace |& tee with >>
  • SIMSApiper crashes at CD-Hit stage:

    • try:
      chmod +x bin/psi-cd-hit.pl
      chmod +x bin/psi-cd-hit-local.pl
      
      and rerun.
    • Shorten the sequence IDs to <= 30 characters
    • Check whether you are trying to align fewer than 10 sequences. In that case, do not use CD-Hit subsetting; choose --useSubsets instead.
  • SIMSApiper crashes at T-Coffee stage:

    • Error contains "sap_pair error" and --model true is set: possibly an ESMFold prediction error
      • Remove the ESMFold model of the protein named in the error message from data/structures
        • Set --model false to add protein to the MSA based on sequence
        • Use --retrieve to fetch the AF model of the protein from AFDB
        • Generate model with ColabFold and add manually
      • Remove the protein from the data/seqs/*.fasta file to remove it from the dataset entirely
    • Error contains "proba_pair": T-Coffee could not find the structures and uses a sequence-based alignment method
      • Use the complete path to the data/structures directory
      • Check protein model files
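
Several of the checks in this section can be scripted. The helpers below are a hedged sketch with hypothetical function names. They assume that the session hash has the UUID shape shown in the -resume example above and that a sequence ID is the first token of a FASTA header.

```shell
# Hypothetical helpers, not shipped with SIMSApiper.

# Print the session hash found on the last line of a *.nflog file.
# Assumption: the hash looks like 9ae6b81a-47ba-4a37-a746-cdb3500bee0f.
session_id_from_log() {
    tail -n 1 "$1" |
    grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
}

# Print "line_number:line" for launch-file lines where whitespace
# follows the trailing backslash (this breaks the line continuation).
lines_with_space_after_backslash() {
    grep -nE '\\[[:space:]]+$' "$1"
}

# Print sequence IDs longer than max characters (default 30).
long_ids() {
    local fasta="$1" max="${2:-30}"
    awk -v max="$max" '/^>/ { id = substr($1, 2)
                              if (length(id) > max) print id }' "$fasta"
}

# Example calls (run.nflog and my_proteins.fasta are placeholder names):
# nextflow run simsapiper.nf -resume "$(session_id_from_log run.nflog)"
# lines_with_space_after_backslash magic_align.sh
# long_ids data/seqs/my_proteins.fasta
```

An empty result from the last two helpers means the launch file and the sequence IDs pass these particular checks; it does not rule out other causes of a crash.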

HPC things

  • Install SIMSApiper in a shared location / HPC
    • ensure that Nextflow and Apptainer are available and loaded
    • we observed that simsapiper and the data directory need to share a common root folder, e.g.:
      • does not work: nextflow run /opt/software/simsapiper/simsapiper.nf -profile server,withsingularity --data /home/user/simsatest/toy_example/data --magic
      • works: nextflow run /home/user/simsapiper/simsapiper.nf -profile server,withsingularity --data /home/user/simsatest/toy_example/data --magic
    • provide a shared Apptainer cache location (--apptainerPath) to avoid keeping many copies of the GB-sized Apptainer images
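
The common-root observation above can be checked before launching. The helper below is a rough sketch with a hypothetical name. It assumes that "common root" means a shared leading directory deeper than /, which the wiki does not define precisely.

```shell
# Hypothetical helper, not shipped with SIMSApiper.
# Print the longest shared leading directory of two absolute paths.
# Assumption: a result of "/" means the paths share no common root folder.
common_root() {
    local a="${1#/}" b="${2#/}" prefix="" head_a head_b
    while [ -n "$a" ] && [ -n "$b" ]; do
        head_a="${a%%/*}"
        head_b="${b%%/*}"
        [ "$head_a" = "$head_b" ] || break
        prefix="$prefix/$head_a"
        # Drop the component just compared; stop when none are left.
        case "$a" in */*) a="${a#*/}" ;; *) a="" ;; esac
        case "$b" in */*) b="${b#*/}" ;; *) b="" ;; esac
    done
    printf '%s\n' "${prefix:-/}"
}

# Example calls with the paths from the list above:
# common_root /opt/software/simsapiper/simsapiper.nf /home/user/simsatest/toy_example/data
# common_root /home/user/simsapiper/simsapiper.nf /home/user/simsatest/toy_example/data
```

With the two example pairs, the first call prints / (the layout that did not work) and the second prints /home/user (the layout that worked).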

General advice

  • Learn how to adapt this pipeline using Nextflow here