Troubleshooting‐Tutorial - Pas-Kapli/CoME-Tutorials GitHub Wiki

Introduction

Scientific computer software, pipelines, and scripts used in phylogenomics are written by fellow researchers with a specific task or set of tasks in mind. As users, prior to integrating any such tool in our analysis it's imperative that we:

  1. Gain a comprehensive understanding of the tool's intended purpose and verify that it aligns with our research objectives.
  2. Familiarise ourselves with the input files (data types, control files, formats), the options and the command line used, and have a clear expectation of what the output files should be.
  3. Keep in mind that tools are often NOT exhaustively tested. Error messages can therefore be unhelpful, non-specific, or absent altogether and, in the worst case, the output can be wrong without any warning. To ensure correct usage of a tool, it is advisable to (i) read the relevant documentation (manuals, INFO files, tutorials) and run any test cases provided with the tool, and (ii) consult relevant user forums, e.g., Google Groups or GitHub pages, where we can find solutions to similar problems reported by other users or ask the developers directly about the problems we are facing.
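
Before reading a tool's documentation, it can help to confirm which executable the shell will actually run. The following is a minimal sketch, assuming a bash shell; `check_tool` is a hypothetical helper name written for illustration and is not part of any of the programs used in this tutorial.

```shell
#!/bin/bash

# check_tool NAME: report where NAME is found on the PATH, or fail.
# (check_tool is a hypothetical helper, not part of any tool used here.)
check_tool() {
    local tool="$1"
    local location
    if location=$(command -v "$tool"); then
        echo "$tool found at: $location"
    else
        echo "$tool not found on PATH" >&2
        return 1
    fi
}

# Example: loop over the programs used in the exercises.
# "|| true" keeps the loop going when a program is missing.
for tool in muscle mafft raxml-ng mptp mcmctree; do
    check_tool "$tool" || true
done
```

If a program is reported as missing, either add its directory to your PATH or call it with its full path in the commands of the exercises.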

Troubleshooting tutorial

In this tutorial, we present nine exercises, each deliberately designed to include specific problems for you to identify and, if possible, resolve. The primary objective is to cultivate a troubleshooting mindset for common issues encountered in phylogenomics analyses.

Most exercises are based on popular software that will often be used throughout the workshop (e.g., Muscle, MAFFT, RAxML-NG, PAML), as well as scripts and lesser-known programs such as mPTP. We'll address the first two problems together, followed by independent work on Exercises 3 to 7. Exercises 8 and 9 are more challenging and optional.

Demo Exercises:

Exercise 1
You want to estimate a time-calibrated phylogeny of several primates using mcmctree, a program in paml. The program requires a control file, which specifies the names of several input files (such as the file with the tree topology), model choices, priors, and MCMC parameters. A detailed description can be found in the manual. There are two ways to download the files. You can clone the GitHub repository:
git clone https://github.com/Pas-Kapli/CoME-Tutorials.git
cd CoME-Tutorials

If you clone the repository, it will include all of the files for this tutorial, so you do not need to repeat this step for each exercise. Alternatively, you can download the necessary files and then extract them from the tar file.

wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_1.tar.gz
tar -xzvf ex_1.tar.gz

You try to run the program using the code below, but you get an error message. (You may need to change the path to the mcmctree executable depending on how you installed the program.)

mcmctree mcmctree.ctl

The wiki for mcmctree can be found at https://github.com/abacus-gene/paml/wiki/MCMCtree. Mcmctree is part of the paml package; the wiki for all of paml can be found here.

Determine the cause of the error and correct it to estimate a time-calibrated tree.
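
When a program stops with an error, keeping its full screen output and exit status makes the problem easier to study and to report. The following is a minimal sketch, assuming a bash shell; `run_logged` and the log file name `mcmctree.log` are hypothetical names chosen for illustration.

```shell
#!/bin/bash

# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, saves its stdout and stderr in LOGFILE,
# prints the exit status, and returns that same status.
# (run_logged is a hypothetical helper written for this tutorial.)
run_logged() {
    local log="$1"
    shift
    local status=0
    "$@" > "$log" 2>&1 || status=$?
    echo "exit status: $status (output saved in $log)"
    return "$status"
}

# Example usage (uncomment once mcmctree is installed):
# run_logged mcmctree.log mcmctree mcmctree.ctl
# head -n 20 mcmctree.log
```

A non-zero exit status usually indicates failure, but not every program sets it reliably, so it is worth reading the saved output as well.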

Exercise 2
You want to run a species delimitation analysis using mPTP, a tool that clusters the tips of a phylogeny into putative species. The only input required is a binary phylogenetic tree. Check the documentation here.

A basic command for performing the delimitation is the following:

mptp --ml --multi --tree_file Gallotia.con.tre --output_file delimitation --outgroup Lacerta_lepida

"Gallotia.con.tre" is the input tree file and "Lacerta_lepida" is one of the outgroups included in the dataset.

You can download the input tree file with the following command in your working directory:

wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_2/Gallotia.con.tre

If the analysis runs correctly, we should see something like the following printed on the screen:

mptp 0.2.5_linux_x86_64, 15GB RAM, 8 cores
https://github.com/Pas-Kapli/mptp
Parsing tree file...
Loaded unrooted tree...
Converting to rooted tree...
Number of edges greater than minimum branch length: 93 / 128
Score Null Model: 267.532663
Best score for multi coalescent rate: 379.572833
LRT computed p-value: 0.000000
Writing delimitation file delimitation.txt ...
Number of delimited species: 10
Creating SVG delimitation file delimitation.svg ...
Done...

There should also be two output files in your working directory: delimitation.svg and delimitation.txt.

If that's not your outcome, can you figure out what the problem might be?
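
A quick way to compare the outcome with the expectation is to test whether each expected output file exists and is not empty. The following is a minimal sketch, assuming a bash shell; `check_outputs` is a hypothetical helper name, and the file names in the usage example are the two outputs named above.

```shell
#!/bin/bash

# check_outputs FILE...: report which expected files exist and are non-empty.
# Returns 1 if at least one file is missing or empty, 0 otherwise.
# (check_outputs is a hypothetical helper written for this tutorial.)
check_outputs() {
    local missing=0
    local f
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "OK       $f"
        else
            echo "MISSING  $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage, run from the directory where mptp was started:
# check_outputs delimitation.txt delimitation.svg
```

The same check can be reused in later exercises by passing the output file names listed there.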

Work on your own

Work through the following exercises at your own pace. We know there are errors! That's the point!

You can use:

  • Manuals
  • Google
  • Google Groups/GitHub
  • Original papers (might not be that helpful depending on the issue)
  • Your neighbour (if you've tried the manuals and Google)

Note that while intentional problems are embedded within each exercise, additional and varied issues may arise due to differences in operating systems, installations, and executable program paths. This diversity adds depth to the practical.

Exercise 3
You have several sequence files created with data from the Thousand Genomes Project that you first need to combine into one file and then align. The sequence files are listed in sequenceFiles.txt. Download the files and untar them with the following commands.
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_3.tar.gz
tar -xzvf ex_3.tar.gz

Take a look at the list of files in sequenceFiles.txt and some of the individual sequence files. The script below is in the file example3.sh. It is supposed to combine the files and then align the sequences using muscle. The script does not produce your alignment, msa_align.fa. Note that you may have to change the path to muscle depending on where you installed it.

#!/bin/bash

# This is included so you don't duplicate all the sequences
# if you run this script multiple times. 

rm -f sequenceToAlign.txt

for seq in $(cat sequenceFiles.txt)
do
        # Add sequence to file
        cat ${seq} >> sequenceToAlign.txt

done

sed -i 's/\t/\n/g' sequenceToAlign.txt

muscle -align sequenceToAlign.txt -output msa_align.fa &> outMuscle

Determine why the script does not produce the expected output.
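
When a script reads file names from a list, one general first check is whether every listed file can actually be read. The following is a minimal sketch, assuming a bash shell and one file name per line in the list; `check_listed_files` is a hypothetical helper name. It is a generic diagnostic and does not necessarily point to the problem in this exercise.

```shell
#!/bin/bash

# check_listed_files LIST: for every non-empty line in LIST,
# report the file if it cannot be read, then print a count.
# (check_listed_files is a hypothetical helper written for this tutorial.)
check_listed_files() {
    local list="$1"
    local missing=0
    local f
    # "|| [ -n "$f" ]" also handles a last line without a trailing newline.
    while IFS= read -r f || [ -n "$f" ]; do
        [ -z "$f" ] && continue
        if [ ! -r "$f" ]; then
            echo "cannot read: $f"
            missing=$((missing + 1))
        fi
    done < "$list"
    echo "$missing file(s) missing or unreadable"
}

# Example usage:
# check_listed_files sequenceFiles.txt
# head -n 2 "$(head -n 1 sequenceFiles.txt)"   # peek at the first listed file
```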

Exercise 4
You want to create a multiple sequence alignment using the sequences listed in sequenceFiles.txt, as in Exercise 3. Download the files and untar them using the following commands.
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_4.tar.gz
tar -xzvf ex_4.tar.gz

You run the script example4.sh, shown below. You notice a warning message in outMuscle.

#!/bin/bash

# This is so the script can be run multiple times
# without needing to remove the sequence file 
# created previously.
rm -f sequenceToAlign.txt

for seq in $(cat sequenceFiles.txt)
do

        # Add sequence to file
        cat ${seq} >> sequenceToAlign.txt

done

# Formatting the sequences for muscle
sed -i 's/\t/\n/g' sequenceToAlign.txt

muscle -align sequenceToAlign.txt -output msa_align.fa &> outMuscle

Determine the cause of the warning message and how you would run the program without warnings.
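
Because the script redirects all of muscle's output to outMuscle, warnings never appear on the screen and have to be searched for in that file. The following is a minimal sketch, assuming a bash shell and that the messages of interest contain the words "warning" or "error" in any letter case; `show_problems` is a hypothetical helper name.

```shell
#!/bin/bash

# show_problems LOG: print the lines of LOG (with line numbers) that
# mention "warning" or "error", ignoring letter case.
# (show_problems is a hypothetical helper written for this tutorial.)
show_problems() {
    local log="$1"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    grep -n -i -E 'warning|error' "$log" \
        || echo "no warnings or errors found in $log"
}

# Example usage after running example4.sh:
# show_problems outMuscle
```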

Exercise 5

You want to infer a phylogeny based on COI sequences for the genus Gallotia.

Download the relevant dataset with the following command:

mkdir exe5
cd exe5
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_5/Gallotia.COI.fasta

Try inferring a gene tree with the following raxml-ng command:

raxml-ng --search1 --msa Gallotia.COI.fasta --model GTR+G 

If the analysis runs correctly, you should get the following output files in your working directory:

Gallotia.COI.raxml.bestModel Gallotia.COI.raxml.log Gallotia.COI.raxml.reduced.phy Gallotia.COI.raxml.bestTree Gallotia.COI.raxml.rba Gallotia.COI.raxml.startTree

If that's not the case, can you identify what the problem is?

Exercise 6

You want to run a bash script that executes a chain of other tools as follows:

  1. It downloads a fasta file
  2. It aligns the sequences
  3. It runs a raxml phylogenetic inference
  4. It performs a species delimitation with mPTP

Download the script with:

mkdir exe6
cd exe6
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_6/bash-pipe.sh

Make the script executable:

chmod u+x bash-pipe.sh

When you run the script, it should print the following if it completes successfully:

Step1: Downloading data.. This alignment contains 65 sequences
Step2: Aligning sequences with mafft..
Step3: Running RAxML..
Step4: Running Species delimitation with mptp:
Final result: The number of delimited species is 10

Can you identify what went wrong?
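
For a script that chains several tools, bash's built-in tracing (`bash -x`) prints each command before it runs, which helps to locate the step that fails. The following is a minimal sketch, assuming a bash shell; `trace_script` and the default trace file name `trace.log` are hypothetical names chosen for illustration.

```shell
#!/bin/bash

# trace_script SCRIPT [TRACEFILE]
# Runs SCRIPT with "bash -x" so that every command is echoed (prefixed
# with "+") before it is executed. The trace is written to stderr, so it
# is saved in TRACEFILE together with any error messages from the tools.
# (trace_script is a hypothetical helper written for this tutorial.)
trace_script() {
    local script="$1"
    local trace="${2:-trace.log}"
    local status=0
    bash -x "$script" 2> "$trace" || status=$?
    echo "exit status: $status (trace saved in $trace)"
    return "$status"
}

# Example usage:
# trace_script bash-pipe.sh
# less trace.log
```

Reading the trace from the top and comparing each step with the expected messages above narrows down where the pipeline first deviates.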

Exercise 7
You want to estimate a time-calibrated phylogeny of HIV sequences using mcmctree, a program in paml. First, download and untar the necessary files.
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_7.tar.gz
tar -xzvf ex_7.tar.gz

You try to run the program using the code below, but you get an error message. Again, you may need to change the path to the mcmctree executable depending on how you installed the program. The manuals for paml and mcmctree are listed in Exercise 1.

mcmctree mcmctree.ctl

Determine the cause of the error and correct it to estimate a time-calibrated tree.

Challenge Problems

Exercise 8
You want to create a maximum likelihood phylogeny using raxml-ng with data from the Thousand Genomes Project. Download and untar the files.
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_8.tar.gz
tar -xzvf ex_8.tar.gz

You run the script, whose contents are shown below, which combines all the sequences into one file and then runs raxml-ng. The program runs, but produces many warning messages.

#!/bin/bash

# This is so you can run the script multiple times
# with adding the sequences multiple times
rm -f msa.txt

for seqfile in $(cat sequenceFiles.txt)
do
	name=$(cat $seqfile | awk '{ print $1}')

	# Check to make sure sequence isn't empty
	if [[  $(cat $seqfile | awk '{print $2}' | grep [^-]) ]]; then

		seq=$(cat $seqfile | awk '{print $2}')

	fi

	# Add sequence to file
	echo ${name} >> msa.txt
	echo ${seq} >> msa.txt


done


raxml-ng --msa msa.txt --model JC --msa-format FASTA

Some of the warning messages reflect the variation in the data, while others are caused by a mistake. What is the mistake?
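
When a script assembles an alignment file, it is worth inspecting that file before blaming the downstream program. The following is a minimal sketch, assuming a bash shell with awk available and a FASTA-style file in which header lines start with ">"; `fasta_summary` is a hypothetical helper name. It only counts lines and does not validate the sequences themselves.

```shell
#!/bin/bash

# fasta_summary FILE: count header lines (starting with ">") and
# non-empty lines that are not headers in a FASTA-style file.
# (fasta_summary is a hypothetical helper written for this tutorial.)
fasta_summary() {
    local fasta="$1"
    awk '
        /^>/ { headers++; next }   # header line
        NF   { seqlines++ }        # any other non-empty line
        END  { printf "headers: %d\nsequence lines: %d\n", headers, seqlines }
    ' "$fasta"
}

# Example usage after running the script:
# fasta_summary msa.txt
# head -n 4 msa.txt | cut -c 1-60   # peek at the first records
```

Comparing these counts with the number of files in sequenceFiles.txt gives a first indication of whether the combined file has the expected shape.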

Exercise 9

You want to infer a phylogeny using the small alignment "seq-new.fasta".

mkdir exe9
cd exe9
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/ex_9/seq-new.txt

Try inferring a gene tree with the following command:

raxml-ng --search1 --msa seq-new.fasta --model GTR+G --force

If the analysis runs correctly, you should get the following output files in your working directory: seq-new.raxml.bestModel seq-new.raxml.log seq-new.raxml.reduced.phy seq-new.raxml.bestTree seq-new.raxml.rba seq-new.raxml.startTree

If that's not the case, can you identify what the problem is?
