Tutorial2‐DataFiltering - Pas-Kapli/CoME-Tutorials GitHub Wiki
In contemporary phylogenomic analysis, managing hundreds of loci is common, making it impractical to evaluate each one individually. This often leads to the automation of analysis processes and a reliance on pipelines to filter data before conducting phylogenetic analyses. However, it's crucial to comprehensively understand each analysis step and its implications for our data. In this tutorial, we'll track our data through various alignment filtering stages and evaluate the impact of each on the phylogenetic signal. Our approach mirrors that of Tan et al., 2015 highlighting that conventional filtering pipelines may not consistently enhance our data.
We will use 98 samples corresponding to 21 phyla spanning the animal phylogeny. All samples used are publicly available in NCBI and other sources. The total amount of loci we will be using is 978. (A brief description of how the data were generated is provided here.).
Step 1: We will first get a sense of the data, how many loci are there, how many samples per locus etc.
Step 2: We will monitor how 3 different filtering options affect the phylogenetic signal in these data.
- mild filtering for removing extreme gappyness
- mainstream site-filtering with trimal
- outlier filtering with PhylteR
We will evaluate the effect of each of these steps using the average monophyly score. The monophyly score measures how well, on average, each of the gene-trees reconstructs each of the phyla (i.e. clades that we are certain of their monophyly) included in the dataset. Here is the assignment of samples to phyla for this dataset: phyla-groups.txt
Ultimately we will fill out the following table:
Phase | Number of Trees | Average Monophyly Score |
---|---|---|
Original | ? | ? |
Mild filtering | ? | ? |
Trimal | ? | ? |
PhylteR | ? | ? |
The exercise also serves for practising bash commands and scripting with files and analyses related to phylogenomics. For several of the tasks, scripting is required to find the answer - in all cases, it will be possible to do it with bash otherwise a script will be provided. The commands are given, but trying to understand the commands or even constructing them on your own is recommended. In most cases, there is more than one way of completing the task.
Setup a directory for this tutorial and download the mafft alignments, the necessary python3 scripts and the "phyla-groups.txt" file that contains the grouping of taxa into phyla:
mkdir practical2
cd practical2
#Download the assignment of samples to phyla
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/data-filtering/phyla-groups.txt
mkdir mafft
cd mafft
#Download the mafft alignments and extract them from the compressed file
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/alignments/mafft.tar.gz
tar -xvzf mafft.tar.gz
rm mafft.tar.gz
cd ../
mkdir scripts
cd scripts
#Download a set of scripts we will need later
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/scripts/scripts.tar.gz
tar -xvzf scripts.tar.gz
rm scripts.tar.gz
Command
cd ../mafft
for i in *mafft; do grep ">" $i ; done | sort -u | wc -l
Command
for i in *mafft; do echo -n "$i "; grep ">" $i | wc -l; done | sort -n -k2
Let's get rid of the loci with less than five sequences:
mkdir discarded
mv EOG091G04FV.mafft EOG091G0BWJ.mafft EOG091G0I85.mafft EOG091G18YA.mafft EOG091G05N4.mafft discarded
Hint: you will need to use the "phyla-groups.txt" that you downloaded earlier and use a nested for loop.
Command
for j in *mafft; do for i in `grep ">" $j | cut -f2 -d">"`; do grep $i ../phyla-groups.txt; done | cut -f1 -d$'\t' | sort -u | wc -l; done | sort -n
(..takes a couple of minutes)
Use this alternative if you want the filename to show up:
for j in *mafft; do echo -n "$j "; for i in `grep ">" $j | cut -f2 -d">"`; do grep $i ../phyla-groups.txt; done | cut -f1 -d$'\t' | sort -u | wc -l; done | sort -n -k 2
To calculate the average monophyly score we need the gene trees, therefore we need to perform a tree search. Even with the simplest tree search command shown below it will take hours to infer all 978 trees. You can do one of the inferences and download the remaining of trees with the wget command provided below
cd ..
mkdir original-trees
cd original-trees
raxml-ng --search1 --msa ../mafft/EOG091G1CLN.mafft --model WAG+G --prefix EOG091G1CLN.mafft
rm *
#Download the pre-calculated ML trees
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/trees/original-trees.tar.gz
tar -xvzf original-trees.tar.gz
rm original-trees.tar.gz
Tip: if you have access to a server you can parallelise the gene-tree searches with the bash command "parallel". Example command:
parallel --jobs 15 "raxml-ng --search1 --msa {}.mafft --model WAG+G" < Accs.txt
We can now calculate the monophyly score
for i in *bestTree; do python3 ../scripts/monophyly-score.py $i ../phyla-groups.txt; done > ../monophyly_scores-original.txt
cd ../
awk '{ total += $3 } END { if (NR > 0) print total / NR }' monophyly_scores-original.txt
Takes a couple of minutes..
In case you had trouble creating it here is the monophyly_scores-original.txt
cd mafft
for i in *mafft; do python3 ../scripts/filter_gappy_taxa.py $i 0.8; done
###Remove loci that contain < 10 phyla representatives.
mkdir discarded2
for j in *nogappy; do echo -n "$j "; for i in `grep ">" $j | cut -f2 -d">"`; do grep $i ../phyla-groups.txt; done | cut -f1 -d$'\t' | sort | uniq -c | wc -l; done | awk '$2 < 10' | cut -f 1 -d" " | paste -s -d" " > files-to-move.txt
mv `cat files-to-move.txt` discarded2/
rm files-to-move.txt
for i in *nogappy; do python3 ../scripts/filter_alignment-sites.py $i 1; done
rm *nogappy
cd ..
mkdir mild-filtering-trees
cd mild-filtering-trees
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/trees/mild-filt-trees.tar.gz
tar -xvzf mild-filt-trees.tar.gz
rm mild-filt-trees.tar.gz
for i in *bestTree; do python3 ../scripts/monophyly-score.py $i ../phyla-groups.txt; done > ../monophyly_scores-mild-filtering.txt
cd ..
awk '{ total += $3 } END { if (NR > 0) print total / NR }' monophyly_scores-mild-filtering.txt
In case you had trouble creating it here is the monophyly_scores-original.txt
Alignment site filtering is most often based on ad hoc criteria regarding alignment quality such as gappyness and sequence similarity or by retaining only the alignment positions that are robust to changes to alignment parameters. Nevertheless, reports on the impact of alignment trimming on the quality of downstream phylogenetic analysis vary and hence trimming should be used cautiously.
Here we will use a popular alignment tool caller Trimal and we will try to understand the effect of this filtering to the phylogenetic signal of our gene-trees based on the monophyly score.
cd mafft
for i in *mafft; do trimal -automated1 -in $i -out $i.trimal; done
cd ../
mkdir trimal-trees
cd mkdir trimal-trees
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/trees/trimal-trees.tar.gz
tar -xvcf trimal-trees.tar.gz
for i in *.bestTree; do python3 ../scripts/monophyly-score.py $i ../phyla-groups.txt; done > ../monophyly_scores-trimal.txt
cd ../
awk '{ total += $3 } END { if (NR > 0) print total / NR }' monophyly_scores-trimal.txt
In case you had trouble creating it here is the monophyly_scores-original.txt
Besides alignment filtering, there is often a step of identifying outlier sequences either due to artifactual (e.g. contaminations, misalignments) or real biological reasons (biases in state composition, rate of evolution etc.). There are several pipelines aiming to tackle such problems, however, the effect of data removal is not well understood.
We will perform a filtering step with PhylteR and we will try to understand the effect of this filtering to the phylogenetic signal of our gene-trees based on the monophyly score.
cd mild-filtering-trees
cat *bestTree > all-trees-mild-filtering.txt
R
within R
> library(phylter)
> trees <- ape::read.tree("all-trees-mild-filtering.txt")
> results <- phylter(trees, bvalue = 0, distance = "patristic", k = 3, k2 = 3, Norm = "median", Norm.cutoff = 0.001, gene.names = NULL, test.island = TRUE, verbose = TRUE, stop.criteria = 1e-5, InitialOnly = FALSE, normalizeby = "row", parallel = TRUE)
> results$Final$Outliers
> write.phylter(results, file = "phylter.out")
ctl+d
cd ../
mkdir phylteR-trees
cd mkdir phylteR-trees
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/trees/phylteR-trees.tar.gz
tar -xvzf phylteR-trees.tar.gz
for i in *bestTree; do python3 ../scripts/monophyly-score.py $i ../phyla-groups.txt; done > ../monophyly_scores-phylteR.txt
cd ../
awk '{ total += $3 } END { if (NR > 0) print total / NR }' monophyly_scores-phylteR.txt
In case you had trouble creating it here is the monophyly_scores-original.txt