Tutorial2‐DataFiltering - Pas-Kapli/CoME-Tutorials GitHub Wiki

Introduction

In contemporary phylogenomic analysis, managing hundreds of loci is common, making it impractical to evaluate each one individually. This often leads to the automation of analysis processes and a reliance on pipelines to filter data before conducting phylogenetic analyses. However, it's crucial to comprehensively understand each analysis step and its implications for our data. In this tutorial, we'll track our data through various alignment filtering stages and evaluate the impact of each on the phylogenetic signal. Our approach mirrors that of Tan et al., 2015 highlighting that conventional filtering pipelines may not consistently enhance our data.

Overview of exercise:

We will use 98 samples corresponding to 21 phyla spanning the animal phylogeny. All samples used are publicly available in NCBI and other sources. The total amount of loci we will be using is 978. (A brief description of how the data were generated is provided here.).

Step 1: We will first get a sense of the data, how many loci are there, how many samples per locus etc.

Step 2: We will monitor how 3 different filtering options affect the phylogenetic signal in these data.

  1. mild filtering for removing extreme gappyness
  2. mainstream site-filtering with trimal
  3. outlier filtering with PhylteR

We will evaluate the effect of each of these steps using the average monophyly score. The monophyly score measures how well, on average, each of the gene-trees reconstructs each of the phyla (i.e. clades that we are certain of their monophyly) included in the dataset. Here is the assignment of samples to phyla for this dataset: phyla-groups.txt

Ultimately we will fill out the following table:

Phase Number of Trees Average Monophyly Score
Original ? ?
Mild filtering ? ?
Trimal ? ?
PhylteR ? ?

The exercise also serves for practising bash commands and scripting with files and analyses related to phylogenomics. For several of the tasks, scripting is required to find the answer - in all cases, it will be possible to do it with bash otherwise a script will be provided. The commands are given, but trying to understand the commands or even constructing them on your own is recommended. In most cases, there is more than one way of completing the task.

Set up the directory and download the data

Setup a directory for this tutorial and download the mafft alignments, the necessary python3 scripts and the "phyla-groups.txt" file that contains the grouping of taxa into phyla:

mkdir practical2
cd practical2

#Download the assignment of samples to phyla
wget https://raw.githubusercontent.com/Pas-Kapli/CoME-Tutorials/main/data-filtering/phyla-groups.txt

mkdir mafft
cd mafft 

#Download the mafft alignments and extract them from the compressed file
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/alignments/mafft.tar.gz
tar -xvzf mafft.tar.gz
rm mafft.tar.gz

cd ../
mkdir scripts
cd scripts

#Download a set of scripts we will need later
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/scripts/scripts.tar.gz
tar -xvzf scripts.tar.gz
rm scripts.tar.gz

Step 1 - data assessment

Q: How many samples are in total across all loci?

Command
cd ../mafft
for i in *mafft; do grep ">" $i ; done | sort -u | wc -l

Q: How many sequences are there per locus?

Command
for i in *mafft; do echo -n "$i "; grep ">" $i | wc -l; done | sort -n -k2

Let's get rid of the loci with less than five sequences:

mkdir discarded
mv EOG091G04FV.mafft EOG091G0BWJ.mafft EOG091G0I85.mafft EOG091G18YA.mafft EOG091G05N4.mafft discarded 

Q: How well represented are the animal phyla per locus?

Hint: you will need to use the "phyla-groups.txt" that you downloaded earlier and use a nested for loop.

Command
for j in *mafft; do for i in `grep ">" $j | cut -f2 -d">"`; do grep $i ../phyla-groups.txt; done | cut -f1 -d$'\t' | sort -u | wc -l; done | sort -n

(..takes a couple of minutes)

Use this alternative if you want the filename to show up:

for j in *mafft; do echo -n "$j "; for i in `grep ">" $j | cut -f2 -d">"`; do grep $i ../phyla-groups.txt; done | cut -f1 -d$'\t' | sort -u | wc -l; done | sort -n -k 2

Q: Should we worry about phyla representation with these data?

Q: How do the alignments look like? (open a few with seaview or something similar)

Q: Do you think it's ok to use them as they are?

Calculate the average monophyly score across loci

To calculate the average monophyly score we need the gene trees, therefore we need to perform a tree search. Even with the simplest tree search command shown below it will take hours to infer all 978 trees. You can do one of the inferences and download the remaining of trees with the wget command provided below

cd ..
mkdir original-trees
cd original-trees
raxml-ng --search1 --msa ../mafft/EOG091G1CLN.mafft --model WAG+G --prefix EOG091G1CLN.mafft
rm *

#Download the pre-calculated ML trees
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/trees/original-trees.tar.gz
tar -xvzf original-trees.tar.gz
rm original-trees.tar.gz

Tip: if you have access to a server you can parallelise the gene-tree searches with the bash command "parallel". Example command:

parallel --jobs 15 "raxml-ng --search1 --msa {}.mafft --model WAG+G" < Accs.txt

We can now calculate the monophyly score

for i in *bestTree; do python3 ../scripts/monophyly-score.py $i ../phyla-groups.txt; done >  ../monophyly_scores-original.txt
cd ../
awk '{ total += $3 } END { if (NR > 0) print total / NR }' monophyly_scores-original.txt

Takes a couple of minutes..

In case you had trouble creating it here is the monophyly_scores-original.txt

Step 2 - filtering

Mild site/taxa filtering

Remove sequences that have >80% missing data.

cd mafft
for i in *mafft; do python3 ../scripts/filter_gappy_taxa.py $i 0.8; done

###Remove loci that contain < 10 phyla representatives.

mkdir discarded2
for j in *nogappy; do echo -n "$j "; for i in `grep ">" $j | cut -f2 -d">"`; do grep $i ../phyla-groups.txt; done | cut -f1 -d$'\t' | sort | uniq -c | wc -l; done | awk '$2 < 10'  | cut -f 1 -d" " | paste -s -d" " > files-to-move.txt
mv `cat files-to-move.txt` discarded2/
rm files-to-move.txt

Remove columns with all missing data but a single taxon.

for i in *nogappy; do python3 ../scripts/filter_alignment-sites.py $i 1; done
rm *nogappy

Download the pre-calculated gene-trees

cd ..
mkdir mild-filtering-trees
cd mild-filtering-trees
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/trees/mild-filt-trees.tar.gz
tar -xvzf mild-filt-trees.tar.gz
rm mild-filt-trees.tar.gz

Recalculate monophyly score

for i in *bestTree; do python3 ../scripts/monophyly-score.py $i ../phyla-groups.txt; done > ../monophyly_scores-mild-filtering.txt
cd ..
awk '{ total += $3 } END { if (NR > 0) print total / NR }' monophyly_scores-mild-filtering.txt

In case you had trouble creating it here is the monophyly_scores-original.txt

Trimal site filtering

Alignment site filtering is most often based on ad hoc criteria regarding alignment quality such as gappyness and sequence similarity or by retaining only the alignment positions that are robust to changes to alignment parameters. Nevertheless, reports on the impact of alignment trimming on the quality of downstream phylogenetic analysis vary and hence trimming should be used cautiously.

Here we will use a popular alignment tool caller Trimal and we will try to understand the effect of this filtering to the phylogenetic signal of our gene-trees based on the monophyly score.

Run Trimal

cd mafft
for i in *mafft; do trimal -automated1 -in $i -out $i.trimal; done

Download gene-trees

cd ../
mkdir trimal-trees
cd mkdir trimal-trees
wget https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/trees/trimal-trees.tar.gz
tar -xvcf trimal-trees.tar.gz

Recalculate monophyly score

for i in *.bestTree; do python3 ../scripts/monophyly-score.py $i ../phyla-groups.txt; done > ../monophyly_scores-trimal.txt
cd ../
awk '{ total += $3 } END { if (NR > 0) print total / NR }' monophyly_scores-trimal.txt

In case you had trouble creating it here is the monophyly_scores-original.txt

Phylter - taxa/locus filtering

Besides alignment filtering, there is often a step of identifying outlier sequences either due to artifactual (e.g. contaminations, misalignments) or real biological reasons (biases in state composition, rate of evolution etc.). There are several pipelines aiming to tackle such problems, however, the effect of data removal is not well understood.

We will perform a filtering step with PhylteR and we will try to understand the effect of this filtering to the phylogenetic signal of our gene-trees based on the monophyly score.

Write all trees in a text file

cd  mild-filtering-trees
cat *bestTree > all-trees-mild-filtering.txt
R

within R

> library(phylter)
> trees <- ape::read.tree("all-trees-mild-filtering.txt")
> results <- phylter(trees, bvalue = 0, distance = "patristic", k = 3, k2 = 3, Norm = "median", Norm.cutoff = 0.001, gene.names = NULL, test.island = TRUE, verbose = TRUE, stop.criteria = 1e-5, InitialOnly = FALSE, normalizeby = "row", parallel = TRUE)
> results$Final$Outliers
> write.phylter(results, file = "phylter.out")
ctl+d

Download the gene-trees and re-calculate the average monophyly score

cd ../
mkdir phylteR-trees
cd mkdir phylteR-trees
wget  https://github.com/Pas-Kapli/CoME-Tutorials/raw/main/data-filtering/trees/phylteR-trees.tar.gz
tar -xvzf phylteR-trees.tar.gz

for i in *bestTree; do python3 ../scripts/monophyly-score.py $i ../phyla-groups.txt; done > ../monophyly_scores-phylteR.txt
cd ../
awk '{ total += $3 } END { if (NR > 0) print total / NR }' monophyly_scores-phylteR.txt

In case you had trouble creating it here is the monophyly_scores-original.txt

Q Did the quality of the data improve?

Q How do our results compare to those reported by Tan et al., 2015

Q Do you think the results would change if you re-did the gene-tree inferences at each step?

⚠️ **GitHub.com Fallback** ⚠️