EukPhylo Part 2: MSAs, trees, and contamination loop - Katzlab/EukPhylo GitHub Wiki

Overview and Modularity

EukPhylo Part 2 is designed to:

Generate multisequence alignments (MSAs) after first filtering in a "Pre-Guidance" step and then using Guidance version 2.02 , iterating to remove sequences considered non-homologs (based on sequence score is 0.3)
Generate gene trees using a 3rd party program (RaxML, IQTree, FastTree)
Remove contaminating sequences with the "Contamination Loop" (CL) and user-defined rules for both sister/subsister and clade grabbing options

Select orthologs and construct concatenated alignments for building species trees.

EukPhylo part 2 starts from the “ReadyToGo” files produced by part 1 (or any set of fasta files of sequences per-taxon with names that match EukPhylo 6 criteria) and generates multisequence alignments and trees. The pipeline is modular, and can be started, paused and resumed at multiple points. Output and input options of this part of the pipeline are flexible: users can input a folder of amino acid ReadyToGo files output by part 1, unaligned amino acid sequences per gene family (GF; in which case pre-Guidance filters are not run), aligned sequence files (one file per each GF, in which case Guidance does not run), or even trees if a user is running only the contamination loop. Details are provided below.

Example output files for EukPhylo part 2 and the contamination loop can be found here.

Installation

The following dependencies are required to run EukPhylo part 2, and only needs to be done once. The dependencies are confirmed to work using the version numbers in parentheses, though other versions may work as well.

Python 3, and the following libraries:
- Biopython (version 1.75)
- ETE3 (version 3.1.2.)
- tqdm (version 4.66.4; unlikely to matter)
Guidance (version 2.0.2)
- Note that once you have downloaded Guidance, you will likely have to run the command chmod u+x on the guidance executable script (found within the www folder) as well as all dependency executable scripts (found within the programs folder).
- Both the Guidance AND trimAl folders as downloaded from the respective websites (see trimAl link below) should be placed in the Scripts folder (see more info on folder structure below).
trimAl (version 2.rev0)
MAFFT (version 7.49)
IQ-Tree (version 2.1.2) and/OR RAxML (version 8.2.12) and/OR FastTree (version 2.1)

Setup

Input data and folder structure

Running EukPhylo Part 2 requires at least four items in your main directory:

A folder named Scripts and containing all scripts from EukPhylo part 2, as well as the Guidance and TrimAL folders downloaded from the respective websites (see links in Dependencies section above).
A folder containing your input files
A taxon list
An OG list
An Output folder
a shell script (or equivalent) to load modules and run programs ++

See below for details.

Databases

For those users interested in eukaryotic phylogeny, we provide a database of 1,000 diverse genomes and transcriptomes from across the eukaryotic, bacterial, and archaeal tree of life, with a focus on microeukaryotic diversity (see dataset S1 and S6). This database is in the form of "ReadyToGo" files, the output of EukPhylo part 1 (Dataset S16). This means that using this dataset, you can jump right in to running analyses of any subset of these taxa using any of the OGs in the Hook Database (see Part 1 above).

Running EukPhylo Part 2

You can find exemplar Run for EukPhylo in figshare

Once you have set up your run folder as described in the Setup section above, you're ready to run EukPhylo part 2. The pipeline is highly modular, and contains five main sections:

pre-Guidance (groups your sequences by GF instead of by sample and applies some basic filters; produces an unaligned amino acid file for each GF)
Guidance (aligns your amino acid sequences and iteratively removes putative non-homologs; produces an aligned amino acid file for each GF)
Tree building (produces a tree file [newick string] for each GF)
Contamination loop (see below)
Concatenation (see below) You can start and stop running EukPhylo part 2 at any of these sections, and the input to each of these sections is going to be different. This will be explained below. You're going to run part 2 using a Python command; if you're starting from scratch (ReadyToGo files as output by part 1) and would like to run all the way through tree-building (the most basic way of running the pipeline), you'll use the following command

python Scripts/eukphylo.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder

The --start and --end parameters tell EukPhylo what to expect in terms of input, and when to stop running the pipeline.

If you want to produce trees, keep the default --end parameter set to 'trees'
If you want to run through the pre-Guidance step, set --end to 'unaligned'
If you want to run through the Guidance step, set --end to 'aligned'
If you want to start at a different point other than raw data, you will change the default --start parameter to 'unaligned' (input a fasta file of unaligned amino acid sequences for each GF), 'aligned' (input a fasta file of aligned amino acid sequences for each GF) , or 'trees' (input a newick string file for each GF).

The --data parameter is where you point EukPhylo to your input file. If starting from ReadyToGo files (--start raw), this should be the path to a folder containing amino acid ReadyToGo files as output by part 1. If starting with Guidance, this should be a path to a folder of unaligned amino acid files (--start unaligned), etc. Note that when you are resuming a run and want to use files in the "Output" folder from a previous run, you will have to rename this "Output" folder to something else and give the new path as your --data parameter. For example, if you run only the pre-Guidance step, EukPhylo will generate a folder called Output/Pre-Guidance. If you then want to then later resume the run, you will have to rename this, e.g. to OldOutput/Pre-Guidance, and then in your new command use --start unaligned --data OldOutput/PreGuidance.

You will also need to give EukPhylo part 2 a list of all of the sample identifiers (taxon_list.txt) and gene family identifiers (listofOGs.txt) you want to include in your analysis; these text files should have no header and should just contain the list of identifiers, with one identifier per row.

Below is a list of basic EukPhylo part 2 parameters:

Argument	Default value	Options	Description
--start	raw	raw, unaligned, aligned, trees	Stage at which to start running EukPhylo.
--end	trees	unaligned, aligned, trees	Stage until which to run EukPhylo. Options are "unaligned" (up to but not including guidance), "aligned" (up to but not including RAxML), and "trees" which will run through RAxML.
--gf_list	No default	Any valid path	Path to the file with the GFs of interest. Only required if starting from the raw dataset.
--taxon_list	No default	Any valid path	Path to the file with the taxa (10-digit codes) to include in the output.
--data	No default	Any valid path	Path to the input dataset. The format varies depending on your --start parameter. If running the contamination loop starting with trees, this folder must include both trees AND a fasta file for each tree (with identical file names other than the extension) that includes an amino-acid sequence for each tip of the tree (with matching sequence names).
--output	Current directory	Any valid path	Directory where the output folder should be created. If not given, the folder will be created in the parent directory of the folder containing the scripts.

Optional arguments can then be added to the base command, and will be described below. In the following is described each stage of EukPhylo, and some key parameters to know for each step.

Pre-Guidance: Reorganizing files by gene family, and optional filters

The pre-guidance step of EukPhylo part 2 takes ReadyToGo files (one file per sample as output by EukPhylo part 1) and reorganized the amino acid sequences into one file per gene family (one file for each one of the GF you chose in the listofOGs.txt, see above). It also applies some optional filters, described below.

Filtering by GC content

Using a utility script (GC_identifier.py), prior to running EukPhylo part 2, users can choose to rename each sequence in the ReadyToGo file depending on whether the sequence falls outside of a user-defined range of GC content. The gene family identifier (last ten digits of the sequence identifier and by default prefixed by "OG6_") will be renamed depending on if the sequence GC content falls below (OGA) or above (OGG) or within (OG6) the user specified GC range. If the user wishes to only include one category of sequences when running EukPhylo part 2, the user should give the relabeled ReadyToGo files. The parameters for this when running pre-guidance is --og_identifier and the options are 'OG','OG6','OGA','OGG' with the default being ‘OG’, which passing all the sequences to guidance without filtering.

Adding these options to the command line will give:

python Scripts/eukphylo.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --og_identifier OG

A note on filtering by GC: we set our bounds by visually inspecting plots that compare GC content at silent sites (GC3 Degen) to effective number of codons (ENc) as calculated by Wright 1991. These graphs can be made using the utility script CUB.py. For example for the following graph, we might choose to exclude all points with a GC content less than 50% (so, set the lower bound to be 50% [OGA] and the upper bound to be 100% [OGG], then in part 2 use --og_identifier OG6).

Similarity filter

Another option to filter sequences from the ReadyToGo files at the pre-guidance step (and so reducing computer resources needed, and sequence redundancy) is the similarity filter. The purpose of the similarity filter is to remove likely in-group paralogs prior to running Guidance (a CPU intensive process). Here, user can choose to remove sequences that are too similar by adding the --similarity_filter parameter into the command line, on all the dataset or only on a specific set of taxa (see the table below for a description of the parameters to use).=

Argument	Default	Choices	Help
--og_identifier	OG	OG, OG6, OGA, OGG	Program to use for selecting sequences by GC width.
--similarity_filter	flag (true/false)	include or exclude the argument	Run the similarity filter in pre-Guidance.
--sim_cutoff	1	Any number between zero and one	Sequences from the same taxa that are assigned to the same OG are removed if they are more similar (% amino acid identity over 20% of their length) than this cutoff.
--sim_taxa	No default	Any valid path	Path to the file with the taxa (10-digit codes) to apply the similarity filter on.

Adding these options to the command line will give:

python Scripts/eukphylo.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --similarity_filter --sim_cutoff 0.99 --sim_taxa sim_taxa_list.txt

Other optional parameters:

Blacklist

The blacklist is a user-defined set of sequences to be removed from runs. You might choose to keep a list of sequences removed by Guidance to avoid reconsidering these non-homologs in future runs as this can save computer time, if you're going to be using the same ReadyToGo files and GFs in multiple runs of EukPhylo part 2. For our study, we chose to include only sequences removed by Guidance in our blacklist, but users should choose what fits best for their study and their data. To include this parameter in your EukPhylo run, you will need to add the --blacklist flag to the command line as follows:

python Scripts/eukphylo.py --start raw --end trees --gf_list listofOGs.txt --taxon_list taxon_list.txt --data Input_folder --output Output_folder --blacklist Blacklist.txt

Argument	Default	Choices	Help
--blacklist	No default	Any valid path	Path to a text file with a list of sequence names not to consider.

Guidance

Within EukPhylo part 2, we use Guidance to assess homology within gene families. EukPhylo part 2 runs Guidance in an iterative fashion to remove non-homologous sequences, defined as those that fall below the sequence score cutoff (we note that there is some stochasticity here). We note that we initially wrote the tool to use Guidance v2.0.2, but have since updated it to use Guidance v2.1, which runs faster but otherwise performs very similarly. Users who wish to use the older version of Guidance will have to make a small change in guidance.py (look for a comment in the script with the phrase "UNCOMMENT THE FOLLOWING LINE IF USING v2.0.2". Users should consult the Guidance 2.02 documentation or the Guidance v2.1 Github page for details on these parameters. After inspecting a diversity of gene families, we have lowered the default sequence score cutoff from 0.6 to 0.3, though this may not be appropriate for all genes. All sequences removed by Guidance are listed in output files with their score, and MSAs are rebuilt after each iteration. Some options are available to change the default set up for Guidance:

Argument	Default	Choices	Description
--guidance_path	none	A valid path	Only required if running Guidance v2.1, and not required if running Guidance v2.0.2. Path to the downloaded Guidance folder (probably called guidance_Linux or guidance_MacOS-arm64, this folder should contain a folder called "script" which contains the guidance_main.py script). You can download this folder from this link: https://github.com/XseniaP/Guidance_mid/tree/main.
--guidance_iters	5	Any positive integer	Number of Guidance iterations for sequence removal.
--seq_cutoff	0.3	Any number between 0 and 1	During guidance, taxa are removed if their score is below this cutoff.
--col_cutoff	0.0	Any number between 0 and 1	During guidance, columns are removed if their score is below this cutoff.
--res_cutoff	0.0	Any number between 0 and 1	During guidance, residues are removed if their score is below this cutoff.
--keep_temp	False	include or exclude the argument	Use this to keep ALL Guidance intermediate files.
--keep_iter / -z	False	include or exclude the argument	Keep all Guidance iterations (beware this will be very large)

Gene trees

After homology assessment and building MSAs (the Guidance step), EukPhylo trims alignments and build trees. By default, columns in the alignments are trimmed at 70% gaps with TrimAL, and trees by default are built by IqTREE with an LG+G model; users may choose to use a different third-party tool for phylogenetic reconstruction.

Argument	Default	Choices	Description
--tree_method	iqtree_fast	iqtree, iqtree_fast, raxml, fasttree	Program to use for tree-building.

NOTE: These processes are resource-intensive. Each system has its own syntax and requirements for running resource-intensive jobs. Here, we added examples of how we run in our local Smith College 'grid' system and on the more powerful HPC Unity cluster. Please adjust the 'tree.py' and 'run_eukphylo.sh' scripts to match with your system's capabilities, as specified in the scripts.

Contamination loop

The contamination coop (CL) is implemented within EukPhylo to allow the removal of contaminants based on the topology of each tree (phylogeny-informed contamination removal). Three modes are available: sister-, subsister-, and clade-based contamination removal. All modes take a user defined file of 'rules,' used to identify the sequences to remove. We first provide an overview of the three modes and then give details on running below.

Sisters-based contamination removal identifies sequences as putative contaminants based on their sister relationships. If a sequence from sample A appears on a tree sister to a sequence from sample B, and sample B is known to have contaminated sample A, then the sequence from sample A will be removed. Subsisters-based removal operates similarly, but looks at the taxa that are sister to sample A's parent node, useful for when multiple samples are contaminated by the same other sample.

Clade-based contamination removal operates differently. In this mode, the CL searches for monophyletic clades in each gene tree that match a set of given criteria. For example, if we want to 'clade-grab' for robust Opisthokonta clades, we might choose to keep only opisthokont sequences that fall in a monophyletic clade of 12 or more species for a study that include 20 opisthokonts; all other opisthokont sequences in the tree are removed.

The CL runs iteratively and users must set the number of times that rules should be applied to reconstructed trees using the --nloops parameter. Starting with a set of trees and a list of rules (e.g., a sequence from a ciliate is to be removed if it falls sister to a known food source), EukPhylo will for each iteration: identify a list of sequences as contaminants (writing them out to a file called SequencesRemoved_ContaminationLoop.txt), generate a fasta file for each gene family excluding contaminating sequences, reconstruct an alignment using Guidance (or just MAFFT), and generate a new tree. The default setting is to run the CL for 5 loops, and users can inspect outputs to determine optimal number for their study.

Contamination loop setup

You can find exemplar Run for the Contamination Loop in figshare

The CL requires 1) a folder of alignments (it is most correct to not give gap trimmed alignments here) and 2) a folder of gene trees, and they should be formatted in the same way as output by the preceding steps of EukPhylo part 2 (i.e. in the "Output" folder, see above); most importantly, the alignment and tree files should begin with the same unique GF identifer. You can also give it data not output by EukPhylo, but you will need to match the folder, file, and sequence name formats.

You will also need to create a 'rules' file to define which sequences should be removed in which circumstances. The format here varies between the different modes of the CL, but they are all tab-separated files. Examples can be found in Datasets S8-S12 on this project's Figshare.

Sisters/subsisters mode

In the sisters- and subsisters-modes, your rules file should include three columns. Each row represents a rule, for which a sequence from a taxon (identified by a ten-digit code or shorter code in the first column) will be removed if it is sister to a sequence from the taxon in the second column and on a branch that is shorter than X times the average branch length in the tree, where X is the number in the third column. Set the third column to “NA” if you do not desire to put any branch length restriction for the rule. For example, the line

Op_ch_Dgra	Sr_di	0.1

indicates that the a sequence from the choanoflagellate Op_ch_Dgra should be removed if it is sister to any dinoflagellate (any sequence beginning with the prefix Sr_di) on a branch that is less than one tenth the average branch length in the gene tree. In some cases, a user might want to remove a sequence regardless of branch length, in which case the third column is left blank: so here we are removing a ciliate anytime it falls sister to a green alga regardless of branch length:

Sr_ci_Fsal	Pl_gr

Clade-grabbing mode

In clade-grabbing mode, each row again represents a rule. This time, there are five columns. The first column gives the target taxonomic group for which you are clade grabbing. Here you can give a ten-digit code, a subset of a code, or even the path to a text file containing a list of multiple codes if they don't all share a precise enough prefix. The third column gives the minimum number of target taxa that must be in a clade for it to be kept, and the second column gives the minimum proportion (or absolute number of >1) of taxa in that clade that are not in the target group. The fourth column allows you to give a list of 'special' taxa (or just a ten-digit code or a subset of a code), X of which must be present in a clade for it to be selected, where X is the number in the fifth column. For example, the line

Sr_ci	0.1	13	ciliate_genomes.txt	1

indicates that all ciliate sequences should be removed if they don't fall in a clade with at least 13 ciliate species (unique ten digit codes beginning with Sr_ci), where no more than 1/10 of the species in the clade are non-ciliates, and containing at least 1 sequence that begins with a prefix listed in the ciliate_genomes.txt file (i.e., if you're more confident in genomic data, you may want to make sure that there's a genome in your clade).

Having well-organized ten-digit-codes for sample identification is vital for running the CL, especially clade grabbing, because it will allow the CL to find all sequences belonging to taxa from a specific taxonomic group. For example, in the ten-digit-codes used in the 1000-ReadyToGo file database, in order to clade grab for Opisthokonta the CL will look for all sequences with ten-digit-codes starting with Op_, and to clade grab for Metazoa, all ten-digit-codes starting with Op_me, etc.

Running

To run the CL, use a similar command structure as described for running EukPhylo part 2 above, and add the --contamination_loop parameter to activate the contamination loop and specify a mode and the path to a rules file. There are different inputs required for the two different modes. You can't run both modes at the same time, so users should run one mode, then give the output files of that run as input to run the other mode. No matter what, if you want to run the contamination loop, you'll want to use the following two arguments:

Parameter	Required	Options	Description	Default
--contamination_loop	yes	seq, clade	The mode in which to run the CL	none
--nloops	no	any positive integer	Number of iterations	5

You can also change the tree-building method using the --cl_tree_method argument (iqtree, iqtree_fast, fasttree or raxml with the default being iqtree_fast). You can choose whether to run Guidance between each iteration, or just MAFFT, by using the --cl_alignment_method argument (mafft_only or guidance with the default being mafft_only). Here are some mode-specific arguments:

Sisters/subsisters mode

Below are the parameters for running sisters/subsisters. You must give a rules file to EITHER --sister_rules or --subsister_rules, or both.

Parameter	Required	Options	Description	Default
--sister_rules	only if filtering for sisters	Any valid path	Path to a text file containing 'sisters rules'	none
--subsister_rules	only if filtering for subsisters	Any valid path	Path to a text file containing 'subsisters rules'	none

Here's an example command for filtering by sisters relationships using the contamination loop:

python eukphylo.py --start trees --end trees --data <path to folder of input data> --output <path to output folder> --contamination_loop seq --sister_rules sister_rules.txt

In this case, the file sister_rules.txt should contain the rules for sisters-based removal. The format of this file is described above.

Clade-grabbing mode

Parameter	Required	Options	Description	Default
--clade_grabbing_rules	yes	Any valid path	Path to a text file containing 'clade-grabbing rules'	none
--clade_grabbing_exceptions	no	Any valid path	List of taxa to not remove for any reason	none

Here's an example command for filtering by sisters relationships using the contamination loop:

python eukphylo.py --start trees --end trees --data <path to folder of input data> --output <path to output folder> --contamination_loop clade --clade_grabbing_rules clade_rules.txt

In this case, the file clade_rules.txt should contain the rules for clade-based removal. The format of this file is described above.

Ortholog selection and concatenation

EukPhylo includes an optional step, which can be run after the tree-building stage (or by using --start trees and passing to the --data argument a folder of trees and corresponding aligned or unaligned sequences files), to select orthologs (one sequence at most per taxon from each GF) and build a concatenated alignment. EukPhylo first identifies for each taxon the monophyletic clade with the greatest number of species from that taxon's minor clade, using the first five digits of that taxon's sample identifier (e.g., Op_me for metazoa); alternatively, a user can select orthologs for only a target group of taxon using the --concat_target_taxa argument by inputting a file with a list of ten digit codes, or just a single ten-digit code or clade prefix. If only one sequence from the taxon falls into this largest clade, that's the sequence chosen for concatenation; otherwise, then a score is given to each sequence equal to its length times is k-mer coverage for transcriptomic data, and just the sequence length for genomic data, and the sequence with the highest score is taken. If a GF is not present in a taxon, then the space is filled with gaps in the concatenated alignments. This step produces a clearly labeled concatenated alignment, as well as a folder called "DataToConcatenate" in which can be found all the selected orthologs for each GF, aligned and unaligned.

To run this step, add the --concatenate flag to your EukPhylo command. If you don't include this flag, concatenation will by default not be run. If you want to run concatenation alone (i.e., you already have trees and alignments) then you'll have set the start parameter to --start trees and set up your input data in the style of EukPhylo's "Output" folder. Namely, create a folder called "Output" in the same directory that contains the "Scripts" folder. Inside the "Output" folder, create a folder called "Trees" and a folder called "Guidance." Put your input trees in the Trees folder and the input aligned or unaligned sequences files in the Guidance folder. Each file in the Trees folder should have a corresponding file in the Guidance folder; the names of these files should match up until the last period (i.e., until the file extension). For example, for gene family OG6_100206 you might have a tree file called OG6_100206.PostCL.tre and an alignment file called OG6_100206.PostCL.fasta. Here, the "OG6_100206.PostCL" must match. Below is an example command:

python eukphylo.py --start trees --concatenate --concat_target_taxa Sr_rh --data Output

In this case, the user is starting with an already built set of trees and alignments (in their Output folder) and would only like to concatenate sequences from Rhizaria (Sr_rh). If concatenating as part of an end-to-end EukPhylo run, just change your --start parameter accordingly and include other parameters as defined above.