CodeML Pipeline - a-lud/nf-pipelines GitHub Wiki
This sub-workflow is dedicated to running selection tests using CodeML from the PAML suite of tools. Specifically, this pipeline makes use of ETE3, a CLI that provides a more user-friendly experience than the default CodeML program.
The pipeline utilises cleaned single-copy orthologs (SCOs), identified by whichever means you wish. I recommend using OrthoFinder, although many tools will do. For this pipeline, I have stuck to the models implemented in ETE3. That is to say, the accessory functions know how to handle output from the default, common model comparisons that ETE3 specifies, but will fail with anything more unusual.
This pipeline also has the functionality to perform drop-out analyses as discussed by Kowalczyk et al., 2021.
This is a method whereby a Branch-Site model is run using a foreground lineage, within
which positive selection can occur (
To account for this, Kowalczyk et al., 2021 demonstrated that by dropping the samples on the foreground branch and running a Site model on only the background samples, you can determine whether positive selection is also occurring on the background branches. An approach such as this prevents misattributing positive selection to convergent phenotypes.
The current version of the CodeML pipeline has the following arguments:
--msa string Directory path to MSA files. Extension must be '.fa' or '.fasta'.
--tree string File path to phylogenetic tree.
--models string Which CodeML models to run. Provide as a quoted string separated by spaces. Options: M0, M1, M2, M3, M4, M5, M6, M7, M8, M8a, M9, M10, M11, M12, M13, SLR, fb_anc, bsA, bsA1, bsB, bsC, bsD, b_free, b_neut, fb.
--dropout boolean Run drop-out analysis if BS models are run. Only works if you are running bsA AND bsA1.
Below I'll go into each argument in a bit more detail.
The first argument, --msa, takes a directory path. This directory should contain all the codon-translated MSA files.
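As a rough illustration of what the --msa directory is expected to look like, the sketch below collects the alignment files with the two accepted extensions. The function name find_msa_files is hypothetical and is not part of the pipeline; this is only one way the check could be written.

```python
from pathlib import Path

# Extensions accepted by --msa, as listed in the argument table.
MSA_EXTENSIONS = {".fa", ".fasta"}


def find_msa_files(msa_dir):
    """Return the sorted alignment files found directly inside msa_dir.

    Hypothetical helper: the pipeline may discover its inputs differently.
    """
    directory = Path(msa_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"--msa must be a directory: {msa_dir}")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix in MSA_EXTENSIONS
    )
```

Files with any other extension (for example '.txt' or '.aln') are simply ignored in this sketch.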
The next argument, --tree, expects a file path to a Newick species tree. Newick format is important because, if you are running the drop-out analysis, the pipeline automatically determines which samples to drop (based on your marking of the tree with #1 symbols) and returns a pruned version. If another tree format is used, the pipeline will fail at this step.
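To make the pruning step more concrete, here is a minimal sketch of how marked samples could be found and dropped from a Newick string. It is not the pipeline's actual code: the functions parse_newick, foreground_leaves and prune are hypothetical, the parser only handles plain names with optional branch lengths and #1 marks, a mark on an internal node is assumed to apply to every tip beneath it, and branch lengths are discarded in the pruned output.

```python
import re


def parse_newick(text):
    """Parse a Newick string into nested dicts with name, mark and children."""
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1
            children.append(node())
            while tokens[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip the closing ')'
        label = ""
        if pos < len(tokens) and tokens[pos] not in {"(", ")", ",", ";"}:
            label = tokens[pos].strip()
            pos += 1
        # The name is whatever precedes a branch length (':'), mark ('#') or space.
        name = re.split(r"[:#\s]", label)[0] if label else ""
        return {"name": name, "mark": "#1" in label, "children": children}

    return node()


def foreground_leaves(tree, inherited=False):
    """List tips that are marked, or that sit below a marked internal node."""
    marked = inherited or tree["mark"]
    if not tree["children"]:
        return [tree["name"]] if marked else []
    return [n for child in tree["children"] for n in foreground_leaves(child, marked)]


def prune(tree, drop):
    """Return a topology-only Newick fragment without the tips in drop."""
    if not tree["children"]:
        return None if tree["name"] in drop else tree["name"]
    kept = [p for p in (prune(c, drop) for c in tree["children"]) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse nodes left with a single child
    return "(" + ",".join(kept) + ")"


marked_tree = parse_newick("((A #1,B #1) #1,(C,D),E);")
to_drop = foreground_leaves(marked_tree)
pruned_newick = prune(marked_tree, set(to_drop)) + ";"
print(to_drop)        # ['A', 'B']
print(pruned_newick)  # ((C,D),E);
```

The same list of dropped samples would also need to be removed from each MSA before running the Site model on the background samples.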
As I stated in the introduction, I've simply implemented the common models defined by the authors of PAML and the developers of ETE3. The --models argument expects a quoted string of the models you wish to run. If there are valid comparisons to make (e.g. M2 vs M1), the pipeline will automatically make them (valid comparisons are, again, taken from the ETE3 documentation).
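The automatic pairing can be pictured as a lookup against a table of (alternative, null) model pairs. The sketch below is only an illustration: the table is a subset that I am assuming from the ETE3 documentation, so check it against that documentation before relying on it, and the names VALID_COMPARISONS and comparisons_for are hypothetical.

```python
# Illustrative (alternative, null) pairs. This is an assumed subset of the
# comparisons described in the ETE3 documentation, not the pipeline's own table.
VALID_COMPARISONS = [
    ("M2", "M1"),
    ("M3", "M0"),
    ("M8", "M7"),
    ("M8", "M8a"),
    ("bsA", "bsA1"),
    ("bsA", "M1"),
    ("b_free", "b_neut"),
    ("b_free", "M0"),
]


def comparisons_for(models):
    """Return every known pair whose two models both appear in the --models string."""
    requested = set(models.split())
    return [
        (alternative, null)
        for alternative, null in VALID_COMPARISONS
        if alternative in requested and null in requested
    ]


print(comparisons_for("M1 M2 bsA bsA1"))
# [('M2', 'M1'), ('bsA', 'bsA1'), ('bsA', 'M1')]
```

A model requested without its partner (for example M8 on its own) produces no comparison in this sketch; it would simply be run by itself.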
The --dropout argument is simply a flag that dictates whether a drop-out analysis should be run. As noted in the argument list, it only works when both bsA and bsA1 are among the requested models.
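The constraints on --models and --dropout can be summarised as a small validation routine. This is a sketch under my own assumptions about how such a check might look; check_arguments is a hypothetical name and the pipeline may validate its inputs differently (or not at all).

```python
# Model names copied from the --models option list above.
ALLOWED_MODELS = {
    "M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M8a", "M9",
    "M10", "M11", "M12", "M13", "SLR", "fb_anc", "bsA", "bsA1", "bsB",
    "bsC", "bsD", "b_free", "b_neut", "fb",
}


def check_arguments(models, dropout=False):
    """Validate a --models string and the --dropout flag; return the model list."""
    requested = models.split()
    unknown = [m for m in requested if m not in ALLOWED_MODELS]
    if unknown:
        raise ValueError(f"Unknown model(s): {', '.join(unknown)}")
    # The drop-out analysis needs both branch-site models to have been run.
    if dropout and not {"bsA", "bsA1"} <= set(requested):
        raise ValueError("--dropout requires both bsA and bsA1 in --models")
    return requested


print(check_arguments("M1 M2 bsA bsA1", dropout=True))
# ['M1', 'M2', 'bsA', 'bsA1']
```

Failing early like this is a design choice for the sketch: it avoids starting long CodeML runs only to discover at the end that the drop-out step cannot be performed.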