CodeML Pipeline - a-lud/nf-pipelines GitHub Wiki

Introduction

This sub-workflow is dedicated to running selection tests using CodeML from the PAML suite of tools. Specifically, the pipeline uses ETE3, a Python toolkit whose command-line interface wraps CodeML and provides a more user-friendly experience than running CodeML directly.

The pipeline takes cleaned single-copy orthologs (SCOs), identified by whichever means you wish. I recommend OrthoFinder, but many tools will do. For this pipeline, I have stuck to the models implemented in ETE3. That is to say, the accessory functions know how to handle output from the default, common model comparisons that ETE3 specifies, but will fail on anything more unusual.

This pipeline can also perform drop-out analyses as discussed by Kowalczyk et al., 2021. In this method, a Branch-Site model is run using a foreground lineage, within which positive selection can occur ( $\omega\gt1$ ). Branch-Site models in the PAML package constrain $\omega$ on non-foreground (background) branches to $\omega\le1$, meaning that if positive selection were occurring on a background branch, the Branch-Site test would not report it.

To account for this, Kowalczyk et al., 2021 demonstrated that by dropping the samples on the foreground branch and running a Site model to test for positive selection on only the background samples, you can determine whether positive selection is also occurring on the background branches. This approach helps prevent misattributing positive selection to convergent phenotypes.

Arguments

The current version of the CodeML pipeline has the following arguments:

--msa string                 Directory path to MSA files. Extension must be '.fa' or '.fasta'.
--tree string                File path to phylogenetic tree.
--models string              Which CodeML models to run. Provide as a quoted string separated by spaces. Options: M0, M1, M2, M3, M4, M5, M6, M7, M8, M8a, M9, M10, M11, M12, M13, SLR, fb_anc, bsA, bsA1, bsB, bsC, bsD, b_free, b_neut, fb.
--dropout boolean            Run drop-out analysis if BS models are run. Only works if you are running bsA AND bsA1.

Below I'll go into each argument in a bit more detail.

Argument overview

MSA

The first argument --msa takes a directory path. This directory should contain all the codon-translated MSA files.
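As a rough illustration of the extension rule above, here is a minimal Python sketch that lists the files in a directory that would qualify as MSA input. The function name `find_msa_files` is hypothetical and is not part of the pipeline; it only mirrors the documented requirement that files end in '.fa' or '.fasta'.

```python
from pathlib import Path

# Extensions accepted by --msa, as documented in the argument list.
MSA_EXTENSIONS = {".fa", ".fasta"}


def find_msa_files(msa_dir):
    """Return the sorted MSA files in msa_dir ending in .fa or .fasta.

    Hypothetical helper for illustration; the pipeline's own file
    discovery may differ.
    """
    directory = Path(msa_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"--msa must be a directory: {msa_dir}")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix in MSA_EXTENSIONS
    )
```

Running something like this before launching the pipeline is a quick way to spot alignments that would be skipped because of an unexpected extension (for example '.fna' or '.aln').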

Tree

The next argument, --tree, expects a file path to a species tree in Newick format. The format matters because, if you are running the drop-out analysis, the pipeline automatically determines which samples to drop (based on the #1 marks you place on the tree) and returns a pruned version. If another tree format is used, the pipeline will fail at this step.
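To give a feel for how #1 marks translate into samples to drop, here is a small Python sketch. It is not the pipeline's implementation: the function names are hypothetical, and it assumes the mark sits directly after a tip name (e.g. `chimp #1`), that labels are unquoted, and that marks on internal nodes (e.g. `(A,B)#1`) are out of scope.

```python
import re

# A tip label follows "(" or "," and runs until the next Newick delimiter.
TIP = re.compile(r"[(,]\s*([^(),:;#\s]+)")
# A foreground tip is a tip label followed directly by "#1" (and not "#10" etc.).
FOREGROUND_TIP = re.compile(r"[(,]\s*([^(),:;#\s]+)\s*#1(?!\d)")


def foreground_tips(newick):
    """Return tip labels that carry a #1 mark (assumed layout: 'name #1')."""
    return FOREGROUND_TIP.findall(newick)


def background_tips(newick):
    """Return tip labels that would remain after dropping marked tips."""
    marked = set(foreground_tips(newick))
    return [tip for tip in TIP.findall(newick) if tip not in marked]


# Made-up example tree with two marked tips.
tree = "((human,chimp #1),(mouse#1,rat));"
print(foreground_tips(tree))  # ['chimp', 'mouse']
print(background_tips(tree))  # ['human', 'rat']
```

A real pruning step also has to rewrite the tree itself, which is best left to a tree library; the sketch only shows how the marked samples can be identified.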

Models

As I stated in the introduction, I've simply implemented the common models defined by the authors of PAML and the developers of ETE3. The --models argument expects a quoted string of the models you wish to run. If there are valid comparisons to make (e.g. M2 vs M1), the pipeline will automatically make them (the valid comparisons are, again, taken from the ETE3 documentation).
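Model comparisons such as M2 vs M1 are usually evaluated with a likelihood-ratio test (LRT). The Python sketch below shows the arithmetic only; it is not taken from the pipeline or from ETE3. The log-likelihood values are invented for illustration, and the degrees of freedom (2 for M2 vs M1, 1 for bsA vs bsA1) are the commonly used values, stated here as an assumption. Only df = 1 and df = 2 are covered so that the standard library suffices.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt, df):
    """P-value of a likelihood-ratio test against a chi-square distribution.

    lnl_null and lnl_alt are the log-likelihoods of the null and
    alternative models. Only df = 1 and df = 2 are handled, using the
    closed-form chi-square survival functions.
    """
    # Test statistic: twice the log-likelihood gain, floored at zero.
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only covers df = 1 or df = 2")


# Invented log-likelihoods for an M1 (null) vs M2 (alternative) comparison.
p_value = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-997.0, df=2)
print(f"M2 vs M1: p = {p_value:.4f}")
```

A small p-value favours the alternative model (here M2, which allows a class of sites with $\omega\gt1$) over the null.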

Dropout

The --dropout argument is simply a flag that dictates whether a drop-out analysis should be run. It only takes effect when both bsA and bsA1 are among the models being run.
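The rules for --models and --dropout can be summarised in a short Python sketch. The function names are hypothetical and the checks are my reading of the argument descriptions above, not the pipeline's actual validation code; the model names are copied from the documented options.

```python
# Model names accepted by --models, as listed in the argument overview.
VALID_MODELS = {
    "M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M8a", "M9",
    "M10", "M11", "M12", "M13", "SLR", "fb_anc", "bsA", "bsA1", "bsB",
    "bsC", "bsD", "b_free", "b_neut", "fb",
}


def parse_models(models):
    """Split a space-separated --models string and reject unknown names."""
    requested = models.split()
    unknown = [name for name in requested if name not in VALID_MODELS]
    if unknown:
        raise ValueError(f"unknown model(s): {', '.join(unknown)}")
    return requested


def dropout_is_possible(requested):
    """Drop-out analysis needs both bsA and bsA1 among the requested models."""
    return {"bsA", "bsA1"} <= set(requested)


requested = parse_models("M1 M2 bsA bsA1")
print(dropout_is_possible(requested))  # True
```

Requesting only one of the two Branch-Site models (for example "M1 M2 bsA") would make `dropout_is_possible` return False, matching the note that drop-out only works when bsA and bsA1 are both run.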

Pipeline schematic
