CodeML ETE3 implementation - a-lud/nf-pipelines GitHub Wiki

The ETE3 implementation of CodeML is an easy way to run selection analyses. My personal preference is to use HyPhy, but there are use cases where CodeML is the right tool for the job. Below I briefly outline how to use this sub-workflow.

Arguments

The pipeline has been designed with simplicity in mind. The current implementation should suffice for most use cases, but some analyses may need a more complex setup. In those cases, I recommend scripting the analysis separately using ETE3. The main pipeline arguments are as follows:

--pipeline codeml: Pass the codeml argument to the pipeline to run the sub-workflow
--models: Comma separated string of models to run. Valid models include [ M0, M1, M2, M3, M4, M5, M6, M7, M8, M8a, M9, M10, M11, M12, M13, SLR, bsA, bsA1, bsB, bsC, bsD, b_free, b_neut, fb, fb_anc ]
--trees: Paths to tree files to use as a comma separated string
--tests: String of model comparisons. Models to be compared should be separated with commas. Multiple model comparisons should be separated with spaces (e.g. 'M2,M1 M3,M0')
--codeml_optional: Optional arguments to provide to CodeML
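
As a rough illustration of the `--models` and `--tests` string formats described above, the sketch below parses both in Python. The helper names (`parse_models`, `parse_tests`) are hypothetical and are not part of the pipeline, whose own parsing may differ. The list of valid models is copied from the `--models` description.

```python
from __future__ import annotations

# Hypothetical helpers, not part of the pipeline: they only illustrate
# the string formats expected by --models and --tests.

VALID_MODELS = {
    "M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M8a", "M9",
    "M10", "M11", "M12", "M13", "SLR", "bsA", "bsA1", "bsB", "bsC",
    "bsD", "b_free", "b_neut", "fb", "fb_anc",
}


def parse_models(models: str) -> list[str]:
    """Split a comma-separated --models string and reject unknown names."""
    names = [m.strip() for m in models.split(",") if m.strip()]
    unknown = [m for m in names if m not in VALID_MODELS]
    if unknown:
        raise ValueError(f"Unknown model(s): {', '.join(unknown)}")
    return names


def parse_tests(tests: str) -> list[tuple[str, str]]:
    """Split a --tests string into pairs of models to compare.

    Comparisons are separated by spaces; the two models within a
    comparison are separated by a comma.
    """
    pairs = []
    for comparison in tests.split():
        parts = comparison.split(",")
        if len(parts) != 2:
            raise ValueError(f"Expected two models in '{comparison}'")
        pairs.append((parts[0], parts[1]))
    return pairs


models = parse_models("M0,M1,M2,M3")
tests = parse_tests("M2,M1 M3,M0")
print(models)  # ['M0', 'M1', 'M2', 'M3']
print(tests)   # [('M2', 'M1'), ('M3', 'M0')]
```

Checking the strings this way before launching a run can catch a misspelled model name early, rather than after jobs have been submitted.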

Pipeline flow

The number of jobs submitted to the cluster is the number of alignments multiplied by the number of trees (one job per alignment x tree combination). If you have many alignments, many jobs will be submitted to the cluster. Because CodeML is run through ETE3, all user-specified models are run within the same job for each alignment and tree combination.

This is nice, as it means we don't have to submit a unique job for each alignment x tree x model combination, which would quickly get excessive. However, we are still limited by the number of alignment x tree combinations we generate, so consider this before diving into a full 10,000 gene analysis with 15 trees to compare (i.e. 150,000 jobs).
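
The job arithmetic above can be checked with a few lines of Python before launching a run. This is only a sketch: the file names are made up for illustration, and the pipeline itself does not expose anything like this.

```python
from itertools import product

# Hypothetical file names, for illustration only.
alignments = ["gene1.fasta", "gene2.fasta", "gene3.fasta"]
trees = ["treeA.nwk", "treeB.nwk"]
models = ["M0", "M1", "M2"]

# One job per alignment x tree combination; all models run inside that job.
jobs = list(product(alignments, trees))
print(len(jobs))  # 6

# If every model were its own job, the count would also scale with the models.
per_model_jobs = len(jobs) * len(models)
print(per_model_jobs)  # 18

# The example from the text: 10,000 genes and 15 trees.
print(10_000 * 15)  # 150000
```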