Orthofinder Pipeline - a-lud/nf-pipelines GitHub Wiki
The orthofinder pipeline does what the name suggests: it runs OrthoFinder to find orthologs between a set of
protein sequences. The pipeline is pretty straightforward, but it does a little processing before
orthologs are detected.
In brief, the pipeline performs the following:
- Generate gene statistics using AGAT.
- Filter gene annotations for longest isoforms using AGAT.
- Extract longest isoforms using AGAT.
- Identify orthologs using OrthoFinder.
- Convert protein MSAs (from OrthoFinder) to nucleotide alignments using Pal2Nal.
- Clean alignments using ClipKIT.
- Generate MSA summary statistics using a custom tool, msaSummary.
As there is no 'best' way to get a representative sequence, this pipeline simply uses the longest isoform of each gene. This simplifies ortholog detection by preventing multiple isoforms of the same gene from interfering with single-copy detection.
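The idea of keeping one representative per gene can be sketched in a few lines of Python. This is only an illustration of the selection rule, not how AGAT implements it: the Isoform record, its fields, and the longest_isoforms function are hypothetical names, and ties are resolved here by keeping the first isoform seen, which is an assumption.

```python
from typing import Iterable, NamedTuple


class Isoform(NamedTuple):
    """Hypothetical record for one transcript of a gene."""
    gene_id: str
    transcript_id: str
    cds_length: int


def longest_isoforms(isoforms: Iterable[Isoform]) -> dict[str, Isoform]:
    """Return the longest isoform for each gene, keyed by gene ID."""
    best: dict[str, Isoform] = {}
    for iso in isoforms:
        current = best.get(iso.gene_id)
        # Keep the first isoform seen on ties (an assumption for this sketch).
        if current is None or iso.cds_length > current.cds_length:
            best[iso.gene_id] = iso
    return best


# Made-up example data: geneA has two isoforms, geneB has one.
example = [
    Isoform("geneA", "geneA.t1", 900),
    Isoform("geneA", "geneA.t2", 1200),
    Isoform("geneB", "geneB.t1", 450),
]
selected = longest_isoforms(example)
for gene_id, iso in selected.items():
    print(gene_id, iso.transcript_id, iso.cds_length)
```

With one sequence per gene, a gene with several isoforms no longer looks like a multi-copy gene to the ortholog search.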
The current version of the orthofinder pipeline has the following arguments:
--gffs string Directory path to annotation files. Extension must be '.gff' or '.gff3'.
--genomes string Directory path to genome assembly files. Extension must be '.fa', '.fasta' or '.fna'.
--tree string File path to phylogenetic tree that OrthoFinder should use.
--search_prog string Which sequence search program OrthoFinder should use. Options: blast, diamond, diamond_ultra_sens, blast_gz, mmseqs, blast_nucl.
--trim_msa boolean Should OrthoFinder trim multiple sequence alignments?
--stop_early boolean Stop OrthoFinder after generating MSA files. Running the full pipeline can take a LONG time.
Below I'll go into each argument in a bit more detail.
The --gffs argument takes the path to a directory that contains all the GFF3 files that you want to
extract proteins from and compare. This pipeline assumes that the basename of each file (i.e. the file name without the suffix)
is the identifier you want in the ortholog files. Further, the basename used for the GFF3 files must match
the basename of the genome files (see below).
The --genomes argument takes the same kind of input as the --gffs argument above, but this time the argument should
point to a directory that contains genome files in FASTA format. The files should all have the suffix .fa.
It is imperative that the basenames of the genome files match the GFF3 files EXACTLY. The pipeline matches
the GFF3 files with the genome files using a left-join on the basenames of the files.
Any file without a match between the genome and GFF3 directories is ignored in the analysis.
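Before launching the pipeline, it can help to check which files will actually be paired. The sketch below mimics the matching rule described above in plain Python; it is not the pipeline's own code. The function names (index_by_basename, pair_inputs) are hypothetical, the suffix sets are taken from the argument list above, and the demo directories and species names are made up.

```python
import tempfile
from pathlib import Path

# Suffixes taken from the argument descriptions above.
GFF_SUFFIXES = {".gff", ".gff3"}
GENOME_SUFFIXES = {".fa", ".fasta", ".fna"}


def index_by_basename(directory, suffixes):
    """Map basename (file name without its final suffix) to file path."""
    return {
        path.stem: path
        for path in sorted(Path(directory).iterdir())
        if path.is_file() and path.suffix in suffixes
    }


def pair_inputs(gff_dir, genome_dir):
    """Return (paired, unmatched): matched GFF/genome paths and leftover basenames."""
    gffs = index_by_basename(gff_dir, GFF_SUFFIXES)
    genomes = index_by_basename(genome_dir, GENOME_SUFFIXES)
    # Only basenames present in both directories are kept.
    paired = {name: (gffs[name], genomes[name]) for name in gffs if name in genomes}
    # Basenames found in only one directory would be ignored by the analysis.
    unmatched = sorted(set(gffs) ^ set(genomes))
    return paired, unmatched


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gff_dir = Path(tmp, "gffs")
    genome_dir = Path(tmp, "genomes")
    gff_dir.mkdir()
    genome_dir.mkdir()
    for name in ("speciesA.gff3", "speciesB.gff"):
        (gff_dir / name).touch()
    for name in ("speciesA.fa", "speciesC.fna"):
        (genome_dir / name).touch()

    paired, unmatched = pair_inputs(gff_dir, genome_dir)
    print("paired:", sorted(paired))
    print("unmatched:", unmatched)
```

Any basename reported as unmatched is worth fixing by renaming the file, since the comparison is exact and case-sensitive in this sketch.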
The --tree argument is simply a user-specified species tree in Newick format for OrthoFinder to use. By default,
OrthoFinder will build its own species tree from the orthologs. However, this can be biased by the sequencing
platform of the genome, so be aware.
The --search_prog argument specifies which search tool to use when performing the all-by-all alignment between
proteins in OrthoFinder. The valid options are MMseqs2 or Diamond.
Normally, OrthoFinder trims the multiple sequence alignments that it generates using some custom cut-offs.
I've turned this off by default in the pipeline. If you'd like to trim the MSAs, pass the argument --trim_msa.
The --stop_early argument controls the exit point for OrthoFinder. By default, the pipeline will let OrthoFinder run to
completion. However, if you'd rather have OrthoFinder finish after generating MSAs, pass --stop_early.