Introduction

The orthofinder pipeline does what its name suggests: it runs OrthoFinder to find orthologs among a set of protein sequences. The pipeline is fairly straightforward, but some preprocessing happens before orthologs are detected.

In brief, the pipeline performs the following:

Generate gene statistics using AGAT.
Filter gene annotations for longest isoforms using AGAT.
Extract longest isoforms using AGAT.
Identify orthologs using OrthoFinder.
Convert protein MSAs (from OrthoFinder) to nucleotide alignments using Pal2Nal.
Clean alignments using ClipKIT.
Generate MSA summary statistics using a custom tool, msaSummary.
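
To make the protein-to-nucleotide conversion step more concrete, here is a minimal Python sketch of the general idea: each residue in the aligned protein is replaced by its codon, and each gap by three gap characters. This illustrates the concept only and is not Pal2Nal's implementation. The function name thread_codons is hypothetical, and the sketch assumes the coding sequence is in frame and matches the protein residue for residue (optionally with a trailing stop codon).

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Thread the codons of `cds` onto a gapped protein sequence.

    Hypothetical helper for illustration only; it does not check that
    each codon actually translates to the aligned residue.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")

    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")

    # Tolerate one trailing codon (assumed to be the stop codon).
    if len(codons) == n_residues + 1:
        codons = codons[:n_residues]
    if len(codons) != n_residues:
        raise ValueError("protein and coding sequence lengths disagree")

    codon_iter = iter(codons)
    # A gap in the protein becomes three gap characters; a residue
    # consumes the next codon.
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)


print(thread_codons("M-K", "ATGAAA"))
```

The resulting nucleotide alignment keeps the column structure of the protein alignment, with every protein column spanning three nucleotide columns.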

As there is no 'best' way to choose a representative sequence for a gene, this pipeline simply uses the longest isoform. This simplifies ortholog detection by preventing multiple isoforms of the same gene from interfering with single-copy detection.
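
The selection rule can be sketched in a few lines of Python. This is an illustration of the idea, not AGAT's implementation: the function name longest_isoforms, the (gene, transcript, length) record layout and the tie-breaking rule (the first isoform seen wins) are all assumptions made for the example.

```python
def longest_isoforms(records):
    """Return {gene_id: transcript_id} keeping the longest isoform per gene.

    `records` is an iterable of (gene_id, transcript_id, length) tuples.
    Hypothetical helper: how length is measured (e.g. CDS or protein
    length) is left to the caller.
    """
    best = {}
    for gene_id, transcript_id, length in records:
        # Strict '>' means the first isoform seen wins a tie.
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (transcript_id, length)
    return {gene_id: tx for gene_id, (tx, _) in best.items()}


records = [
    ("gene1", "gene1.t1", 300),
    ("gene1", "gene1.t2", 450),
    ("gene2", "gene2.t1", 120),
]
print(longest_isoforms(records))
```

With one sequence per gene, a gene that is truly single-copy contributes exactly one protein per species, which is what single-copy ortholog detection relies on.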

Arguments

The current version of the orthofinder pipeline has the following arguments:

--gffs string                Directory path to annotation files. Extension must be '.gff' or '.gff3'.
--genomes string             Directory path to genome assembly files. Extension must be '.fa', '.fasta' or '.fna'.
--tree string                File path to phylogenetic tree that OrthoFinder should use.
--search_prog string         Which sequence search program OrthoFinder should use. Options: blast, diamond, diamond_ultra_sens, blast_gz, mmseqs, blast_nucl.
--trim_msa boolean           Should OrthoFinder trim multiple sequence alignments?
--stop_early boolean         Stop OrthoFinder after generating MSA files. Running the full pipeline can take a LONG time.

Below I'll go into each argument in a bit more detail.

Argument overview

Gffs

The argument --gffs takes the path to a directory that contains all the GFF3 files that you want to extract proteins from and compare. This pipeline assumes that the basename of the file (i.e. without the suffix) is the identifier you want in the ortholog files. Further, the basename used for the GFF3 files must match the basename of the genome files (see below).

Genomes

The --genomes argument works the same way as the --gffs argument above, but this time it should point to a directory that contains genome files in FASTA format. The files should have the suffix .fa; the argument list above also accepts .fasta and .fna. It is imperative that the basenames of the genome files match the GFF3 files EXACTLY. The pipeline matches the GFF3 files with the genome files using a left-join on the basenames of the files.

Any files that don't have a match between the genomes-gffs will be ignored in the analysis.
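
Before launching a run, it can help to check which files will actually be paired. The Python sketch below approximates the described behaviour by keeping only basenames present in both directories; it is not the pipeline's own code, and the function name pair_inputs is hypothetical. The accepted extensions are taken from the argument list above. Note that only the final suffix is stripped, so species.v1.gff3 has the basename species.v1.

```python
from pathlib import Path

GFF_SUFFIXES = {".gff", ".gff3"}
GENOME_SUFFIXES = {".fa", ".fasta", ".fna"}


def pair_inputs(gff_dir, genome_dir):
    """Pair GFF and genome files that share a basename.

    Hypothetical helper: returns {basename: (gff_path, genome_path)}
    and silently drops files without a partner.
    """
    gffs = {p.stem: p for p in Path(gff_dir).iterdir()
            if p.suffix in GFF_SUFFIXES}
    genomes = {p.stem: p for p in Path(genome_dir).iterdir()
               if p.suffix in GENOME_SUFFIXES}
    # Only basenames found on both sides survive.
    shared = sorted(gffs.keys() & genomes.keys())
    return {name: (gffs[name], genomes[name]) for name in shared}
```

Comparing the returned keys with the files in each directory shows at a glance which species would be left out of the analysis.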

Tree

This argument is simply a user-specified species tree in Newick format for OrthoFinder to use. By default, OrthoFinder will build its own species tree from the orthologs. However, the inferred tree can be biased by the sequencing platform used for each genome, so be aware.

Search program

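The --search_prog argument specifies which search tool to use when performing the all-by-all alignment between proteins in OrthoFinder. The valid options are MMseqs2 (mmseqs) or Diamond (diamond); the argument list above also names blast, diamond_ultra_sens, blast_gz and blast_nucl.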

Trim MSA

On its own, OrthoFinder trims the multiple sequence alignments that it generates using some custom cut-offs. In this pipeline, I've turned trimming off by default. If you'd like to trim the MSAs, pass the argument --trim_msa.

Stop early

This argument controls the exit point for OrthoFinder. By default, the pipeline will let OrthoFinder run to completion. However, if you'd rather have OrthoFinder finish after generating MSAs, pass --stop_early.
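
To show how the arguments above might translate into an OrthoFinder call, here is a hedged Python sketch that assembles a command line. It is a guess at the mapping, not the pipeline's actual code. The function name build_orthofinder_cmd and the diamond default are hypothetical. The flags are OrthoFinder 2.x options as I understand them (-f input directory, -M msa, -S search program, -s species tree, -z to disable MSA trimming, -oa to stop after alignments), so check them against orthofinder -h for your installed version.

```python
SEARCH_PROGRAMS = {
    "blast", "diamond", "diamond_ultra_sens",
    "blast_gz", "mmseqs", "blast_nucl",
}


def build_orthofinder_cmd(proteins_dir, search_prog="diamond", tree=None,
                          trim_msa=False, stop_early=False):
    """Assemble an OrthoFinder argument list (hypothetical mapping)."""
    if search_prog not in SEARCH_PROGRAMS:
        raise ValueError(f"unknown search program: {search_prog}")

    cmd = ["orthofinder", "-f", str(proteins_dir),
           "-M", "msa", "-S", search_prog]
    if tree is not None:
        cmd += ["-s", str(tree)]   # user-supplied species tree
    if not trim_msa:
        cmd.append("-z")           # assumed flag: do not trim MSAs
    if stop_early:
        cmd.append("-oa")          # assumed flag: stop after alignments
    return cmd


print(" ".join(build_orthofinder_cmd("proteins", stop_early=True)))
```

Building the command as a list keeps each option separate, which makes it easy to see which pipeline argument is responsible for which OrthoFinder flag.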

Orthofinder Pipeline - a-lud/nf-pipelines GitHub Wiki
