3 Orthology Inference - PhyloAI/Ortho2Web GitHub Wiki
Paragone-nf is a Nextflow-based bioinformatics pipeline for resolving paralogy in nuclear gene datasets derived from target enrichment (Hyb-Seq) and Deep Genome Skimming (DGS) data. It extends the capabilities of HybPiper with additional functionality for handling genes recovered in multiple copies. Paragone-nf applies the Monophyletic Outgroup (MO), Maximum Inclusion (MI), and Rooted Ingroups (RT) algorithms to separate orthologous from paralogous gene copies, which makes downstream analyses more precise.
The pipeline is compatible with various high-throughput computing environments, including SLURM for efficient resource management and job scheduling, and Singularity for containerization, enabling seamless integration into diverse bioinformatics workflows. Paragone-nf produces essential outputs, such as aligned gene sequences and detailed statistics on paralog resolution, which are invaluable for research into gene family evolution and phylogenomics.
Paragone-nf is especially valuable for investigating complex evolutionary histories, particularly in plants and other organisms characterized by extensive paralogous gene families.
Run Paragone with Nextflow
# Clone the Paragone repository
git clone https://github.com/chrisjackson-pellicle/paragone-nf.git
nextflow run paragone.nf -c paragone.config \
    -profile standard_singularity \
    --gene_fasta_directory directory/path/to/paralogs_no_chimeras \
    --internal_outgroups outgroup_name \
    --mo --mi --rt \
    --threads 30 --pool 10 --minimum_taxa 40 \
    -bg --outdir directory/path/to/result
--gene_fasta_directory: Directory containing recovered paralog FASTA files.
--internal_outgroups: Outgroup(s) for analysis (comma-separated if multiple).
--threads: Number of threads used for the task.
--pool: Number of tasks running in parallel.
--minimum_taxa: Specifies the minimum number of taxa.
-bg: Runs the pipeline in the background.
--outdir: Specifies the output directory.
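When the same run is repeated with different inputs, it can help to assemble the command in one place and inspect it before launching. The following is a minimal sketch, not part of Paragone-nf: the function name `paragone_cmd` and its argument order are hypothetical, and the defaults simply mirror the values in the example command. It only prints the command and does not start Nextflow.

```shell
# Hypothetical helper (not part of Paragone-nf): print the nextflow command
# for a given input directory, outgroup list, and output directory.
# Optional arguments 4-6 override --threads, --pool, and --minimum_taxa.
# Note: the printed command is not shell-quoted, so paths containing
# spaces would need quoting before the command is run.
paragone_cmd() (
    gene_dir="$1"
    outgroups="$2"
    outdir="$3"
    threads="${4:-30}"
    pool="${5:-10}"
    min_taxa="${6:-40}"

    printf '%s ' nextflow run paragone.nf -c paragone.config \
        -profile standard_singularity \
        --gene_fasta_directory "$gene_dir" \
        --internal_outgroups "$outgroups" \
        --mo --mi --rt \
        --threads "$threads" --pool "$pool" --minimum_taxa "$min_taxa" \
        -bg --outdir "$outdir"
    printf '\n'
)

# Example (prints the command only):
# paragone_cmd directory/path/to/paralogs_no_chimeras outgroup_name directory/path/to/result
```

Printing rather than executing keeps the sketch safe to try and leaves a record of the exact settings used for each run.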
Advantages and limitations of HybPiper-nf and Paragone-nf:
HybPiper-nf and Paragone-nf offer significant advantages by automating numerous manual steps through the use of software containers. These containers streamline software installation and reduce the need for redundant scripts, as highlighted by Yang and Smith (2014). Despite these benefits, both workflows present limitations that should be carefully considered.
HybPiper-nf: A major limitation of HybPiper-nf is that it cannot selectively re-run assemblies for specific samples when a dataset is modified, for example when samples are added or removed. It also does not merge statistics for new samples into existing results, which makes incremental dataset updates cumbersome.
Paragone-nf: For Paragone-nf, computational efficiency remains a significant challenge, particularly when dealing with large-scale data analyses. Increasing the number of processing cores does not necessarily improve the speed of phylogenetic tree construction due to the inherent limitations of parallelizing certain tasks in phylogenetic analysis. Instead, a more effective strategy is to submit separate jobs and run multiple smaller tasks concurrently to optimize overall runtime.
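One way to run several smaller tasks concurrently is to split the gene FASTA files into batches and start one Paragone-nf run per batch. The following is a minimal sketch under stated assumptions, not part of Paragone-nf: the function name `split_gene_batches`, the `batch_N` directory layout, and the `.fasta` file extension are hypothetical choices for illustration, and the sketch assumes that genes can be processed independently of one another.

```shell
# Hypothetical helper (not part of Paragone-nf): copy the *.fasta files in
# SRC_DIR into OUT_DIR/batch_1, OUT_DIR/batch_2, ... with at most
# BATCH_SIZE files per batch. Prints the number of files distributed.
split_gene_batches() (
    src_dir="$1"
    batch_size="$2"
    out_dir="$3"

    # Guard against a zero or negative batch size (would divide by zero).
    if [ "$batch_size" -le 0 ]; then
        echo "batch size must be a positive integer" >&2
        exit 1
    fi

    i=0
    for f in "$src_dir"/*.fasta; do
        [ -e "$f" ] || continue              # skip the unmatched glob pattern
        batch=$(( i / batch_size + 1 ))      # 1-based batch index
        mkdir -p "$out_dir/batch_$batch"
        cp "$f" "$out_dir/batch_$batch/"
        i=$(( i + 1 ))
    done
    echo "$i"
)

# Example: at most 50 genes per batch
# split_gene_batches directory/path/to/paralogs_no_chimeras 50 directory/path/to/batches
```

Each resulting `batch_N` directory could then be passed as `--gene_fasta_directory` to its own run with a distinct `--outdir`, for example as separate SLURM jobs.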