Accessing and Interpreting Output - jacksonhturner/orthogarden GitHub Wiki

OrthoGarden doesn't just grow phylogenetic trees; it also produces a bounty of useful intermediate files and summary tables.

Navigating the Publish Directory

Intermediate output files created by OrthoGarden are accessible at every step of the pipeline. The output directory specified when starting an OrthoGarden run has the following structure:

.
├── publish
|   ├── align_nt
|   ├── augustus
|   ├── design
|   ├── iqtree
|   ├── mafft
|   ├── mstatx
|   ├── mstatx_scores
|   ├── orthofinder
|   ├── orthofinder_finder
|   ├── remove_thirds
|   ├── summary
|   ├── summary_table
|   └── trimal
└── work

The work/ directory contains all intermediate files for each analysis, stored under alias subdirectories that are not human-readable but are accessible to Nextflow. Altering files within work/ will interfere with OrthoGarden's -resume function.

publish/ stores the final output files of each step within OrthoGarden. These human-readable files are organized into one subdirectory per analysis.

[!NOTE] Intermediate files for completed analyses can be accessed in publish while OrthoGarden is running.
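
To check which analyses have already published files during a run, the publish directory can be scanned for non-empty subdirectories. The following is a minimal sketch, not part of OrthoGarden; the function name `completed_steps` is hypothetical, and it assumes that a step counts as "completed" once its subdirectory contains at least one file.

```python
from pathlib import Path


def completed_steps(publish_dir):
    """Return the names of publish/ subdirectories that already contain files."""
    publish = Path(publish_dir)
    return sorted(
        sub.name
        for sub in publish.iterdir()
        # A step is treated as done once any file exists below its directory
        # (an assumption made for this sketch).
        if sub.is_dir() and any(p.is_file() for p in sub.rglob("*"))
    )
```

Calling `completed_steps("publish")` from inside the output directory would then list, for example, `mafft` or `trimal` once those steps have written their results.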

Retrieving and interpreting the completed phylogeny

Once the pipeline is complete, the phylogeny generated by OrthoGarden may be accessed at project/publish/iqtree/run_iqtree.treefile within the directory set by --publish-dir.

.
├── publish
|   └── iqtree
|       └── run_iqtree.treefile 🌱
└── work

This phylogeny may be used as an input file for a third-party program designed for the visualization of phylogenies, such as the Interactive Tree Of Life (iTOL) or FigTree. OrthoGarden does not visualize phylogenies itself, and no such support is planned.
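
Before loading the tree into a viewer, it can be useful to confirm which taxa made it into the phylogeny. The sketch below assumes the treefile is plain Newick text with unquoted taxon names; `tip_labels` is a hypothetical helper, and the example string is made up rather than taken from a real run. A dedicated parser such as Biopython's `Bio.Phylo` would be a more robust choice for quoted or unusual labels.

```python
import re

# A leaf name in Newick directly follows "(" or ","; labels that follow ")"
# are internal-node support values and are deliberately not matched.
_TIP = re.compile(r"[(,]\s*([^():,;\s]+)")


def tip_labels(newick):
    """Return leaf names from a Newick string, in the order they appear."""
    return _TIP.findall(newick)


# Made-up example tree with branch lengths and one support value.
example = "((taxon_a:0.1,taxon_b:0.2)95:0.05,taxon_c:0.3);"
print(tip_labels(example))  # ['taxon_a', 'taxon_b', 'taxon_c']
```

For a real run, the contents of run_iqtree.treefile would be read into a string and passed to `tip_labels` in place of `example`.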

Evaluating contribution of taxa to the retrieval of orthologs

OrthoGarden creates summary tables that provide users an overview of the relative completeness of recovered genes. These tables support troubleshooting by describing how well each gene and taxon were represented in the final phylogeny.

.
├── publish
|   └── summary_tables
|       ├── summary_table_with_genes.tsv 🌱
|       └── summary_table_with_taxon.tsv 🌱
└── work

summary_table_with_genes.tsv

summary_table_with_genes.tsv shows how complete a specific gene recovered from a specific taxon is compared to the most complete copy of that gene recovered from all taxa. Rows of this dataset represent each gene recovered for analysis and columns are organized as follows:

| column name | description |
|---|---|
| Column A | the zero-indexed sample number for the given gene |
| Column B (file_names) | the given name of the gene identified, displayed as a file path |
| Column C (max_bp) | the number of base pairs of the most complete copy of a particular gene (denoted in Column B) recovered |
| Columns D through n (taxa names) | the quotient of the number of base pairs recovered in a particular gene (denoted in Column B) for the given taxon divided by Column C |

[!NOTE] Cells with values approaching 1 show highly complete genes recovered relative to other taxa, while values approaching 0 denote relatively incomplete genes recovered.

[!NOTE] Cells with a score of 1 don't necessarily indicate the complete recovery of a particular gene, but that the particular taxon demonstrates the most complete copy relative to others within this dataset.
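
As an illustration of how this table might be screened, the sketch below parses the documented column layout by position and lists gene/taxon pairs that fall under a chosen completeness cutoff. The helper names and the sample rows are invented for this example, and treating an empty cell as 0 is an assumption rather than documented behavior. Each ratio is the taxon's recovered base pairs divided by max_bp, so 450 bp recovered against a 900 bp maximum gives 0.5.

```python
import csv
import io


def load_gene_table(handle):
    """Parse the gene table into {gene: {taxon: ratio}} using column positions."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    taxa = header[3:]  # Columns D onward hold one completeness ratio per taxon
    table = {}
    for row in reader:
        gene = row[1]  # Column B (file_names)
        # Empty cells are read as 0.0 (an assumption made for this sketch).
        table[gene] = {t: float(v) if v else 0.0 for t, v in zip(taxa, row[3:])}
    return table


def below(table, cutoff):
    """List (gene, taxon) pairs whose completeness ratio is under cutoff."""
    return [(g, t) for g, ratios in table.items() for t, r in ratios.items() if r < cutoff]


# Made-up rows in the documented layout: index, file_names, max_bp, then taxa.
sample = (
    "\tfile_names\tmax_bp\ttaxon_a\ttaxon_b\n"
    "0\tgene1.fasta\t900\t1.0\t0.5\n"
    "1\tgene2.fasta\t300\t0.2\t1.0\n"
)
table = load_gene_table(io.StringIO(sample))
print(below(table, 0.6))  # [('gene1.fasta', 'taxon_b'), ('gene2.fasta', 'taxon_a')]
```

With a real run, the file would be opened with `open(path, newline="")` and the handle passed to `load_gene_table` instead of the in-memory sample.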

summary_table_with_taxon.tsv

summary_table_with_taxon.tsv uses gene completeness scores from summary_table_with_genes.tsv to display the overall relative representation of each taxon in the final phylogeny. Row names represent each taxon, while columns denote the following:

| column name | description |
|---|---|
| Column A | the zero-indexed sample number for the given taxon |
| Column B (taxon) | the given name of the taxon provided, as denoted in the metadata file used to initiate OrthoGarden |
| Column C (present) | the number of genes of any length recovered for a particular taxon (denoted by Column B) |
| Column D (absent) | the number of genes that were not recovered by OrthoGarden for a particular taxon (denoted by Column B) |
| Column E (90%) | the number of genes recovered for a particular taxon (denoted by Column B) that are at least 90% of the length of the most complete gene recovered among all taxa |
| Column F (75%) | the number of genes recovered for a particular taxon (denoted by Column B) that are at least 75% of the length of the most complete gene recovered among all taxa |
| Column G (50%) | the number of genes recovered for a particular taxon (denoted by Column B) that are at least 50% of the length of the most complete gene recovered among all taxa |
| Column H (25%) | the number of genes recovered for a particular taxon (denoted by Column B) that are at least 25% of the length of the most complete gene recovered among all taxa |
| Column I (10%) | the number of genes recovered for a particular taxon (denoted by Column B) that are at least 10% of the length of the most complete gene recovered among all taxa |

[!NOTE] Taxa that are highly divergent from the ingroup, or that have highly incomplete assemblies, tend to yield less complete genes.

[!NOTE] Reducing the value of the taxonomic occupancy threshold (set by the --threshold_val parameter when initiating OrthoGarden) may increase the number of relatively complete genes recovered.
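
The relationship between the two summary tables can be sketched as a simple binning of one taxon's completeness ratios. This is an illustration of the column definitions above, not OrthoGarden's actual code: `summarize_taxon` is a hypothetical name, a ratio of 0 is assumed to mean the gene was not recovered, and the thresholds are treated as inclusive ("at least").

```python
# Completeness thresholds reported in columns E through I, as percentages.
THRESHOLDS = (90, 75, 50, 25, 10)


def summarize_taxon(ratios):
    """Count present/absent genes and genes meeting each completeness threshold."""
    # Assumption: a ratio of 0 marks a gene that was not recovered for this taxon.
    present = sum(1 for r in ratios if r > 0)
    summary = {"present": present, "absent": len(ratios) - present}
    for pct in THRESHOLDS:
        summary[f"{pct}%"] = sum(1 for r in ratios if r >= pct / 100)
    return summary


# Made-up ratios for one taxon across five genes.
print(summarize_taxon([1.0, 0.8, 0.5, 0.05, 0.0]))
# {'present': 4, 'absent': 1, '90%': 1, '75%': 2, '50%': 3, '25%': 3, '10%': 3}
```

Note that the threshold columns are cumulative under this reading: a gene counted under 90% is also counted under 75%, 50%, 25%, and 10%.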

Variability of recovered genes between taxa

MStatX is used to calculate the trident statistic for each nucleotide position in the alignment of each gene recovered by OrthoGarden. These scores are averaged per gene to determine a broad measure of variability for genes recovered by OrthoGarden and are organized into a summary table. Variable genes within such a dataset may be selected for the design of highly specific primers (Pokhrel et al. 2025). Rows of this summary table represent each gene and columns denote the averaged trident statistic.

.
├── publish
|   └── mstatx_scores
|       └── mstatx_scores.csv 🌱
└── work

[!NOTE] Lower trident statistic values in this table are associated with more variable genes.
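
To shortlist the most variable genes, the scores can be sorted in ascending order. The sketch below assumes that each row of mstatx_scores.csv holds a gene identifier in its first column and the averaged trident statistic in its last column; the exact layout is not described above, so the column positions may need adjusting. The function name and sample values are invented for illustration.

```python
import csv
import io


def rank_by_variability(handle):
    """Return (gene, score) pairs sorted from lowest to highest trident score."""
    scores = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        try:
            # Assumed layout: gene name first, averaged score last.
            scores.append((row[0], float(row[-1])))
        except ValueError:
            continue  # skip a header row or any non-numeric line
    return sorted(scores, key=lambda pair: pair[1])


# Made-up scores; lower values indicate more variable genes.
sample = "gene,score\ngene1,0.82\ngene2,0.41\ngene3,0.67\n"
print(rank_by_variability(io.StringIO(sample))[:2])  # [('gene2', 0.41), ('gene3', 0.67)]
```

The first entries of the returned list are the candidates with the lowest averaged trident statistic, which are the most variable genes in the dataset.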