Example output

a) Simple dataset from Scoary 1

How Scoary2 was run

# download dataset
wget --recursive -nd --no-parent -A csv,tsv https://scoary.bioinformatics.unibe.ch/datasets/scoary-1-tetracycline/
# run Scoary2
scoary2 \
    --genes Gene_presence_absence.csv \
    --gene-info gene-info.tsv \
    --traits Tetracycline_resistance.csv \
    --outdir out \
    --multiple_testing native:0.05 \
    --n-permut 10000
# the argument gene-info is optional

See output here.

b) Large metabolomics dataset (OrthoFinder)

How Scoary2 was run

# download dataset
wget --recursive -nd --no-parent -A tsv,txt https://scoary.bioinformatics.unibe.ch/datasets/44-propioni/
# run Scoary2
scoary2 \
    --genes N0.tsv \
    --gene-data-type 'gene-list:\t' \
    --gene-info N0_best_names.tsv \
    --traits traits_44_noraw.tsv \
    --trait-data-type 'gaussian:kmeans:\t' \
    --trait-info traits_info_44_noraw.tsv \
    --newicktree SpeciesTree_rooted.txt \
    --isolate-info isolate_info_44.tsv \
    --multiple_testing fdr_bh:0.1 \
    --n-permut 1000 \
    --n-cpus 8 \
    --random-state 42 \
    --outdir out

# After identifying significant traits, consider running Scoary2 again with '--trait-wise-correction True'
# and '--multiple_testing bonferroni:0.999' to see the significant traits in context of the wider dataset.
# This will lead to many more traits in the output, including many false positives, but it will also show 
# traits that may be related to the significant traits.

# The following are optional: gene-info, trait-info, isolate-info

See output of a metabolomics dataset here. Click here to see the output of the same dataset, but with --trait-wise-correction.

Before crunching the full dataset, we recommend using limit_traits together with a low n-permut to find suitable Scoary2 parameters. If the dataset has a particularly strong population structure, also use worst_cutoff to remove traits that merely correlate with the phylogeny. (See manual.)

Output files

`summary.tsv`

Table with one row per analyzed trait, summarizing the result. Columns:

Trait: name of the trait
best_fisher_p: uncorrected p-value of Fisher's test for the "best" gene
best_fisher_q: multiple testing corrected p-value of Fisher's test for the "best" gene
best_empirical_p: p-value of the post-hoc permutation test for the "best" gene
best_fq*ep: product of fisher_q and empirical_p for the "best" gene
...: potential additional metadata columns from trait-info.tsv

The "best" gene is defined as the gene with the lowest fq*ep score for that trait.
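The selection of the "best" gene can be sketched as follows. This is a minimal illustration, not Scoary2's own code: the three rows are made-up values, and only the column names Gene, fisher_q and empirical_p are taken from the description of results.tsv.

```python
import csv
import io

# Made-up rows in the layout of a per-trait results.tsv (illustrative values only).
RESULTS_TSV = (
    "Gene\tfisher_q\tempirical_p\n"
    "geneA\t0.04\t0.5\n"
    "geneB\t0.2\t0.01\n"
    "geneC\t0.03\t0.9\n"
)

def best_gene(tsv_text):
    """Return (gene, fq*ep) for the gene with the lowest fisher_q * empirical_p."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scored = [
        (row["Gene"], float(row["fisher_q"]) * float(row["empirical_p"]))
        for row in rows
    ]
    return min(scored, key=lambda pair: pair[1])

gene, score = best_gene(RESULTS_TSV)
print(gene, score)
```

Here geneB wins even though geneC has the lowest fisher_q, because the product also takes the permutation test into account.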

See also: Understanding the p-values
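As background for the corrected p-values, the following is a generic sketch of the Benjamini-Hochberg and Bonferroni corrections, which the method names fdr_bh and bonferroni in the example commands appear to refer to. It is not Scoary2's implementation, it does not cover the native mode, and the input p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

def bonferroni(p_values):
    """Return Bonferroni adjusted p-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

p = [0.01, 0.04, 0.03, 0.005]  # made-up uncorrected p-values
bh = benjamini_hochberg(p)
bonf = bonferroni(p)
print(bh)
print(bonf)
```

Bonferroni is the stricter of the two, which is why a cutoff such as 0.999 is suggested above when the goal is to keep nearly all traits in the output.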

`overview_plot.svg`

This SVG image file is made interactive in overview.html. It contains:

Left: Dendrogram of traits.
Middle: negative logarithms of best_fisher_q, best_empirical_p and best_fq*ep calculated by Scoary2.
Right: names of the traits.
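The values in the middle panel can be reproduced from summary.tsv roughly as shown below. The base of the logarithm is not stated here, so base 10 is an assumption, as is the clamping of zero p-values; the row is made up.

```python
import math

def neg_log10(p, floor=1e-300):
    """Return -log10(p); p is clamped to `floor` so that p == 0 does not raise."""
    return -math.log10(max(p, floor))

# Made-up summary.tsv row (column names as described for summary.tsv).
row = {"best_fisher_q": 0.001, "best_empirical_p": 0.01, "best_fq*ep": 1e-05}
scores = {name: neg_log10(value) for name, value in row.items()}
print(scores)
```

Larger values correspond to smaller p-values, so the most strongly associated traits stand out in the plot.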

`overview.html`

Makes overview_plot.svg interactive and links each trait to its trait.html.

See section How to use the app

`traits` (folder)

This folder contains a subfolder for each trait. These subfolders contain the following files:

results.tsv: The content of this file is similar to the main output of the original Scoary. Columns:
- Gene: Name of the gene
- Name: Description of the gene from the file passed via --gene-info (optional)
- g+t+: Number of isolates that have the gene (g+) and have the trait (t+)
- g+t-, g-t+, g-t-: See g+t+. These four numbers constitute the input for Fisher's test.
- sensitivity: The sensitivity of using this gene as a diagnostic test to determine trait-positivity; more details here
- specificity: The specificity of using this gene as a diagnostic test to determine trait-negativity; more details here
- odds_ratio: Odds ratio (quantifies the strength of the association)
- fisher_p, fisher_q: uncorrected and corrected p-value of Fisher's test
- empirical_p: p-value of the post-hoc permutation test
- fq*ep: product of fisher_q and empirical_p
- contrasting: The maximum number of pairs that contrast in both gene and trait characters that can be drawn on the phylogenetic tree without intersecting lines
- supporting, opposing: The maximum number of contrasting pairs that support/oppose the hypothesis
- best: p-value of picking n supporting pairs out of n contrasting pairs
- worst: p-value of picking n opposing pairs out of n contrasting pairs
coverage_matrix.tsv: Table that indicates which isolates have which genes
meta.json: Metadata about how the trait was binarized
values.tsv: Table that indicates the original continuous value of each isolate and how it was classified (optional)
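To make the relationship between the four counts and the derived columns concrete, here is a minimal sketch using textbook definitions. It is not Scoary2's code: Scoary2 may scale or round these values differently (for example, report percentages), and the counts are made up.

```python
from math import comb

def diagnostic_stats(gt, g_not_t, not_g_t, not_g_not_t):
    """Arguments are the counts g+t+, g+t-, g-t+ and g-t-.

    Assumes at least one trait-positive and one trait-negative isolate.
    """
    sensitivity = gt / (gt + not_g_t)  # trait-positive isolates that carry the gene
    specificity = not_g_not_t / (not_g_not_t + g_not_t)  # trait-negative isolates that lack it
    # The odds ratio is undefined if g+t- or g-t+ is zero; infinity is returned here.
    denom = g_not_t * not_g_t
    odds_ratio = (gt * not_g_not_t) / denom if denom else float("inf")
    return sensitivity, specificity, odds_ratio

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))

# Made-up counts: g+t+ = 8, g+t- = 2, g-t+ = 1, g-t- = 9
sens, spec, odds = diagnostic_stats(8, 2, 1, 9)
p_value = fisher_two_sided(8, 2, 1, 9)
print(sens, spec, odds, p_value)
```

With these counts the gene is present in 8 of 9 trait-positive isolates and absent from 9 of 11 trait-negative isolates, giving an odds ratio of 36.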

`trait.html`

Visualizes the data in a trait's folder and makes it interactive.

Shows a phylogenetic tree of the isolates, the tables results.tsv and coverage_matrix.tsv, a pie chart that shows how the orthogene and the trait intersect in the dataset, and a histogram of the continuous values, colored by whether each isolate has the orthogene and the trait.

See section How to use the app

`binarized_traits.tsv`

Binary trait matrix. Rows: isolates; columns: traits
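A quick way to check how balanced each binarized trait is could look like the sketch below. The cell encoding (1/0 or True/False, with empty cells for isolates that were not classified) is an assumption and should be checked against the actual file; the matrix is made up.

```python
import csv
import io

# Made-up matrix; rows are isolates, columns are traits. Encoding is an assumption.
MATRIX_TSV = (
    "isolate\ttraitA\ttraitB\n"
    "iso1\t1\t0\n"
    "iso2\t1\t\n"
    "iso3\t0\t0\n"
)

POSITIVE = {"1", "true"}
NEGATIVE = {"0", "false"}

def count_classes(tsv_text):
    """Count positive, negative and missing isolates for every trait column."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    traits = next(reader)[1:]  # first column holds the isolate names
    counts = {trait: {"pos": 0, "neg": 0, "missing": 0} for trait in traits}
    for row in reader:
        for trait, cell in zip(traits, row[1:]):
            value = cell.strip().lower()
            if value in POSITIVE:
                counts[trait]["pos"] += 1
            elif value in NEGATIVE:
                counts[trait]["neg"] += 1
            else:
                counts[trait]["missing"] += 1
    return counts

counts = count_classes(MATRIX_TSV)
print(counts)
```

Traits with very few isolates in one class have little statistical power, so such a count can help decide which traits are worth analyzing.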

`isolate_info.tsv`

Metadata about each isolate (optional)

`app` (folder)

Contains configuration, HTML and CSS for overview.html and trait.html

By modifying link-config, the behaviour of trait.html can be changed.

See section How to use the app

`logs` (folder)

Contains log files

Output - MrTomRod/scoary-2 GitHub Wiki
