Output - MrTomRod/scoary-2 GitHub Wiki
a) Simple dataset from Scoary 1
How Scoary was run
# download dataset
wget --recursive -nd --no-parent -A csv,tsv https://scoary.bioinformatics.unibe.ch/datasets/scoary-1-tetracycline/
# run Scoary2
scoary2 \
--genes Gene_presence_absence.csv \
--gene-info gene-info.tsv \
--traits Tetracycline_resistance.csv \
--outdir out \
--multiple_testing native:0.05 \
--n-permut 10000
# the argument gene-info is optional
See output here.
b) Large metabolomics dataset (OrthoFinder)
How Scoary was run
# download dataset
wget --recursive -nd --no-parent -A tsv,txt https://scoary.bioinformatics.unibe.ch/datasets/44-propioni/
# run Scoary2
scoary2 \
--genes N0.tsv \
--gene-data-type 'gene-list:\t' \
--gene-info N0_best_names.tsv \
--traits traits_44_noraw.tsv \
--trait-data-type 'gaussian:kmeans:\t' \
--trait-info traits_info_44_noraw.tsv \
--newicktree SpeciesTree_rooted.txt \
--isolate-info isolate_info_44.tsv \
--multiple_testing fdr_bh:0.1 \
--n-permut 1000 \
--n-cpus 8 \
--random-state 42 \
--outdir out
# After identifying significant traits, consider running Scoary2 again with '--trait-wise-correction True'
# and '--multiple_testing bonferroni:0.999' to see the significant traits in context of the wider dataset.
# This will lead to many more traits in the output, including many false positives, but it will also show
# traits that may be related to the significant traits.
# The following are optional: gene-info, trait-info, isolate-info
See output of a metabolomics dataset here.
Click here to see the output of the same dataset, but with --trait-wise-correction.
We recommend using limit_traits and a low n-permut to determine the optimal Scoary parameters before crunching the full dataset.
If the dataset has a particularly strong population structure, also use worst_cutoff to remove traits that merely correlate with the phylogeny. (See manual.)
Table that contains one row per trait analyzed, summarizing the result. Columns:
- Trait: name of the trait
- best_fisher_p: uncorrected p-value of Fisher's test for the "best" gene
- best_fisher_q: multiple-testing-corrected p-value of Fisher's test for the "best" gene
- best_empirical_p: p-value of the post-hoc permutation test for the "best" gene
- best_fq*ep: product of fisher_q and empirical_p for the "best" gene
- ...: potential additional metadata columns from trait-info.tsv
The "best" gene of a trait is defined as the gene with the lowest fq*ep score for that trait.
See also: Understanding the p-values
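The relationship between the columns can be sketched with a few lines of shell. This is a minimal illustration, not part of Scoary2: the file names `summary_demo.tsv` and `ranked_demo.tsv` and all values are hypothetical, and the real summary table may contain further columns.

```shell
# Hypothetical miniature summary table (tab-separated); real values come from Scoary2.
printf 'Trait\tbest_fisher_q\tbest_empirical_p\n' > summary_demo.tsv
printf 'trait_A\t0.04\t0.5\n' >> summary_demo.tsv
printf 'trait_B\t0.001\t0.01\n' >> summary_demo.tsv
printf 'trait_C\t0.2\t0.5\n' >> summary_demo.tsv

# Multiply the two p-values per trait (skipping the header),
# then sort so that the smallest product comes first.
awk -F'\t' 'NR > 1 { printf "%s\t%.3g\n", $1, $2 * $3 }' summary_demo.tsv \
    | sort -k2,2g > ranked_demo.tsv

cat ranked_demo.tsv
```

With these made-up numbers, trait_B ranks first because both its corrected Fisher p-value and its empirical p-value are small. Note that `sort -g` (general numeric sort, needed for values such as `1e-05`) is available in GNU and BSD sort but is not required by POSIX.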
This SVG image file is made interactive in output.html. It contains:
- Left: Dendrogram of traits.
- Middle: negative logarithms of best_fisher_q, best_empirical_p and best_fq*ep calculated by Scoary2.
- Right: names of the traits.
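To get a feeling for the scale of the middle panel, the transformation can be reproduced by hand. The sketch below assumes a base-10 logarithm, which the text above does not specify; the input values and the file names `pvalues_demo.tsv` and `neglog_demo.tsv` are hypothetical.

```shell
# Hypothetical values for one trait: corrected Fisher q-value and empirical p-value.
printf '0.001\t0.01\n' > pvalues_demo.tsv

# -log10(x) = -log(x) / log(10); awk's log() is the natural logarithm.
# Output columns: -log10(q), -log10(ep), -log10(q * ep)
awk -F'\t' '{
    q = $1; ep = $2
    printf "%.2f\t%.2f\t%.2f\n", -log(q) / log(10), -log(ep) / log(10), -log(q * ep) / log(10)
}' pvalues_demo.tsv > neglog_demo.tsv

cat neglog_demo.tsv
```

Smaller p-values give larger values on this scale, and the negative logarithm of the product is the sum of the other two.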
Makes overview_plot.svg interactive and links traits to traits.html
See section How to use the app
This folder contains a subfolder for each trait. These subfolders contain the following files:
- results.tsv: The content of this file is similar to the main output of the original Scoary. Columns:
  - Gene: Name of the gene
  - Name: Description of the gene from gene_info.tsv (optional)
  - g+t+: Number of isolates that have the gene (g+) and have the trait (t+)
  - g+t-, g-t+, g-t-: See g+t+. These four numbers constitute the input for Fisher's test.
  - sensitivity: The sensitivity of using this gene as a diagnostic test to determine trait-positivity; more details here
  - specificity: The specificity of using this gene as a diagnostic test to determine trait-negativity; more details here
  - odds_ratio: Odds ratio (quantifies the strength of the association)
  - fisher_p, fisher_q: uncorrected and corrected p-value of Fisher's test
  - empirical_p: p-value of the post-hoc permutation test for this gene
  - fq*ep: product of fisher_q and empirical_p
  - contrasting: The maximum number of pairs that contrast in both gene and trait characters that can be drawn on the phylogenetic tree without intersecting lines
  - supporting, opposing: The maximum number of contrasting pairs that support/oppose the hypothesis
  - best: p-value of picking n supporting pairs out of n contrasting pairs
  - worst: p-value of picking n opposing pairs out of n contrasting pairs
- coverage_matrix.tsv: Table that indicates which isolates have which genes
- meta.json: Metadata about how the trait was binarized
- values.tsv: Table that indicates the original continuous value of each isolate and how it was classified (optional)
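As a rough illustration of how the four counts in results.tsv relate to sensitivity, specificity and odds_ratio, the sketch below applies the textbook definitions to one hypothetical gene. It is not Scoary2's own code: Scoary2 may scale these values differently (for example as percentages) or treat zero cells in another way, and the file names `counts_demo.tsv` and `metrics_demo.tsv` are made up.

```shell
# Hypothetical counts for one gene, in the order g+t+, g+t-, g-t+, g-t-.
printf 'Gene\tg+t+\tg+t-\tg-t+\tg-t-\n' > counts_demo.tsv
printf 'geneX\t18\t2\t2\t28\n' >> counts_demo.tsv

# Textbook definitions (assumes at least one trait-positive and one trait-negative isolate):
#   sensitivity = g+t+ / (g+t+ + g-t+)   share of trait-positive isolates that carry the gene
#   specificity = g-t- / (g-t- + g+t-)   share of trait-negative isolates that lack the gene
#   odds ratio  = (g+t+ * g-t-) / (g+t- * g-t+)
awk -F'\t' 'NR > 1 {
    a = $2; b = $3; c = $4; d = $5
    sens = a / (a + c)
    spec = d / (d + b)
    # Avoid division by zero when one of the discordant cells is empty.
    if (b * c == 0) { odds = "inf" } else { odds = sprintf("%.1f", (a * d) / (b * c)) }
    printf "%s\t%.2f\t%.2f\t%s\n", $1, sens, spec, odds
}' counts_demo.tsv > metrics_demo.tsv

cat metrics_demo.tsv
```

For these counts, 18 of the 20 trait-positive isolates carry the gene and 28 of the 30 trait-negative isolates lack it, so the gene would be a fairly good marker for the trait.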
Visualizes the data in a trait's folder and makes it interactive.
Shows a phylogenetic tree of the isolates, the tables results.tsv and coverage_matrix.tsv,
a pie chart that shows how the orthogene and the trait intersect in the dataset and
a histogram of the continuous values, colored by whether each isolate has the orthogene and the trait.
See section How to use the app
Binary trait matrix. Rows: isolates; columns: traits
Metadata about each isolate (optional)
Contains configuration, HTML and CSS for overview.html and trait.html
By modifying link-config, the behaviour of trait.html can be changed.
See section How to use the app
Contains log files