Subcommand: correlation - lczech/gappa GitHub Wiki

Calculate the Edge Correlation of samples and metadata features.

Usage: gappa analyze correlation [options]

Options

Input
`--jplace-path`	Required. `TEXT:PATH(existing)=[] ...` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed.
Settings
`--mass-norm`	Required. `TEXT:{absolute,relative}=absolute` Set the per-sample normalization method. With `absolute`, the total mass is not changed, so that input jplace samples with more pqueries (more placed sequences) have a higher influence on the result. With `relative`, the total mass of each sample is normalized to 1.0, so that each sample has the same influence on the result, independent of its number of sequences and their abundances.
`--point-mass`	`FLAG` Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0.
`--ignore-multiplicities`	`FLAG` Set the multiplicity of each pquery to 1.0. For phylogenetic placement, the multiplicity is the equivalent of read abundances. This flag hence ignores the read abundances, treating each pquery as a singleton.
`--edge-values`	`TEXT:{both,imbalances,masses}=both` Values per edge used to calculate the correlation.
`--method`	`TEXT:{all,pearson,spearman,kendall}=all` Method of correlation.
Metadata Table Input
`--metadata-table-file`	Required. `TEXT:FILE` Tabular char-separated input file.
`--metadata-separator-char`	`TEXT:{comma,tab,space,semicolon}=comma` Separator char for tabular data.
`--metadata-select-columns`	`TEXT Excludes: --metadata-ignore-columns` Set the columns to select, by their name in the first (header) line of the table. All other columns are ignored. The option expects either a file with one column name per line, or an actual list of column names separated by `--metadata-separator-char`.
`--metadata-ignore-columns`	`TEXT Excludes: --metadata-select-columns` Set the columns to ignore, by their name in the first (header) line of the table. All other columns are selected. The option expects either a file with one column name per line, or an actual list of column names separated by `--metadata-separator-char`.
Color
`--color-list`	`TEXT=spectral` List of colors to use for the palette. Can either be the name of a color list, a file containing one color per line, or an actual comma-separated list of colors. Colors can be specified in the format `#rrggbb` using hex values, or by web color names.
`--reverse-color-list`	`FLAG` If set, the order of colors of the `--color-list` is reversed.
`--mask-color`	`TEXT=#dfdfdf` Color used to indicate masked or invalid values, such as infinities or NaNs. Color can be specified in the format `#rrggbb` using hex values, or by web color names.
Output
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Tree Output
`--write-newick-tree`	`FLAG` If set, the tree is written to a Newick file. This format cannot store color information.
`--write-nexus-tree`	`FLAG` If set, the tree is written to a Nexus file. This can for example be opened in FigTree.
`--write-phyloxml-tree`	`FLAG` If set, the tree is written to a Phyloxml file. This can for example be used in Archaeopteryx.
`--write-svg-tree`	`FLAG` If set, the tree is written to a SVG file. This gives a file for vector graphics editors.
Newick Tree Output
`--newick-tree-branch-length-precision`	`INT=6 Needs: --write-newick-tree` Number of digits to print for branch lengths in Newick format.
`--newick-tree-quote-invalid-chars`	`FLAG Needs: --write-newick-tree` If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and `:;()[],{}`) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools.
Svg Tree Output
`--svg-tree-shape`	`TEXT:{circular,rectangular}=circular Needs: --write-svg-tree` Shape of the tree.
`--svg-tree-type`	`TEXT:{cladogram,phylogram}=cladogram Needs: --write-svg-tree` Type of the tree, either using branch lengths (`phylogram`), or not (`cladogram`).
`--svg-tree-stroke-width`	`FLOAT=5 Needs: --write-svg-tree` Svg stroke width for the branches of the tree.
`--svg-tree-ladderize`	`FLAG Needs: --write-svg-tree` If set, the tree is ladderized.
Global Options
`--allow-file-overwriting`	`FLAG` Allow overwriting existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.

Description

The command takes a set of jplace files (called samples), as well as a table containing metadata features for each sample. It then calculates and visualizes the Edge Correlation with the metadata features per edge of the reference tree. The files need to have the same reference tree.

Edge Correlation is explained and evaluated in detail in our article (doi:10.1371/journal.pone.0217050). The following figure and its caption are an example adapted from this article:

Correlation Trees.

All subfigures show red edges or red paths at the clade on the left hand side of the tree. This indicates that presence of placements in this clade is anti-correlated with the used metadata feature. On the other hand, blue and green edges, which indicate positive correlation, are spread throughout the tree the same way in all subfigures. The extent of correlation is larger for Spearman’s Coefficient, indicating that the correlation is monotonic, but not strictly linear.

Details

By default, the command creates correlation trees for all valid metadata features, using all variants of the method. In the following, we first explain how to specify the metadata, and then how to change the default behavior.

Metadata Features (`--metadata-table-file`)

The metadata features are specified in a character-separated table file, by default a comma-separated one (.csv); the separator can be changed with `--metadata-separator-char`. The first row needs to contain the feature names, which are used as file names for the output files. The first column needs to contain the file names of the jplace files (samples) without extension.

Example:

File,Temperature,Salinity Sensor,Oxygen Sensor
ERR562588,19.85,36.32,221.47
ERR562558,23.83,37.49,n/a
ERR562591,26.23,36.62,199.94
ERR562643,21.44,37.89,207.79
ERR562637,26.64,35.36,189.81

This table specifies three types of metadata for five files ERR562588.jplace, ERR562558.jplace, etc. Note the n/a value in the last column. Any non-numerical value is interpreted as missing data, and is simply left out when calculating the correlation. That is, the last column only uses four data points.
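The handling of missing data described above can be sketched in a few lines of Python. This is only an illustration of the rule "non-numerical means missing", not gappa's actual parser; the helper names `to_number` and `read_features` are made up for this example.

```python
import csv
import io

# Same table as in the example above.
TABLE = """File,Temperature,Salinity Sensor,Oxygen Sensor
ERR562588,19.85,36.32,221.47
ERR562558,23.83,37.49,n/a
ERR562591,26.23,36.62,199.94
ERR562643,21.44,37.89,207.79
ERR562637,26.64,35.36,189.81
"""

def to_number(text):
    """Return the cell as a float, or None if it is not numerical."""
    # Note: float() also accepts strings such as "nan" or "inf";
    # this sketch does not treat them specially.
    try:
        return float(text)
    except ValueError:
        return None

def read_features(table_text):
    """Map each feature name to a {sample name: value or None} dict."""
    reader = csv.reader(io.StringIO(table_text))
    header = next(reader)  # first row: feature names
    features = {name: {} for name in header[1:]}
    for row in reader:
        sample = row[0]  # first column: jplace file name without extension
        for name, cell in zip(header[1:], row[1:]):
            features[name][sample] = to_number(cell)
    return features

features = read_features(TABLE)

# Drop missing values before correlating: only four samples remain.
oxygen = {s: v for s, v in features["Oxygen Sensor"].items() if v is not None}
print(len(oxygen), sorted(oxygen))
```

The sample `ERR562558` is left out for the oxygen feature only; it still contributes to the other two features.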

Feature Selection (`--metadata-select-columns`)

When specifying a comma-separated list of column headers of the metadata table, only these features are used. Otherwise, all numerical columns are used, and trees for all of them are created.

Example: In order to only use the first two features of the above table, specify `--metadata-select-columns "Temperature,Salinity Sensor"` with the command. Note the double quotes, which are necessary here, as one of the feature names contains a space.

Edge Masses and Imbalances (`--edge-values`)

Controls whether to use masses or imbalances. By default, trees using both of them are created. Using masses highlights the correlation of single edges, while using imbalances considers whole clades. See the article for details on the differences between these two variants.
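To make the difference between the two variants more concrete, here is a small Python sketch on a hypothetical four-edge tree. It assumes that the imbalance of an edge is the mass on the root side of the edge minus the mass on its distal side, leaving out the mass on the edge itself; this convention, the tree, and the mass values are assumptions for illustration, so refer to the article for the exact definition.

```python
# Hypothetical rooted tree; each edge is named after the node at its distal end.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": [],
    "A1": [],
    "A2": [],
}

# Hypothetical placement mass per edge for one sample.
masses = {"A": 0.1, "B": 0.3, "A1": 0.4, "A2": 0.2}

def distal_mass(node):
    """Sum of the masses on all edges below the given node."""
    return sum(masses[child] + distal_mass(child) for child in CHILDREN[node])

def imbalance(edge):
    """Root-side mass minus distal-side mass, ignoring the edge's own mass."""
    total = sum(masses.values())
    distal = distal_mass(edge)
    root_side = total - masses[edge] - distal
    return root_side - distal

for edge in masses:
    print(edge, "mass:", masses[edge], "imbalance:", round(imbalance(edge), 3))
```

Edge `A` carries little mass itself, but its imbalance reflects the combined mass of the clade below it. This is why imbalances respond to whole clades, while masses respond to single edges.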

Correlation Method (`--method`)

Controls which method of correlation is used for the visualization. We offer Pearson's r, Spearman's rho, and Kendall's tau (in the tau-b variant) correlation coefficients. By default, trees for all of them are created.
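As a rough illustration of how the three coefficients differ, the following Python sketch computes them for one metadata feature against a per-sample value on a single edge. The temperatures are taken from the example table above, while the edge masses are invented for this example; the functions are textbook definitions and not gappa's implementation.

```python
import math

def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, where tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = rank
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def kendall_tau_b(xs, ys):
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))

# Temperatures from the example table; edge masses are hypothetical.
temperature = [19.85, 23.83, 26.23, 21.44, 26.64]
edge_mass = [0.01, 0.05, 0.20, 0.02, 0.40]

print("pearson ", pearson(temperature, edge_mass))
print("spearman", spearman(temperature, edge_mass))
print("kendall ", kendall_tau_b(temperature, edge_mass))
```

In this made-up data set, the edge mass increases monotonically but not linearly with temperature, so the two rank-based coefficients reach 1 while Pearson's r stays below it. The sketch assumes that neither input is constant, as the denominators would otherwise be zero.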

Normalization (`--mass-norm`)

As the command is meant to show differences in a set of jplace sample files, it is important how those are normalized. Thus, the option is required.

If using --mass-norm relative, each sample (that is, each input jplace file) is normalized to unit mass 1.0, so that they all contribute equally to the result. Hence, the correlation is measured relatively. That is, a branch exhibits a high correlation with a metadata feature depending on the relative amount of placements on that branch (or in the clade, for imbalances) compared to the other placements in that sample.

On the other hand, if --mass-norm absolute is specified, the samples are not normalized. Thus, correlation is measured absolutely. Branches then exhibit a high correlation (or anti-correlation) with a metadata feature depending on the absolute number of placements on that branch (or clade). This can vastly differ from the normalized result, as the values then depend on the total number of pqueries in each sample, which in turn depends on factors such as amplification bias, rarefaction, and others that can change the total number of sequences per sample.

The decision whether to use relative or absolute abundances depends on the use case and what each sample represents. See our article for details.
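The effect of the two settings on the per-edge values can be sketched as follows. The function `normalize` and the mass values are hypothetical and only mirror the description above; they are not taken from gappa's code.

```python
def normalize(edge_masses, mode):
    """Return the per-edge masses of one sample under the given normalization."""
    if mode == "absolute":
        return list(edge_masses)  # total mass is left unchanged
    if mode == "relative":
        total = sum(edge_masses)
        return [mass / total for mass in edge_masses]  # total mass becomes 1.0
    raise ValueError("mode must be 'absolute' or 'relative'")

# Two hypothetical samples with the same composition but different sequencing depth.
sample_a = [10.0, 30.0, 60.0]
sample_b = [100.0, 300.0, 600.0]

for mode in ("absolute", "relative"):
    print(mode, normalize(sample_a, mode), normalize(sample_b, mode))
```

With `relative`, both samples yield the same per-edge values, so the tenfold difference in the number of sequences does not enter the correlation. With `absolute`, that difference is kept and can dominate the result.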

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Alexandros Stamatakis. Scalable Methods for Analyzing and Visualizing Phylogenetic Placement of Metagenomic Samples. PLOS ONE, 2019. doi:10.1371/journal.pone.0217050
