Subcommand: squash - lczech/gappa GitHub Wiki

Perform Squash Clustering for a set of samples.

Usage: gappa analyze squash [options]

Options

Input
`--jplace-path`	Required. `TEXT:PATH(existing)=[] ...` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed.
Settings
`--exponent`	`FLOAT=1` Exponent for KR integration.
`--point-mass`	`FLAG` Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0.
`--ignore-multiplicities`	`FLAG` Set the multiplicity of each pquery to 1.0. For phylogenetic placement, the multiplicity is the equivalent of read abundances. This flag hence ignores the read abundances, treating each pquery as a singleton.
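The effect of `--point-mass` and `--ignore-multiplicities` can be illustrated with a minimal sketch. The function names below are hypothetical and not part of gappa, and the assumption that a placement's mass is its LWR times the pquery multiplicity is a simplification made for this illustration.

```python
def to_point_mass(lwrs):
    """Keep only the most likely placement and set its LWR to 1.0."""
    best = max(range(len(lwrs)), key=lambda i: lwrs[i])
    return [1.0 if i == best else 0.0 for i in range(len(lwrs))]


def pquery_masses(lwrs, multiplicity, point_mass=False, ignore_multiplicities=False):
    """Return the mass contributed at each placement location of one pquery.

    Assumption for this sketch: mass = LWR * multiplicity.
    """
    if point_mass:
        lwrs = to_point_mass(lwrs)
    if ignore_multiplicities:
        multiplicity = 1.0
    return [w * multiplicity for w in lwrs]


# One pquery with three placement locations and a read abundance of 4.
lwrs = [0.6, 0.3, 0.1]
print(pquery_masses(lwrs, 4.0))
print(pquery_masses(lwrs, 4.0, point_mass=True))
print(pquery_masses(lwrs, 4.0, point_mass=True, ignore_multiplicities=True))
```

With both flags set, every pquery contributes exactly one unit of mass at its best placement location.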
Color
`--color-list`	`TEXT=BuPuBk` List of colors to use for the palette. Can either be the name of a color list, a file containing one color per line, or an actual comma-separated list of colors. Colors can be specified in the format `#rrggbb` using hex values, or by web color names.
`--reverse-color-list`	`FLAG` If set, the order of colors of the `--color-list` is reversed.
`--log-scaling`	`FLAG` If set, the sequential color list is logarithmically scaled instead of linearly.
Output
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Tree Output
`--write-newick-tree`	`FLAG` If set, the tree is written to a Newick file. This format cannot store color information.
`--write-nexus-tree`	`FLAG` If set, the tree is written to a Nexus file. This can for example be opened in FigTree.
`--write-phyloxml-tree`	`FLAG` If set, the tree is written to a Phyloxml file. This can for example be used in Archaeopteryx.
`--write-svg-tree`	`FLAG` If set, the tree is written to an SVG file, which can be opened in vector graphics editors.
Newick Tree Output
`--newick-tree-branch-length-precision`	`INT=6 Needs: --write-newick-tree` Number of digits to print for branch lengths in Newick format.
`--newick-tree-quote-invalid-chars`	`FLAG Needs: --write-newick-tree` If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and `:;()[],{}`) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools.
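The two label-handling modes described for `--newick-tree-quote-invalid-chars` might look roughly like the following sketch. The function name is hypothetical, and the use of double quotes as the quotation character is an assumption; the exact output of gappa may differ.

```python
# Characters named in the option description as invalid in Newick labels.
INVALID_CHARS = set(" :;()[],{}")


def sanitize_label(label, quote_invalid_chars=False):
    """Mimic the two documented behaviours for labels with invalid characters."""
    if not any(c in INVALID_CHARS for c in label):
        return label  # nothing to do for clean labels
    if quote_invalid_chars:
        # Assumption: double quotes are used as quotation marks.
        return '"' + label + '"'
    # Default behaviour: replace each invalid character by an underscore.
    return "".join("_" if c in INVALID_CHARS else c for c in label)


print(sanitize_label("sample A:1"))
print(sanitize_label("sample A:1", quote_invalid_chars=True))
```

The default mode changes the label text, so names in the Newick file may no longer match the original sample names exactly.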
Svg Tree Output
`--svg-tree-shape`	`TEXT:{circular,rectangular}=circular Needs: --write-svg-tree` Shape of the tree.
`--svg-tree-type`	`TEXT:{cladogram,phylogram}=cladogram Needs: --write-svg-tree` Type of the tree, either using branch lengths (`phylogram`), or not (`cladogram`).
`--svg-tree-stroke-width`	`FLOAT=5 Needs: --write-svg-tree` Svg stroke width for the branches of the tree.
`--svg-tree-ladderize`	`FLAG Needs: --write-svg-tree` If set, the tree is ladderized.
Global Options
`--allow-file-overwriting`	`FLAG` Allow overwriting existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.
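As an illustration of how the options above combine, the following sketch assembles a command line from Python. The helper name `build_squash_command`, the input directory `samples/`, and the output directory `squash_out` are hypothetical; only the option names are taken from the list above. The command is executed only if a `gappa` binary and the input directory are actually present.

```python
import os
import shlex
import shutil
import subprocess


def build_squash_command(jplace_paths, out_dir=".", point_mass=False,
                         write_newick_tree=False, threads=None):
    """Assemble an argument list for `gappa analyze squash`."""
    cmd = ["gappa", "analyze", "squash", "--jplace-path", *jplace_paths,
           "--out-dir", out_dir]
    if point_mass:
        cmd.append("--point-mass")
    if write_newick_tree:
        cmd.append("--write-newick-tree")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_squash_command(["samples/"], out_dir="squash_out",
                           write_newick_tree=True, threads=4)
print(shlex.join(cmd))

# Run only when gappa is installed and the example input directory exists.
if shutil.which("gappa") and os.path.isdir("samples"):
    subprocess.run(cmd, check=True)
```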

Description

Performs Squash Clustering. The command is a re-implementation of guppy squash; see the documentation of that command for more details.

Details

The main output of the command is a cluster hierarchy tree that shows which input jplace samples are clustered close to each other. Although the tree is written in Newick format, it is not a phylogeny, as its tips represent samples (jplace files). The inner nodes are labeled with consecutive numbers starting at n, where n is the number of samples used as input.

If the --write-...-tree options are used, the mass trees representing the samples (tips of the cluster tree) and the mass trees of the inner nodes (average masses of the corresponding tips) are written for visualization. Their numbering is 0 to n-1 for the tips (samples), and n to 2n-2 for the inner nodes (cluster averages). These trees can help to explore how and why the samples were clustered during the algorithm.
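The numbering scheme can be illustrated with a toy version of the clustering loop. This is a sketch under stated assumptions, not gappa's implementation: each sample is a normalized mass vector on a path-shaped toy tree with unit branch lengths, the KR distance is computed on that path, and inner nodes are assumed to be numbered in the order in which clusters are merged. All function names are hypothetical.

```python
from itertools import combinations


def kr_distance(p, q, exponent=1.0):
    """KR distance between two mass vectors on a path-shaped toy tree.

    For each edge, take the difference in mass lying on one side of the
    edge, raise its absolute value to `exponent`, and sum over edges
    (all branch lengths are 1 in this toy setting).
    """
    total, diff = 0.0, 0.0
    for a, b in zip(p[:-1], q[:-1]):
        diff += a - b  # mass imbalance to the left of the current edge
        total += abs(diff) ** exponent
    return total ** min(1.0 / exponent, 1.0)


def squash_cluster(samples, exponent=1.0):
    """Return merges as (new_index, child_a, child_b) tuples."""
    n = len(samples)
    # Active clusters: index -> (mass vector, number of tips it contains).
    clusters = {i: (list(m), 1) for i, m in enumerate(samples)}
    merges = []
    next_index = n  # inner nodes are numbered n, n+1, ..., 2n-2
    while len(clusters) > 1:
        i, j = min(
            combinations(sorted(clusters), 2),
            key=lambda ij: kr_distance(clusters[ij[0]][0], clusters[ij[1]][0], exponent),
        )
        (mi, ci), (mj, cj) = clusters.pop(i), clusters.pop(j)
        # The merged cluster holds the average mass of all tips below it.
        merged = [(a * ci + b * cj) / (ci + cj) for a, b in zip(mi, mj)]
        clusters[next_index] = (merged, ci + cj)
        merges.append((next_index, i, j))
        next_index += 1
    return merges


samples = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.8],
    [0.0, 0.0, 0.0, 1.0],
]
merges = squash_cluster(samples)
for new_index, a, b in merges:
    print(f"inner node {new_index} = average of clusters {a} and {b}")
```

With n = 4 samples, the tips are numbered 0 to 3 and the three inner nodes 4 to 6, which matches the range n to 2n-2 described above.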

Citation

When using this method, please do not forget to cite

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Frederick Matsen, Steven Evans. Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison. PLOS ONE, 2013. doi:10.1371/journal.pone.0056859