Subcommand: imbalance kmeans
Run Imbalance k-means clustering on a set of samples.
Usage: `gappa analyze imbalance-kmeans [options]`
| Input | |
|---|---|
| `--jplace-path` | Required. `TEXT:PATH(existing)=[] ...` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed. |
| Settings | |
| `--k` | Required. `TEXT` Number of clusters to find. Can be a comma-separated list of multiple values or ranges for k, such as `1-5,8,10,12`. |
| `--write-overview-file` | `FLAG` If provided, a table file is written that summarizes the average distance and variance of the clusters for each k. Useful for elbow plots. |
| `--point-mass` | `FLAG` Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0. |
| `--ignore-multiplicities` | `FLAG` Set the multiplicity of each pquery to 1.0. For phylogenetic placement, the multiplicity is the equivalent of read abundances. This flag hence ignores the read abundances, treating each pquery as a singleton. |
| Color | |
| `--color-list` | `TEXT=BuPuBk` List of colors to use for the palette. Can either be the name of a color list, a file containing one color per line, or an actual comma-separated list of colors. Colors can be specified in the format `#rrggbb` using hex values, or by web color names. |
| `--reverse-color-list` | `FLAG` If set, the order of colors of the `--color-list` is reversed. |
| `--log-scaling` | `FLAG` If set, the sequential color list is scaled logarithmically instead of linearly. |
| Output | |
| `--out-dir` | `TEXT=.` Directory to write output files to. |
| `--file-prefix` | `TEXT=ikmeans_` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
| `--file-suffix` | `TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
| Tree Output | |
| `--write-newick-tree` | `FLAG` If set, the tree is written to a Newick file. This format cannot store color information. |
| `--write-nexus-tree` | `FLAG` If set, the tree is written to a Nexus file. This can for example be opened in FigTree. |
| `--write-phyloxml-tree` | `FLAG` If set, the tree is written to a Phyloxml file. This can for example be used in Archaeopteryx. |
| `--write-svg-tree` | `FLAG` If set, the tree is written to an SVG file. This gives a file for vector graphics editors. |
| Newick Tree Output | |
| `--newick-tree-branch-length-precision` | `INT=6` Needs: `--write-newick-tree`. Number of digits to print for branch lengths in Newick format. |
| `--newick-tree-quote-invalid-chars` | `FLAG` Needs: `--write-newick-tree`. If set, node labels that contain characters that are invalid in the Newick format (i.e., spaces and `:;()[],{}`) are put into quotation marks. If not set (default), these characters are instead replaced by underscores, which changes the names, but works better with most downstream tools. |
| Svg Tree Output | |
| `--svg-tree-shape` | `TEXT:{circular,rectangular}=circular` Needs: `--write-svg-tree`. Shape of the tree. |
| `--svg-tree-type` | `TEXT:{cladogram,phylogram}=cladogram` Needs: `--write-svg-tree`. Type of the tree, either using branch lengths (`phylogram`), or not (`cladogram`). |
| `--svg-tree-stroke-width` | `FLOAT=5` Needs: `--write-svg-tree`. SVG stroke width for the branches of the tree. |
| `--svg-tree-ladderize` | `FLAG` Needs: `--write-svg-tree`. If set, the tree is ladderized. |
| Global Options | |
| `--allow-file-overwriting` | `FLAG` Allow overwriting existing output files instead of aborting the command. |
| `--verbose` | `FLAG` Produce more verbose output. |
| `--threads` | `UINT` Number of threads to use for calculations. |
| `--log-file` | `TEXT` Write all output to a log file, in addition to standard output to the terminal. |
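
Two of the settings above are easier to picture with a small example: the list syntax accepted by `--k`, and the effect of `--point-mass` on the likelihood weight ratios (LWRs) of a single pquery. The following Python sketch is illustrative only and is not gappa's implementation. The function names `expand_k_spec` and `to_point_mass` are hypothetical, and two details are assumptions of the sketch: the expanded k values are sorted and deduplicated, and the ignored placements are represented with an LWR of 0.0.

```python
def expand_k_spec(spec):
    """Expand a --k style value such as "1-5,8,10,12" into a list of ints.

    Assumption of this sketch: the result is sorted and deduplicated.
    """
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range "lo-hi" is taken to include both end points.
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                raise ValueError(f"invalid range: {part!r}")
            values.update(range(lo, hi + 1))
        else:
            values.add(int(part))
    if any(v < 1 for v in values):
        raise ValueError("k must be at least 1")
    return sorted(values)


def to_point_mass(lwrs):
    """Keep only the placement with the highest LWR and set its LWR to 1.0.

    Assumption of this sketch: all other placements are kept in the list
    with an LWR of 0.0 rather than being removed.
    """
    best = max(range(len(lwrs)), key=lambda i: lwrs[i])
    return [1.0 if i == best else 0.0 for i in range(len(lwrs))]


print(expand_k_spec("1-5,8,10,12"))    # [1, 2, 3, 4, 5, 8, 10, 12]
print(to_point_mass([0.2, 0.7, 0.1]))  # [0.0, 1.0, 0.0]
```

With a value such as `1-5,8,10,12`, a clustering is computed for each of the listed values of k, which is what makes the overview file useful for elbow plots.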
Imbalance k-means has almost the same usage as Phylogenetic k-means. See there for details. The difference is in the distance measure being used, which is a simple Euclidean distance of the edge imbalances of the samples, instead of using the more involved Phylogenetic KR distance between samples.
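
To make the distance measure concrete, the sketch below clusters a few samples that are each represented by a vector of per-edge imbalances, using plain Euclidean distance and a basic Lloyd-style k-means loop. It is a minimal illustration under stated assumptions, not gappa's code: the imbalance values are made up, the function names are hypothetical, the initial centroids are naively taken from the first k samples, and computing the imbalances from jplace files is not shown.

```python
import math


def imbalance_distance(a, b):
    """Euclidean distance between two per-edge imbalance vectors.

    math.dist raises ValueError if the vectors differ in length, which would
    mean the samples do not share the same reference tree.
    """
    return math.dist(a, b)


def assign_to_nearest(samples, centroids):
    """Assignment step: index of the nearest centroid for each sample."""
    return [
        min(range(len(centroids)), key=lambda c: imbalance_distance(s, centroids[c]))
        for s in samples
    ]


def imbalance_kmeans(samples, k, max_iterations=100):
    """Basic k-means on imbalance vectors (naive init: first k samples)."""
    centroids = [list(s) for s in samples[:k]]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = assign_to_nearest(samples, centroids)
        if new_assignment == assignment:
            break  # converged: no sample changed its cluster
        assignment = new_assignment
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical imbalance vectors for four samples on a tree with two edges.
samples = [[0.9, 0.8], [1.0, 0.7], [-0.9, -0.8], [-1.0, -0.7]]
assignment, centroids = imbalance_kmeans(samples, k=2)
print(assignment)
```

In this toy input, the first two samples end up in one cluster and the last two in the other. Because each sample is reduced to a fixed-length vector, every distance computation is a single pass over the edges.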
When using this method, please do not forget to cite:
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
Lucas Czech, Alexandros Stamatakis. Scalable Methods for Analyzing and Visualizing Phylogenetic Placement of Metagenomic Samples. PLOS ONE, 2019. doi:10.1371/journal.pone.0217050