09.Machine learning06.Clustering - sporedata/researchdesigneR GitHub Wiki

1. Use cases: in which situations should I use this method?

Uncovering clusters of similar patients within the data using an unsupervised machine learning approach.
Reducing the dimensionality (number of variables) of a dataset, usually as a step preceding a supervised machine learning approach.
Exploring groups of patients with a similar outcome, which may be possible when outcome variables are included among the patient characteristics given to the unsupervised model.
Renaming or recategorizing a disease [1] [4].

2. Algorithm: how does the method work?

Model mechanics

A trajectory is a sequence of clinical events, usually defined by whichever events are available in the data. For plotting trajectories, two options are parallel coordinate plots and treatment timelines.

Describing in words

Clustering models are a form of unsupervised machine learning, characterized by the lack of an outcome variable when training the models.

Finite mixture models are also a type of unsupervised machine learning, but they take a model-based approach to clustering: the clusters are derived from a probabilistic model that describes the distribution of the data.
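To make the model-based idea concrete, here is a minimal Python sketch of a two-component univariate Gaussian mixture fitted with the expectation-maximization (EM) algorithm. It is an illustration under stated assumptions, not the implementation used by any of the packages listed on this page: the function names, the starting values, the fixed number of iterations, and the made-up measurements are all hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_two_component_mixture(values, n_iter=200):
    """EM for a two-component univariate Gaussian mixture (hypothetical helper)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                          # assumed start: one component at each extreme
    start_sd = (hi - lo) / 4 if hi > lo else 1.0
    sds = [start_sd, start_sd]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each value belongs to each component
        resp = []
        for x in values:
            dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and standard deviation of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor keeps a component from collapsing
    return weights, means, sds, resp

# Made-up measurements forming two visibly separated groups (illustrative only)
values = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 8.7, 9.1, 9.4, 8.9, 9.0, 9.3]
weights, means, sds, resp = fit_two_component_mixture(values)

# Hard cluster labels: the component with the larger posterior probability
labels = [0 if r[0] >= r[1] else 1 for r in resp]
print("weights:", [round(w, 3) for w in weights])
print("means:  ", [round(m, 3) for m in means])
print("labels: ", labels)
```

Unlike a distance-based method, the mixture model returns a membership probability for every observation (`resp`), so uncertain assignments stay visible instead of being forced into a single cluster.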

Data science packages

mixtools - tools for analyzing finite mixture models.
ppci - optimal linear cluster separators, i.e. separating hyperplanes chosen by criteria such as minimum density.
microclustr - for clustering of categorical variables
kml and kml3d - R packages offering an implementation of k-means built specifically to work on trajectories (kml) or on joint trajectories (kml3d). They provide a variety of tools for longitudinal data, including quality criteria to determine the optimal number of clusters (four classic and one original), starting conditions for k-means (four classic and three original), and imputation methods for trajectories. They also provide graphic tools for "visualizing" the trajectories in 2D (single trajectory) or 3D (joint trajectories). A 3D dynamic rotating PDF graph depicting the mean joint trajectories of each cluster may be exported through LaTeX [6]. Plotting possibilities include parallel coordinate plots, alluvial plots, and possibly Sankey plots.
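The idea behind k-means on trajectories can be sketched in a few lines. The Python example below is an illustration only and does not use or reproduce the kml API: the function names, the made-up trajectories, the farthest-point starting condition, and the restriction to complete (non-missing) trajectories are all assumptions. Each trajectory is treated as a vector of repeated measurements, clustered with k-means, and candidate numbers of clusters are compared with the Calinski-Harabasz criterion, a classic quality criterion for this purpose.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two trajectories of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_trajectory(trajs):
    """Pointwise mean of a list of trajectories."""
    return [sum(col) / len(trajs) for col in zip(*trajs)]

def farthest_point_init(trajs, k):
    """Assumed starting condition: first trajectory, then the farthest remaining ones."""
    centers = [trajs[0]]
    while len(centers) < k:
        centers.append(max(trajs, key=lambda t: min(sq_dist(t, c) for c in centers)))
    return centers

def kmeans_trajectories(trajs, k, n_iter=50):
    """Plain k-means in which every observation is a whole trajectory."""
    centers = farthest_point_init(trajs, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(t, centers[c])) for t in trajs]
        if new_labels == labels:      # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [t for t, lab in zip(trajs, labels) if lab == c]
            if members:               # keep the old center if a cluster empties out
                centers[c] = mean_trajectory(members)
    return labels, centers

def calinski_harabasz(trajs, labels, centers):
    """Between-cluster over within-cluster dispersion; higher is better.
    Assumes the within-cluster dispersion is non-zero."""
    n, k = len(trajs), len(centers)
    overall = mean_trajectory(trajs)
    between = sum(labels.count(c) * sq_dist(centers[c], overall) for c in range(k))
    within = sum(sq_dist(t, centers[lab]) for t, lab in zip(trajs, labels))
    return (between / (k - 1)) / (within / (n - k))

# Made-up trajectories measured at four time points: three rising, three flat
trajectories = [
    [1.0, 3.0, 5.0, 7.0], [1.2, 3.1, 5.2, 7.1], [0.9, 2.8, 4.9, 6.8],
    [1.0, 1.1, 0.9, 1.0], [1.2, 1.0, 1.1, 1.2], [0.8, 0.9, 1.0, 0.9],
]

results, scores = {}, {}
for k in (2, 3):
    results[k] = kmeans_trajectories(trajectories, k)
    scores[k] = calinski_harabasz(trajectories, *results[k])

best_k = max(scores, key=scores.get)
print("best k:", best_k)
print("labels:", results[best_k][0])
```

The cluster centers returned for the chosen `k` are the mean trajectories of each cluster, which are the natural objects to draw when plotting the result.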

Learning materials

Articles combining theory and scripts
- An Introduction to Latent Variable Mixture Modeling (Part 1): Overview and Cross-Sectional Latent Class and Latent Profile Analyses [2]
- An Introduction to Latent Variable Mixture Modeling (Part 2): Longitudinal Latent Class Growth Analysis and Growth Mixture Models [3]
- Common references for machine learning
- ggforce: Make a Hull Plot to Visualize Clusters in ggplot2

References

[1] Doust J, Vandvik PO, Qaseem A, Mustafa RA, Horvath AR, Frances A, Al-Ansary L, Bossuyt P, Ward RL, Kopp I, Gollogly L. Guidance for modifying the definition of diseases: A checklist. JAMA Internal Medicine. 2017 Jul 1;177(7):1020-5.

[2] Berlin KS, Williams NA, Parra GR. An Introduction to Latent Variable Mixture Modeling (Part 1): Overview and Cross-Sectional Latent Class and Latent Profile Analyses. Journal of Pediatric Psychology. 2014 Mar 1;39(2):174-87.

[3] Berlin KS, Parra GR, Williams NA. An Introduction to Latent Variable Mixture Modeling (Part 2): Longitudinal Latent Class Growth Analysis and Growth Mixture Models. Journal of Pediatric Psychology. 2014 Mar 1;39(2):188-203.

[4] Mariampillai K, Granger B, Amelin D, Guiguet M, Hachulla E, Maurier F, Meyer A, Tohmé A, Charuel JL, Musset L, Allenbach Y. Development of a New Classification System for Idiopathic Inflammatory Myopathies Based on Clinical Manifestations and Myositis-Specific Autoantibodies. JAMA Neurology. 2018 Dec 1;75(12):1528-37.

[5] Michael Hahsler, Ian Johnson, Tomáš Kliegr and Jaroslav Kuchař. Associative Classification for numerical and categorical variables. The R Journal (2019) 11:2, pages 254-267.

[6] Genolini C, Alacoque X, Sentenac M, Arnaud C. kml and kml3d: R packages to cluster longitudinal data. Journal of Statistical Software. 2015;65:1-34.