Clustering
We provide three functions to run the clustering method of your choice:
- iclust (recommended): This function is optimized for iCellR and accepts PCA, UMAP, t-SNE, Destiny (diffusion map), PHATE, or KNetL map coordinates as input. It applies the Louvain algorithm to a graph built with k-nearest neighbors (KNN), similar to PhenoGraph (Levine et al., Cell, 2015). However, it uses distance values (Euclidean by default) as edge weights instead of Jaccard similarity values.
- run.phenograph: An R implementation of the PhenoGraph algorithm, provided as a wrapper around Rphenograph (Levine et al., Cell, 2015).
- run.clustering: This function offers a wide range of options for exploring your data with various clustering and indexing methods. You can select any combination from the table below to experiment with different approaches and "flavors" of analysis.
| clustering methods | distance methods | indexing methods |
|---|---|---|
| ward.D, ward.D2, single, complete, average, mcquitty, median, centroid, kmeans | euclidean, maximum, manhattan, canberra, binary, minkowski or NULL | kl, ch, hartigan, ccc, scott, marriot, trcovw, tracew, friedman, rubin, cindex, db, silhouette, duda, pseudot2, beale, ratkowsky, ball, ptbiserial, gap, frey, mcclain, gamma, gplus, tau, dunn, hubert, sdindex, dindex, sdbw |
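To make the difference between distance weights and Jaccard weights concrete, here is a minimal base R sketch on toy data. It does not reproduce the internals of iclust or Rphenograph; the toy coordinates, the value of k, and all variable names are assumptions made for illustration, and the Louvain step itself is left out because it needs a graph library.

```r
# Toy data: 6 cells x 2 dimensions (a stand-in for PC coordinates)
set.seed(1)
coords <- rbind(matrix(rnorm(6, mean = 0, sd = 0.3), ncol = 2),
                matrix(rnorm(6, mean = 5, sd = 0.3), ncol = 2))
rownames(coords) <- paste0("cell", 1:6)
k <- 2  # neighbors per cell (illustrative value)

# Full Euclidean distance matrix between cells
d <- as.matrix(dist(coords, method = "euclidean"))

# k nearest neighbors of each cell; position 1 of order() is the cell itself
knn.idx <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))

# Edge list: one row per (cell, neighbor) pair
edges <- data.frame(
  from = rep(seq_len(nrow(coords)), each = k),
  to   = as.vector(t(knn.idx))
)

# Distance-based weights: the Euclidean distance along each edge
edges$dist.weight <- d[cbind(edges$from, edges$to)]

# Jaccard-based weights: overlap between the two cells' neighbor sets
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
edges$jaccard.weight <- mapply(
  function(i, j) jaccard(knn.idx[i, ], knn.idx[j, ]),
  edges$from, edges$to
)

print(edges)
```

Either weight column could then be handed to a community detection routine such as Louvain; which weighting works better will depend on the data.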
Option 1: Clustering conventionally based on top PCs
Adjust sensitivity to get more or fewer clusters.
- Lower sensitivity values = more clusters.
- Higher sensitivity values = fewer clusters (reverse logic).
- Values of 100-150 generally work well for most data.
The top 10 PCs generally work well for most datasets, and we recommend starting with 10. Use opt.pcs.plot(my.obj) to see the suggested optimal number of PCs.
my.obj <- iclust(my.obj, sensitivity = 150, data.type = "pca", dims=1:10)
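As a rough illustration of what "choosing the number of PCs" means, the following base R sketch computes the variance explained per component on a simulated matrix. It is not how opt.pcs.plot works (its criterion may differ); the simulated data and the 80% cutoff are assumptions chosen only for the example.

```r
set.seed(42)
# Simulated matrix: 200 cells x 50 features driven by 3 latent signals
n.cells <- 200
n.genes <- 50
latent   <- matrix(rnorm(n.cells * 3), ncol = 3)
loadings <- matrix(rnorm(3 * n.genes), nrow = 3)
expr <- 3 * (latent %*% loadings) +
  matrix(rnorm(n.cells * n.genes), ncol = n.genes)

# PCA on centered, scaled features
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Fraction of total variance carried by each PC, and the running total
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
cum.var <- cumsum(var.explained)

# Smallest number of PCs reaching 80% cumulative variance
# (the 80% cutoff is an arbitrary illustration, not an iCellR default)
n.pcs <- which(cum.var >= 0.8)[1]
print(n.pcs)

# An elbow in this curve is a common visual cue for where to stop
plot(var.explained[1:20], type = "b",
     xlab = "PC", ylab = "Fraction of variance explained")
```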
Option 2: Clustering based on KNetL dimensions (or UMAP dimensions)
Conventionally, clustering is performed using PCA data (usually the first 10 dimensions). However, this function allows you to choose t-SNE, UMAP, or KNetL map dimensions as alternatives. If you have fine-tuned your KNetL map and are confident in its results, we recommend clustering based on the KNetL map.
Clustering can be one of the more challenging aspects of data analysis, and adjustments may be necessary based on marker genes. This might involve merging certain clusters, using gating tools (refer to our cell gating tools), or experimenting with different sensitivity values to identify a greater or smaller number of communities.
Notes:
- Adjust sensitivity to get more or fewer clusters.
- Lower sensitivity values = more clusters.
- Higher sensitivity values = fewer clusters (reverse logic).
- Values of 100-150 generally work well for most data.
my.obj <- iclust(my.obj, sensitivity = 150, data.type = "knetl")
# data.type could be umap or tsne, etc.
- Other examples of using iclust:
my.obj <- iclust(my.obj, sensitivity = 150, data.type = "umap")
# or
my.obj <- iclust(my.obj, sensitivity = 150, data.type = "tsne")
- Or use run.phenograph instead of iclust:
my.obj <- run.phenograph(my.obj, k = 100, data.type = "pca", dims=1:10)
- Alternatively, use the run.clustering function to pick and customize your adventure:
my.obj <- run.clustering(my.obj,
clust.method = "kmeans",
dist.method = "euclidean",
index.method = "silhouette",
max.clust = 25,
min.clust = 2,
dims = 1:10)
# To set the number of clusters manually instead of using the predicted optimal number, set both the minimum and the maximum to the number you want:
#my.obj <- run.clustering(my.obj,
# clust.method = "ward.D",
# dist.method = "euclidean",
# index.method = "ccc",
# max.clust = 8,
# min.clust = 8,
# dims = 1:10)
# more examples
#my.obj <- run.clustering(my.obj,
# clust.method = "ward.D",
# dist.method = "euclidean",
# index.method = "kl",
# max.clust = 25,
# min.clust = 2,
# dims = 1:10)
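The idea behind combining a clustering method, a distance method, and an index can be sketched in base R. The example below is not the run.clustering implementation; it assumes Ward linkage, Euclidean distance, and a hand-written Calinski-Harabasz index (the "ch" entry in the table above) on simulated data, and the helper ch.index is hypothetical.

```r
set.seed(7)
# Simulated data: three well-separated groups of 30 cells in 2 dimensions
centers <- c(0, 10, 20)
coords <- do.call(rbind, lapply(centers, function(m) {
  matrix(rnorm(60, mean = m, sd = 0.5), ncol = 2)
}))

# Distance method + clustering method
d  <- dist(coords, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# Calinski-Harabasz index: between-cluster dispersion relative to
# within-cluster dispersion (hypothetical helper, higher is better)
ch.index <- function(x, cl) {
  n <- nrow(x)
  k <- length(unique(cl))
  grand <- colMeans(x)
  wss <- 0
  bss <- 0
  for (g in unique(cl)) {
    xg <- x[cl == g, , drop = FALSE]
    cg <- colMeans(xg)
    wss <- wss + sum(sweep(xg, 2, cg)^2)
    bss <- bss + nrow(xg) * sum((cg - grand)^2)
  }
  (bss / (k - 1)) / (wss / (n - k))
}

# Score every candidate number of clusters between the two bounds
min.clust <- 2
max.clust <- 6
ks <- min.clust:max.clust
scores <- sapply(ks, function(k) ch.index(coords, cutree(hc, k = k)))

# Keep the number of clusters with the best index value
best.k <- ks[which.max(scores)]
clusters <- cutree(hc, k = best.k)
table(clusters)
```

Setting min.clust and max.clust to the same value leaves a single candidate, which is why equal bounds fix the number of clusters.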
Visualize the clustering results
# plot clusters (in the figures below, clustering was done based on KNetL)
# example: # my.obj <- iclust(my.obj, sensitivity = 150, data.type = "knetl")
A <- cluster.plot(my.obj, plot.type = "pca", interactive = FALSE, cell.size = 0.5, cell.transparency = 1, anno.clust = TRUE)
B <- cluster.plot(my.obj, plot.type = "umap", interactive = FALSE, cell.size = 0.5, cell.transparency = 1, anno.clust = TRUE)
C <- cluster.plot(my.obj, plot.type = "tsne", interactive = FALSE, cell.size = 0.5, cell.transparency = 1, anno.clust = TRUE)
D <- cluster.plot(my.obj, plot.type = "knetl", interactive = FALSE, cell.size = 0.5, cell.transparency = 1, anno.clust = TRUE)
library(gridExtra)
grid.arrange(A,B,C,D)
Re-numbering clusters based on their distances (optional):
This step renumbers the clusters so that clusters with similar gene expression receive consecutive numbers.
The re-ordering is helpful when you inspect the heatmap after identifying marker genes: similar cell communities appear next to each other, which makes them easier to compare visually. It can also help you decide which clusters may need merging or adjustment.
my.obj <- clust.ord(my.obj,top.rank = 500, how.to.order = "distance")
#my.obj <- clust.ord(my.obj,top.rank = 500, how.to.order = "random")
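One way to picture distance-based renumbering is to cluster the per-cluster mean profiles and relabel clusters by their position in the resulting dendrogram. The base R sketch below follows that idea on simulated data; it is an assumption about the general approach, not the clust.ord implementation, and it does not model the top.rank argument.

```r
set.seed(3)
# Simulated data: three clusters in a 1-D "expression space". Labels are
# deliberately out of order: 1 is near 0, 2 is near 20, 3 is near 8.
expr <- matrix(c(rnorm(20, mean = 0),
                 rnorm(20, mean = 20),
                 rnorm(20, mean = 8)), ncol = 1)
clusters <- rep(1:3, each = 20)

# Mean profile (centroid) of each cluster
centroids <- rowsum(expr, clusters) / as.vector(table(clusters))

# Hierarchical clustering of the centroids; the dendrogram leaf order
# places similar clusters next to each other
hc <- hclust(dist(centroids), method = "average")

# New label of each old cluster = its position in the leaf order
old.ids <- rownames(centroids)
new.label <- match(seq_along(old.ids), hc$order)
names(new.label) <- old.ids

# Apply the relabeling to every cell
reordered <- unname(new.label[as.character(clusters)])

print(new.label)
table(old = clusters, new = reordered)
```

In this toy case the two clusters with the closest centroids (old labels 1 and 3) end up with adjacent new numbers.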
Re-plot
A <- cluster.plot(my.obj, plot.type = "pca", interactive = FALSE, cell.size = 0.5, cell.transparency = 1, anno.clust = TRUE)
B <- cluster.plot(my.obj, plot.type = "umap", interactive = FALSE, cell.size = 0.5, cell.transparency = 1, anno.clust = TRUE)
C <- cluster.plot(my.obj, plot.type = "tsne", interactive = FALSE, cell.size = 0.5, cell.transparency = 1, anno.clust = TRUE)
D <- cluster.plot(my.obj, plot.type = "knetl", interactive = FALSE, cell.size = 0.5, cell.transparency = 1, anno.clust = TRUE)
library(gridExtra)
grid.arrange(A,B,C,D)
Cluster QC
clust.stats.plot(my.obj, plot.type = "box.mito", interactive = FALSE)
clust.stats.plot(my.obj, plot.type = "box.gene", interactive = FALSE)