Dimensions reduction - jsgounot/metagenomic-pipelines GitHub Wiki

Introduction

This post was originally written to describe MDS, but I later extended it to discuss other dimension reduction methods as well.

MDS is a non-linear technique for embedding data in a lower-dimensional space. It is similar to PCA (it actually includes PCA as a special case), but it works with non-Euclidean distances such as similarity values, which makes MDS suitable for exploring relationships in a dataset based on pairwise distances. See here for more details on MDS vs PCA.

The Euclidean distance is the distance between two points in a Euclidean space. It can be calculated by comparing a pair of values (for example the expression levels of 2 genes between two samples) using Pythagoras' theorem. It is the basis of PCA, or Principal Component Analysis, and this is why you can use PCA to explore distances between samples based on values measured for different features (such as gene expression levels).
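For instance, the Euclidean distance between two samples described by the expression levels of two genes (made-up numbers here) is just Pythagoras' theorem, e.g. with numpy:

```python
import numpy as np

# Two samples described by the expression level of two genes (made-up values)
sample_a = np.array([2.0, 5.0])
sample_b = np.array([5.0, 9.0])

# Pythagoras: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((sample_a - sample_b) ** 2))
print(euclidean)  # 5.0 (differences of 3 and 4 form a 3-4-5 right triangle)
```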

However, PCA does not work for other kinds of distances. This includes distances calculated from the same kind of input (gene expression levels) but with another calculation method (for example using fold change instead, see this video), but also any other distance calculation, such as ANI or Bray-Curtis distances.
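As an illustration (with hypothetical abundance values), the Bray-Curtis dissimilarity can be computed by hand; it is not a Euclidean distance, which is why PCA cannot use it directly:

```python
import numpy as np

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum of absolute differences
    divided by the total counts of both samples."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.abs(u - v).sum() / (u + v).sum()

# Hypothetical species abundance profiles for two samples
a = [10, 0, 5]
b = [2, 4, 5]
print(bray_curtis(a, b))  # (8 + 4 + 0) / (12 + 4 + 10) = 12/26 ~ 0.46
```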

Now, PCoA, or Principal Coordinate Analysis, also named metric MDS, is similar to PCA except that it can use non-Euclidean distances. Note therefore that giving Euclidean distances as input to PCoA will return the same result as PCA. Basically, PCoA is a PCA where you can use other linear distance values (or any scale data). On the other hand, non-metric MDS, or nMDS, uses ordinal values or a dissimilarity matrix. Ordinal data is a statistical data type in which variables exist in naturally occurring ordered categories (for example a ranking of peppers). The distance between two categories is not established with ordinal data (we don't know the distance between spicy and mild).
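To make the PCoA/PCA equivalence concrete, here is a minimal sketch of classical (eigendecomposition-based) PCoA in plain numpy, applied to Euclidean distances between made-up 2D points: the embedding reproduces the input distances exactly, just as PCA would (function name and data are illustrative only):

```python
import numpy as np

def classical_pcoa(dist, n_components=2):
    """Classical (metric) PCoA / Torgerson scaling: double centering
    of the squared distance matrix, then eigendecomposition."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    b = -0.5 * j @ d2 @ j                    # Gower's double centering
    evals, evecs = np.linalg.eigh(b)
    order = np.argsort(evals)[::-1]          # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    # scale eigenvectors by sqrt of (non-negative) eigenvalues
    return evecs[:, :n_components] * np.sqrt(np.maximum(evals[:n_components], 0))

# Euclidean distances between 4 made-up points (a 3 x 4 rectangle)
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

coords = classical_pcoa(dist)
# Pairwise distances in the embedding match the input distances
rec = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(np.allclose(rec, dist))  # True
```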

Important note relative to python implementation of PCoA (metric MDS)

As strange as it seems, there are different ways to implement PCoA. PCoA in scikit-learn, which can be obtained using MDS with metric=True, uses the SMACOF algorithm, while skbio's PCoA uses singular value decomposition (SVD). SVD is what most R packages use and is most likely what someone means when they talk about PCoA. You can read this too. This also means that metric MDS in scikit-learn will provide a stress value. Most articles online about PCoA describe the SVD approach, which does not provide a stress value and is not random. skbio's PCoA seems better in many ways and should be the de facto method when using non-ordinal distances.

Important note relative to stress value with scikit

MDS generates a stress value reflecting how well the embedding actually fits the input distances. To be comparable and useful, the stress value should be normalized, which is for now not the case in scikit-learn (only the raw value is available). Look at this PR (the last of a series) about this. Until a normalized stress value is available, it might be better to either use R or custom code to calculate this value, which will likely be needed for publication.
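As a sketch of such custom code, Kruskal's stress-1 is one common normalized form (assuming raw, untransformed dissimilarities; the function name is mine):

```python
import numpy as np

def kruskal_stress1(dissimilarities, embedding):
    """Kruskal's stress-1: sqrt(sum (d_ij - dhat_ij)^2 / sum d_ij^2),
    where d are the input dissimilarities and dhat the pairwise
    distances in the embedded space. 0 is a perfect fit; roughly,
    0.05 is often considered good and 0.2 poor."""
    d = np.asarray(dissimilarities, dtype=float)
    emb = np.asarray(embedding, dtype=float)
    dhat = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    iu = np.triu_indices_from(d, k=1)   # count each pair once
    return np.sqrt(np.sum((d[iu] - dhat[iu]) ** 2) / np.sum(d[iu] ** 2))

# A perfect embedding gives a stress of 0
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(kruskal_stress1(d, pts))  # 0.0
```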

t-SNE

I recently saw a Nature paper using t-SNE instead of PCA; t-SNE is supposedly much more accurate than PCA, especially for local (or neighbor) relationships. You can have a look here. Some people say that it should entirely replace PCA but, as usual, it depends on what you're looking for and on your dataset; here is an interesting answer about use cases for the two methods. Note that t-SNE does not entirely preserve the global relationships of your dataset and is not deterministic, meaning that two runs with the same dataset can give (slightly) different results.
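A minimal scikit-learn sketch (the clustered data and parameter choices are made up for illustration), pinning random_state to mitigate the non-determinism mentioned above:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two made-up clusters of 20 samples each, in 10 dimensions
data = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(5, 1, (20, 10))])

tsne = TSNE(
    n_components=2,
    perplexity=10,    # must be lower than the number of samples
    random_state=0,   # pin the seed: t-SNE is not deterministic
)
embedding = tsne.fit_transform(data)
print(embedding.shape)  # (40, 2)
```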

UMAP

See John's slides from 23 Feb 2023. UMAP looks like a method similar to t-SNE but focusing more on the global structure rather than local relationships. It should also be faster.


Code example for skbio

With a distance matrix as input

from skbio.stats.ordination import pcoa

names = dfmatrix.columns

# number of dimensions can be higher but not lower than 3
# if your matrix is not symmetrical, check below
r = pcoa(dfmatrix, number_of_dimensions=3)

res = r.samples.copy()
res['genomes'] = names

# Add to res your genomes metadata

labels = r.proportion_explained.to_dict()
labels = {key: f'{key} ({value * 100:.1f}%)' for key, value in labels.items()}

# 'Graph' is a custom plotting helper wrapping seaborn (.sns) and matplotlib (.ax)
graph = Graph(res)
graph.sns.scatterplot(x='PC1', y='PC2') # add hue and style here

graph.ax.set_xlabel(labels['PC1'])
graph.ax.set_ylabel(labels['PC2'])

graph.set_size_inches(6, 6)
graph.tight_layout()

PCoA needs your matrix to be symmetrical. This is not the case for all distance calculations (like fastANI or MUMmer distances), where dist(query, ref) != dist(ref, query) is quite usual. Here is some code to make your matrix symmetrical using the average distance between pairs:

# average each pair of mirrored values; the diagonal is unchanged
sym_matrix = (dfmatrix.values + dfmatrix.values.T) / 2

sym_matrix = pd.DataFrame(index=dfmatrix.index, columns=dfmatrix.columns, data=sym_matrix)
sym_matrix

Code example for scikit

Please read the paragraph about implementations above first: you most likely want to use skbio instead (unless you use ordinal values). If you want to use nMDS, just set metric=False when calling the MDS constructor.

As input you will need a symmetric matrix (upper triangle == lower triangle) with pairwise distances and without NaN or inf values.
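A quick, illustrative sanity check before feeding the matrix to MDS (the helper name and example matrix are mine):

```python
import numpy as np

def check_mds_input(matrix):
    """Sanity-check a pairwise distance matrix before calling MDS."""
    m = np.asarray(matrix, dtype=float)
    assert m.ndim == 2 and m.shape[0] == m.shape[1], 'matrix must be square'
    assert np.isfinite(m).all(), 'matrix contains NaN or inf values'
    assert np.allclose(m, m.T), 'matrix is not symmetric'

m = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.5],
              [2.0, 1.5, 0.0]])
check_mds_input(m)  # passes silently
```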

from sklearn import manifold

# We consider here that you're using a pandas dataframe
labels = df.columns
matrix = df.values

mds = manifold.MDS(
    n_components=2,
    metric=True,
    max_iter=3000,
    random_state=0,
    dissimilarity="precomputed",
    n_jobs=-1 # all cores used
)

sfit = mds.fit_transform(matrix)

sfit = pd.DataFrame(sfit, columns=['x', 'y'])
sfit['labels'] = labels
sfit

After merging your new dataframe with your metadata, you can quickly plot the results:

graph = Graph(sfit)
graph.sns.scatterplot(x='x', y='y', hue='hue')

graph.ax.set_xlabel('')
graph.ax.set_ylabel('')

graph.legend_outside()

graph.set_size_inches(8, 6)
graph.tight_layout()