Isometric feature mapping (ISOmap) - AAU-Dat/P5-Nonlinear-Dimensionality-Reduction GitHub Wiki

Attention

I don't know which version of MDS is used for isomap!!! Different papers refer to slightly different versions of MDS.

This paper is king >> https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf

Strong paper about MDS + ISOmap >> https://arxiv.org/pdf/2009.08136.pdf

Original paper about ISOmap >> https://wearables.cc.gatech.edu/paper_of_week/isomap.pdf

Thesis about nonlinear dimensionality reduction >> https://core.ac.uk/download/pdf/144146939.pdf

https://proceedings.neurips.cc/paper/2018/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf

Motivation

Data is inherently low-dimensional, according to the manifold hypothesis, and that datasets can need not be described by their extrinsic dimensionality.

Intuition

Locally you can describe the manifold by local maps that are low dimensional Euclidian spaces, but globally you need to glue all these maps together to get the image of what is going on.

You do this by estimating the geodesic distances between the points of the manifold $-$ then we can find the embedding the of that points in the 2-dimensional that realizes these kind of distances, which can be done with Multidimensional Scaling(MDS).

isomap tries to find the true geodesic distances given only their Euclidean distances in the high-dimensional space.

Why do we have ISOmap?

Tenenbaum, the creator of ISOmap, says that PCA and MDS fall short with regards to some nonlinear manifolds, and that they only can compute the Euclidean structure in some manifolds, while ignoring the manifolds' geodesic distances, and thus don't detect the intrinsic dimensionlaity of the data.

In ISOmap, local distances are preserved, which make the embedding better for dimensionality reduction.

Algorithm

Build kNN graph on the data. We choose a small k. We should not introduce shortcuts that don't exist in the manifold*.
Use the shortest path distance on this graph.
We compute the Euclidean distance on the manifold. We go along the manifold, so as to estimate the geodesic distance.
We apply MDS on the new Distance matrix

The edges on the graph are the Euclidean coordinates in the ambient space. What we do is that we compute the local space in the ambient space, but it coincides with the geodesic distance.

The good with ISOmap

Basically Isomap is like MDS. It has a quirk: the geodesic distance.

ISOmap can recover the true dimensionality and geometric structure of a strictly larger class of nonlinear manifolds. An example is the Swiss Roll, which has a intrinsic geomtry resembling a convex region of Euclidean space, but in the Euclidean space of one that is folded.

ISOmap tries to find a low global optimal structure in the data. You can apply ISOmap when the nonlinear geometry impedes PCA or MDS from recovering true dimensionality.

A little bit about MDS

The algorithm for classical MDS is:

Create a matrix a of squared dissimilarities $\Delta^2$from the data (in the case of isomap it is the matrix filled with the geodesic distances)
Obtain the matrix B by double centering $\Delta^2$ $B=-\frac{1}{2}J\Delta^{2}J$, where $J=I_{n} -\frac{1}{n}1_n1_n^\top$
Find the solution to $B_\Delta=Q\Lambda Q^\top$ (eigenvalue decomposition)
Choose $K$ eigenvectors in decreasing order to embed $X$ into the desired embedding $Y$

Concepts you might not understand

A manifold is a topological space, i.e. a high-dimensional space that has local points that are Euclidean. In other words, a manifold is a low dimensional object that is embedded into a higher dimensional space.

The geodesic distance / curvilinear distance is the length of the shortest path between two points on the possibly curvy manifold.

degrees of freedom: features

convex region: the segment between any given two points in the region is a subset of the region.

concave region: the opposite of convex; at least one segment partially goes outside the region.

Kernel function isometric manifold. Euclidean space

Distortion: distortion measures the deviation of an embedding from isometry, which means that it shows how far the embedded points deviate from the original distances.

Extrinsic dimension: high-dimensional data is thought of as being of extrinsic dimensionality, which means that it is the dataset.

Intrinsic dimension: only a small fraction of the high-dimensional data is actually useful to extract the most important features/ get an optimal representation of the data.

Ambient space: Euclidean space

How MDS works

Misc

The difference between MDS and ISOMAP is that we are given data-pairs instead of data points.

We need to embed high-dimensional data ${x_{i}}^{n}{i=1}$ into the loewr dimensional embedded data ${y{i}}^{n}_{i=1}$ , where $n$ is the number of data points.

When we refer to input and output dimensions we say that $x_{i} \in \mathbf{R^{d}}$ and $y_{i} \in \mathbf{R^{p}}$ respectively, where the embedded space $p \leq d$

Our input data $X$ is noted as $\mathbf{R^{d \times n}}$ ; i.e. $X := [ x_{1},...x_{n} ]$

Can we find a space in the Euclidean space $d$?

Abstract space. $\Phi : \mathbf{X} ->$ which can map to the euclidian one.

Metric embedding problem.