
Ordination

Click for R-Script

Ordination is a collective term for multivariate techniques that summarize a multidimensional dataset in such a way that when it is projected onto a low-dimensional space, any intrinsic pattern the data may possess becomes apparent upon visual inspection.


Why is it necessary?

  • It is impossible to visualize more than a few dimensions simultaneously
  • Saves time compared with running a separate univariate analysis for every variable
  • Besides being a “dimension reduction technique”, by focusing on the ‘important dimensions’ we avoid interpreting (and misinterpreting) noise; ordination is thus also a ‘noise reduction technique’

Principal Component Analysis (PCA)

Principal Component Analysis, or PCA, is a dimensionality-reduction method that summarizes a large set of variables in a smaller number of components while preserving as much of the original variation as possible. PCA is used in exploratory data analysis and for building predictive models.

[Figure: PCA steps]

Characteristics

  • "Easy to implement" tool for exploratory data analysis & for making predictive models
  • Convenient visualization of high-dimensional data
  • Highly affected by outliers in data
  • Works best when variables are strongly (linearly) correlated

Steps

[Figure: PCA steps 1]
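
As a minimal sketch (not necessarily the approach taken in the linked R-script), PCA can be run in base R with `prcomp()`; the built-in `USArrests` dataset is used here purely for illustration:

```r
# PCA with base R; centering and scaling matter when variables use different units
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

summary(pca)   # proportion of variance captured by each principal component
biplot(pca)    # samples and variable loadings on PC1 vs PC2
```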

Principal Coordinate Analysis (PCoA)

Principal Coordinate Analysis, or PCoA, is a method for exploring and visualizing similarities or dissimilarities in data. It starts from a similarity or dissimilarity matrix (= distance matrix) and assigns each item a location in a low-dimensional space.

[Figure: PCoA steps]

Characteristics

  • Can handle a wide range of data, since any distance or dissimilarity measure can be used
  • Convenient visualization of high-dimensional data
  • Values of the objects along a PCoA axis of interest may be correlated with the original variables to help interpret that axis
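
A minimal PCoA sketch, assuming the `vegan` package (for the Bray-Curtis distance) and its bundled `dune` dataset; neither is prescribed by this page:

```r
library(vegan)   # for vegdist() and the example data
data(dune)       # dune meadow vegetation counts

d    <- vegdist(dune, method = "bray")   # dissimilarity (distance) matrix
pcoa <- cmdscale(d, k = 2, eig = TRUE)   # classical (metric) MDS = PCoA

plot(pcoa$points, xlab = "PCo1", ylab = "PCo2")
# rough proportion of variation on the first two axes
# (Bray-Curtis distances can produce some negative eigenvalues)
pcoa$eig[1:2] / sum(pcoa$eig)
```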

PCA vs PCoA

[Figure: PCA vs PCoA]

In short, PCA works directly on the variables themselves (their covariances or correlations), whereas PCoA works on any distance or dissimilarity matrix; with Euclidean distances, PCoA gives the same ordination as PCA.

Source: https://www.youtube.com/watch?v=HMOI_lkzW08&t=1s&ab_channel=StatQuestwithJoshStarmer

Interpreting Principal Components


  • There is a Principal Component/Coordinate for each dimension
  • If we have “n” variables, we obtain “n” Principal Components/Coordinates
    • PC1/PCo1/Dim1 spans the direction of the most variation
    • PC2/PCo2/Dim2 spans the direction of the 2nd most variation
    • …
    • PC“n”/PCo“n”/Dim“n” spans the direction of the “n”th most variation
  • Each axis has an eigenvalue whose magnitude indicates the amount of variation captured in that axis
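
For example (a sketch using base R's `prcomp()` on the built-in `USArrests` data, chosen only for illustration), the eigenvalues and the variation captured by each axis can be read off like this:

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2        # one eigenvalue per principal component
eigenvalues / sum(eigenvalues)   # proportion of total variation per axis
summary(pca)                     # the same information as a table
```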

Redundancy Analysis (RDA)

[Figure: RDA plot]

Redundancy Analysis (RDA) analyses the relationship between two tables of variables: a table of response variables and a table of explanatory variables. It is closely related to PCA.

PCA vs RDA

RDA can be described as a constrained PCA. PCA, having no constraints, searches for whatever directions best explain the variation in sample composition, whereas RDA restricts its axes to the part of that variation that can be explained by the supplied explanatory (constraining) variables.
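
A minimal constrained-ordination sketch with `vegan::rda()`, using the package's `dune`/`dune.env` example tables (an assumption for illustration, not part of this page):

```r
library(vegan)
data(dune)       # community table (response variables)
data(dune.env)   # environmental table (explanatory / constraining variables)

rda_fit <- rda(dune ~ A1 + Management, data = dune.env)  # constrained PCA
summary(rda_fit)   # variance split between constrained and unconstrained axes
anova(rda_fit)     # permutation test of the constraints
plot(rda_fit)      # triplot: sites, species, explanatory variables
```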

Correspondence Analysis (CA)

[Figure: CA biplot]

Correspondence Analysis is another technique similar to PCA; it summarises a two-way table of data (such as a contingency table of counts).

PCA vs CA

While PCA decomposes the relationships between columns only, CA decomposes columns and rows simultaneously. CA is more suitable for categorical (count) data than for continuous data.
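
A short CA sketch, assuming `vegan` (whose `cca()` performs a plain CA when no constraints are given) and its `dune` data:

```r
library(vegan)
data(dune)

ca_fit <- cca(dune)   # cca() with no constraints = correspondence analysis
summary(ca_fit)       # eigenvalues and proportion of inertia per CA axis
plot(ca_fit)          # biplot of rows (sites) and columns (species)
```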

Canonical Correspondence Analysis (CCA)


Canonical Correspondence Analysis identifies patterns that are common to two multivariate datasets and constructs sets of transformed variables by projecting the data onto these variables.

PCA vs CCA

PCA looks for the patterns within a single dataset that represent as much of its variation as possible, whereas CCA looks for patterns shared between two datasets that describe their joint variability.
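
A minimal CCA sketch, again assuming `vegan` and its `dune`/`dune.env` example tables:

```r
library(vegan)
data(dune)
data(dune.env)

cca_fit <- cca(dune ~ A1 + Moisture, data = dune.env)  # CA constrained by environment
summary(cca_fit)   # inertia split between constrained and unconstrained axes
plot(cca_fit)      # triplot: sites, species and environmental arrows
```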

Non-metric MultiDimensional Scaling (NMDS)

  • Non-Metric Multidimensional Scaling is fundamentally different from PCA and CA, and more robust: it produces an ordination based on a distance or dissimilarity matrix.
  • The ordination is based on ranks rather than distances: rather than object A being 2.1 units distant from object B and 4.4 units distant from object C, object C is simply the “first” most distant from object A while object B is the “second” most distant.
  • Avoids the assumption of linear relationships among variables

Stress

NMDS maximizes the rank-order correlation between the observed distance measures and the distances in ordination space. Points are moved iteratively to minimize “stress”, a measure of the mismatch between the two kinds of distance.
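
As a sketch, `vegan::metaMDS()` runs this iterative search and reports the final stress; the `dune` data are assumed here purely for illustration:

```r
library(vegan)
data(dune)

nmds <- metaMDS(dune, distance = "bray", k = 2, trymax = 50)  # iterative NMDS
nmds$stress            # final stress; values below roughly 0.2 are often considered usable
plot(nmds, type = "t") # ordination plot with site and species labels
```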

Shepard Diagram


"Goodness-of-fit" is measured by stress, a measure of rank order disagreement between observed & fitted(in the reduced dimension) distance. Ideally, all points should fall on the monotonic "red" line. Shepard Diagram helps us decide the number of dimensions we should use to plot our ordination results

[Figure: NMDS Shepard diagram]
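
Continuing the NMDS sketch above (same assumed `vegan`/`dune` example), `stressplot()` draws the Shepard diagram for a fitted NMDS:

```r
library(vegan)
data(dune)

nmds <- metaMDS(dune, distance = "bray", k = 2, trymax = 50)
stressplot(nmds)   # Shepard diagram: observed dissimilarity vs ordination distance
```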

Interpreting NMDS plots

[Figure: NMDS plot]

As with other ordination plots, you should qualitatively identify gradients corresponding to the underlying process.

Differences from eigenanalysis:

  • Does not extract eigenvector-based components from the distances, so the axes themselves have no intrinsic meaning
  • The plot can be rotated, translated, or scaled as long as relative distances are maintained

Scree Plot (For PCA, RDA, PCoA, CA or NMDS)

[Figure: Scree plot]

In multivariate statistics, a scree plot is a line plot of the eigenvalues of the factors or principal components in an analysis. It is used to decide how many principal components to keep in a principal component analysis (PCA).

In this case, we can see that two principal components are enough to capture approximately 90% of the variance in the data across all dimensions.
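
A scree plot for a PCA fit can be drawn with base R's `screeplot()`. This sketch uses the built-in `USArrests` data purely for illustration; the ~90% figure quoted above refers to this page's own example, not to this dataset:

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

screeplot(pca, type = "lines", main = "Scree plot")  # eigenvalue per component
cumsum(pca$sdev^2) / sum(pca$sdev^2)                 # cumulative proportion of variance
```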

Click for R-Script