Ordination
Click for R-Script
Ordination is a collective term for multivariate techniques that summarize a multidimensional dataset in such a way that when it is projected onto a low-dimensional space, any intrinsic pattern the data may possess becomes apparent upon visual inspection.
Why is it necessary?
- It is impossible to visualize multiple dimensions simultaneously
- Saves time, in contrast to a separate univariate analysis
- Beyond being a “dimension reduction technique”, ordination focuses on the ‘important dimensions’, so we avoid interpreting (and misinterpreting) noise; it is thus also a ‘noise reduction technique’
Principal Component Analysis (PCA)
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce large data sets to a small number of components while preserving as much information as possible. PCA is used in exploratory data analysis and for making predictive models.
Characteristics
- "Easy to implement" tool for exploratory data analysis & for making predictive models
- Convenient visualization of high-dimensional data
- Highly affected by outliers in data
- Favours strong correlations: it captures linear correlation structure, so it works best when variables are strongly correlated
Steps
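The wiki's worked examples live in the linked R script; as an illustrative sketch of the same steps (centering, covariance, eigendecomposition, projection) on made-up toy data, here is a NumPy version:

```python
import numpy as np

# Toy data: 6 samples, 3 variables (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X[:, 1] += X[:, 0]                    # induce a correlation between variables

# Step 1: center each variable (subtract its mean)
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix of the centered data
C = np.cov(Xc, rowvar=False)

# Step 3: eigendecomposition; sort eigenvalues in decreasing order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project the data onto the principal components
scores = Xc @ eigvecs

# Each eigenvalue's share of the total is the variance captured by that PC
explained = eigvals / eigvals.sum()
print(explained)
```

The eigenvalue shares printed at the end are what a scree plot (discussed below) displays.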
Principal Coordinate Analysis (PCoA)
Principal Coordinate Analysis, or PCoA, is a method to explore and to visualize similarities or dissimilarities of data. It starts with a similarity matrix or dissimilarity matrix (= distance matrix) and assigns for each item a location in a low-dimensional space.
Characteristics
- Can handle a wide range of data (anything for which a dissimilarity matrix can be computed)
- Convenient visualization of high-dimensional data
- Values of the objects along a PCoA axis of interest may be correlated
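A minimal sketch of classical PCoA (double-centre the squared distance matrix, then eigendecompose), using toy data; in R this is what `stats::cmdscale` computes:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: pairwise Euclidean distances between 5 points in 4-D
rng = np.random.default_rng(1)
points = rng.normal(size=(5, 4))
D = squareform(pdist(points))        # 5x5 distance matrix

# PCoA starts from the distance matrix: double-centre the squared distances
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
B = -0.5 * J @ (D ** 2) @ J

# ... then eigendecompose and keep the leading axes
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Locations of the 5 items along the first two principal coordinate axes
coords = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0))
print(coords)
```

With a Euclidean distance matrix this reproduces PCA of the raw points; PCoA's advantage is that any dissimilarity measure can be plugged into `D`.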
PCA vs PCoA
Source: https://www.youtube.com/watch?v=HMOI_lkzW08&t=1s&ab_channel=StatQuestwithJoshStarmer
Interpreting Principal Components
- There is a Principal Component/Coordinate for each dimension
- If we have “n” variables, we have “n” Principal Components/Coordinates
- PC1/PCo1/Dim1 spans the direction of most variation
- PC2/PCo2/Dim2 spans the direction of the 2nd most variation
- ...
- PC“n”/PCo“n”/Dim“n” spans the direction of the “n”th most variation
- Each axis has an eigenvalue whose magnitude indicates the amount of variation captured in that axis
Redundancy Analysis (RDA)
Redundancy Analysis can analyse relationships between 2 tables of variables. It is very similar to PCA.
PCA vs RDA
RDA can be described as a constrained PCA. PCA, having no constraints, searches for any combination of variables that best explains sample composition; RDA restricts the search to the variation that a set of explanatory variables can account for.
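A hedged sketch of the RDA recipe (multivariate regression of the response table on the explanatory table, then PCA of the fitted values), with made-up data; in R this is typically done with `vegan::rda`:

```python
import numpy as np

# Toy data: 20 samples, 2 explanatory variables, 4 response variables
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))                                      # explanatory table
Y = X @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(20, 4))  # response table

# Center both tables
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# Step 1: multivariate least-squares regression of Y on X
B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
Y_fit = Xc @ B                           # part of Y explained by X

# Step 2: PCA (via SVD) of the fitted values gives the constrained RDA axes
U, s, Vt = np.linalg.svd(Y_fit, full_matrices=False)
site_scores = U * s                      # sample scores on the RDA axes
explained = s ** 2 / np.sum(Yc ** 2)     # share of total Y variance per axis
print(explained)
```

Because the axes are built only from `Y_fit`, each one is by construction a linear combination of the explanatory variables.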
Correspondence Analysis (CA)
Another technique similar to PCA is Correspondence Analysis, which can summarise a set of 2-dimensional (row-by-column) data.
PCA vs CA
While PCA decomposes relations between columns only, CA decomposes columns and rows simultaneously. CA is more suitable for categorical data than continuous data.
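One common way to compute CA is an SVD of the standardized residuals of the contingency table, which yields row and column coordinates simultaneously; a sketch with a made-up count table:

```python
import numpy as np

# Toy contingency table: counts for 4 row categories x 3 column categories
N = np.array([[20,  5,  2],
              [10, 15,  5],
              [ 3, 12, 18],
              [ 1,  4, 22]], dtype=float)

P = N / N.sum()                 # correspondence matrix (relative frequencies)
r = P.sum(axis=1)               # row masses
c = P.sum(axis=0)               # column masses

# Standardized residuals: deviations from row/column independence
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# SVD of the residuals decomposes rows AND columns at once
U, s, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * s) / np.sqrt(r)[:, None]     # principal row coordinates
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]  # principal column coordinates

# The squared singular values partition the total inertia (chi-square / n)
inertia = s ** 2
print(inertia)
```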
Canonical Correspondence Analysis (CCA)
Canonical Correspondence Analysis identifies patterns shared between 2 multivariate datasets and constructs sets of transformed variables by projecting the data onto these variables.
PCA vs CCA
PCA looks for patterns within a single dataset that represent the maximum variation in the data, whereas CCA looks for patterns shared between 2 datasets, describing their joint variability.
Non-metric MultiDimensional Scaling (NMDS)
- Non-Metric Multidimensional Scaling is fundamentally different from PCA and CA, and more robust: it produces an ordination based on a distance or dissimilarity matrix.
- The ordination is based on ranks rather than distances: instead of recording that object A is 2.1 units from object B and 4.4 units from object C, NMDS records only that object C is the first most distant from object A and object B is the second most distant.
- Avoids the assumption of linear relationships among variables
Stress
NMDS maximizes the rank-order correlation between the measured dissimilarities and the distances in ordination space. Points are moved iteratively to minimize "stress", a measure of the mismatch between these two kinds of distance.
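A sketch of NMDS on toy data using scikit-learn's non-metric MDS (in R, `vegan::metaMDS` is the usual tool); the Spearman correlation at the end checks the rank-order agreement that NMDS tries to maximize:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sklearn.manifold import MDS

# Toy data: 10 samples, 6 variables, reduced to a 2-D ordination
rng = np.random.default_rng(4)
data = rng.normal(size=(10, 6))
D = squareform(pdist(data))              # dissimilarity matrix

# Non-metric MDS preserves only the RANK ORDER of the dissimilarities;
# points are moved iteratively to minimize the stress value
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           n_init=10, random_state=0)
coords = nmds.fit_transform(D)

# Rank-order agreement between original and ordination-space distances
rho = spearmanr(squareform(D), pdist(coords))[0]
print(round(float(rho), 3), round(float(nmds.stress_), 4))
```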
Shepard Diagram
"Goodness-of-fit" is measured by stress, a measure of rank-order disagreement between the observed and fitted (in the reduced dimension) distances. Ideally, all points should fall on the monotonic "red" line. The Shepard diagram helps us decide how many dimensions to use when plotting our ordination results.
Interpreting NMDS plots
Like other ordination plots, you should qualitatively identify gradients corresponding to the underlying process
Differences from eigenanalysis:
- Does not extract components (it works from distances), so the axes themselves are meaningless
- The plot can be rotated, translated, or scaled as long as the relative distances are maintained
Scree Plot (For PCA, RDA, PCoA, CA or NMDS)
In multivariate statistics, a scree plot is a line plot of the eigenvalues of the factors or principal components in an analysis. The scree plot is used to determine the number of principal components to keep.
In this case, we can see that 2 Principal Components are enough to capture approximately 90% of the variance in the data across all dimensions.
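The scree-based decision can be sketched on simulated data (scikit-learn's PCA; the 90% threshold and the variable scales below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Toy data: two high-variance variables plus four low-variance ones,
# so most of the variance lives in a 2-dimensional subspace
X = rng.normal(size=(100, 6)) * np.array([3.0, 2.0, 0.3, 0.3, 0.3, 0.3])

pca = PCA().fit(X)
var = pca.explained_variance_ratio_      # one value per component (the scree)
cum = np.cumsum(var)

# Smallest number of components whose cumulative share reaches 90%
k = int(np.searchsorted(cum, 0.90) + 1)
print(np.round(var, 3), k)
```

Plotting `var` against the component index gives the scree plot itself; the "elbow" and the cumulative threshold usually point to the same cutoff.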