Dimensionality Reduction

Overview

  • Dimensionality reduction is the process of reducing the number of features under consideration to a set of principal features.
  • Extract signal from raw, noisy features or compress number of features while maintaining structure of data
  • The transformation can be linear (e.g., PCA) or non-linear (e.g., manifold learning)
  • Advantages:
      • Reduce time and storage space required
      • Removal of multi-collinearity
      • Easier to visualize data
  • Disadvantages:
      • Response variable is not always considered in feature extraction/dimensionality reduction

Principal Components Analysis

  • PCA finds a low-dim representation of a dataset that contains as much of the variance of the dataset as possible

  • It is an orthogonal linear transformation that converts a set of possibly correlated features into a smaller set of linearly uncorrelated variables (Principal Components), all orthogonal to each other. Uncorrelated => orthogonal because Cov[X,Y] is the inner product of the (mean-centered) random variable vectors X and Y, i.e., |X||Y|cos(theta); Cov[X,Y] = 0 => cos(theta) = 0 => theta = 90 degrees => the random variable vectors are perpendicular.

  • It re-expresses the dataset in the most meaningful basis that will filter out noise and reveal hidden structure (linear combination of original basis => change of basis to a new basis). PCs form the orthogonal basis set of the features.

  • PC directions are the directions in which the data is most variable

  • Focus of PCA is on variance of features (common & unique) and seeks to reproduce total variance of features; PCs best capture the variance of the features

  • PCs can be used to re-construct most of the variance of the original data

  • Does not involve response, if available

  • PCs are unique only up to a sign flip (since neither direction nor variance changes under a sign flip)

  • Features are always mean-centered before PCA is performed (i.e., column-means of design matrix = 0) since we are only interested in variance

  • Better to normalize all features to mean=0, stdev=1 to prevent scaling from affecting PCA

  • Sum of squares of the loadings/weights for each component = 1 (to prevent arbitrarily large loadings from producing arbitrarily large variance)

  • Each component has a 1 x p loading vector (the direction in feature space along which the remaining variance is highest) and a 1 x n score vector (the projections of the samples along this direction)

  • If the original data are projected in the direction of a loading vector, the projected values are the corresponding scores.

  • PCs are the lines in p-dim space closest to the data (where close = mean squared Euclidean distance). The first M loading and score vectors provide the best M-dim approximation to each sample. When M = p, the approximation is exact (see the sketch below).
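
A minimal NumPy sketch of this reconstruction property; the data size, seed, and choices of M are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # n = 100 samples, p = 5 features
Xc = X - X.mean(axis=0)                  # mean-center the columns

# Loading vectors = right singular vectors of the centered data matrix
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

for M in (2, 5):                         # M = p = 5 gives an exact reconstruction
    scores = Xc @ Vt[:M].T               # n x M scores
    X_hat = scores @ Vt[:M]              # rank-M approximation of Xc
    print(M, np.mean((Xc - X_hat) ** 2)) # mean squared error -> 0 when M = p
```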

  • The linear transformation is PX = Y where row vectors of P are PCs of X

  • Row vectors of P are the eigenvectors of the covariance matrix XX'/n (X mean-centered) => this choice diagonalizes the covariance matrix of the new variables Y

  • PCA's focus is on diagonal terms of correlation matrix

  • Importance of a component is measured by the amount of variation of the component

  • 1st PC accounts for as much variance as possible

  • 2nd PC accounts for as much of the remaining variation as possible (subject to being orthogonal to the 1st component) and so on

  • Number of PCs <= min(samples - 1, # features)

  • Kernel PCA is used to do non-linear dimensionality reduction

  • Multiple correspondence analysis is an equivalent of PCA for categorical data
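
For the kernel PCA point above, a hedged scikit-learn sketch; the concentric-circles dataset and the RBF kernel with gamma=10 are illustrative choices, not prescribed by these notes:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: not linearly separable, so linear PCA cannot unfold them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)   # linear projection: circles stay concentric
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)
# In the kernel PCA projection the two circles become separable along the first component
```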

Applications

  • Visualization of high dimensional data (during EDA)
  • Discover structure / sub-groups
  • PC regression - regress on the first M PCs instead of all p covariates (M << p); less noisy results since most of the signal is concentrated in the first few PCs
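
A hedged sketch of PC regression as a scikit-learn pipeline; the synthetic dataset and M = 5 components are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# p = 50 noisy, possibly correlated features; regress on the first M = 5 PCs instead
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0)

pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
print(cross_val_score(pcr, X, y, cv=5).mean())   # mean cross-validated R^2
```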

Assumptions

  • Non-parametric
  • Linearity => change of basis for feature space
  • Large variances => PCs with large variances represent important structure in feature space => high SNR
  • Orthogonal components => Lower redundancy; easy way to calculate P

Solutions

  • via Eigenvector decomposition of covariance matrix of features
  • via Singular Value Decomposition of data matrix of features
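
A minimal NumPy sketch of both solutions on a random dataset (sizes and seed are illustrative), checking that they agree up to sign and that the PCs diagonalize the covariance of the transformed variables:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                 # n = 200 samples (rows), p = 4 features
Xc = X - X.mean(axis=0)                       # mean-center the features
n = Xc.shape[0]

# 1) Eigendecomposition of the feature covariance matrix
C = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(C)              # ascending eigenvalues
P_eig = evecs[:, np.argsort(evals)[::-1]].T   # rows = PCs, in descending variance order

# 2) SVD of the centered data matrix: right singular vectors are the PCs
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
P_svd = Vt

print(np.allclose(np.abs(P_eig), np.abs(P_svd)))               # True: PCs unique up to sign
Y = Xc @ P_svd.T                                               # scores
print(np.allclose(np.cov(Y.T, bias=True), np.diag(d**2 / n)))  # covariance of Y is diagonal
```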

Advantages

  • Minimize redundancy (based on covariance of feature matrix) by removing correlated features
  • Increase signal-to-noise ratio (based on variance of feature matrix)
  • In regression, if large number of possibly correlated features are present, run PCA first and then regress on the first few PCs
  • If data is noisy, running PCA concentrates more of the total variance of the data in the first few PCs, proportionally more than the noise variance (which stays the same as in the original data). This increases the signal-to-noise ratio of the first few PCs.
  • Useful for visualization

Disadvantages

  • Only removes 2nd order dependencies (which is sufficient for data sets with Gaussian distribution); does not remove higher-order dependencies (in non-Gaussian data); need to use kernel PCA or ICA for this
  • Does not consider response while reducing dimension of features
  • Sensitive to relative scaling of original features, so scale covariates to have stdev = 1

Implementations

  • scikit-learn: Exact PCA, Incremental PCA, Kernel PCA, Sparse PCA, Mini Batch Sparse PCA
  • R: prcomp(), princomp(), psych.principal()
  • Spark: RowMatrix.computePrincipalComponents() (for matrix with n >> p only) & PCA for vectors
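
A hedged usage sketch for scikit-learn's exact PCA; the iris dataset and 2 components are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
Xs = StandardScaler().fit_transform(X)        # mean = 0, stdev = 1 per feature

pca = PCA(n_components=2)
scores = pca.fit_transform(Xs)                # n x 2 score matrix
print(pca.components_)                        # 2 x p loading vectors
print(pca.explained_variance_ratio_)          # fraction of variance explained by each PC
```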

Ref

  • ISLR, Section 10.2 (PCA)

Partial Least Squares

  • Supervised dimensionality reduction
  • Looks for projections of the features having the highest covariance with the response (e.g., dummy-coded group labels in classification)
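
A hedged sketch with scikit-learn's PLSRegression; the synthetic dataset and 2 components are illustrative assumptions:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0)

pls = PLSRegression(n_components=2)
x_scores, y_scores = pls.fit_transform(X, y)   # supervised projections of X and y
print(pls.score(X, y))                         # R^2 of the fitted PLS regression
```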

Factor Analysis

  • Focus of Factor Analysis (FA) is on correlations
  • FA generates new variables that are linear combinations of features and seeks to reproduce inter-correlation between variables
  • Factors represent the common variance of features, excluding unique variance
  • Factors are latent unmeasured variables that express themselves through their relationship with other measured variables
  • FA is a measurement model of a latent feature (different from PCA which is just a linear combination of features)
  • FA allows for investigation of relationship between variables
  • As many factors as there are variables; select the factors with the largest eigenvalues as they explain more of the variance than the remaining factors
  • Factor loadings measure strength of relationship between a measured variable and the unmeasured latent variable
  • Focus is on off-diagonal terms of correlation matrix
  • FA is used when the purpose is to detect data structure or for causal modeling
  • Both PCA & FA estimate a Gaussian with a low-rank covariance matrix (with Gaussian priors on the latent variables). Non-Gaussian priors on the latent variables lead to Independent Component Analysis.
  • Advantage over PCA: can model variance of every component independently which leads to better approximation of original data when components have different variances
  • Components generated by FA have no general structure (i.e., they need not be orthogonal, unlike PCs)
  • Assumptions:
      • Multiple measured variables have similar patterns because of their correlation with an unmeasured latent variable
  • Implementations
      • scikit-learn: FactorAnalysis
      • R: stats.factanal, psych.factor.pa, FactoMineR package, GPARotation package
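
A hedged scikit-learn sketch; the iris data and 2 factors are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data
fa = FactorAnalysis(n_components=2, random_state=0)
factors = fa.fit_transform(X)     # n x 2 latent factor scores
print(fa.components_)             # factor loadings (2 x p)
print(fa.noise_variance_)         # per-feature unique variance, modelled separately from the factors
```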

Independent Components Analysis

  • Decomposes a mixed multivariate signal into additive, statistically independent components; the standard model assumes a linear mixture of non-Gaussian sources
  • Typically used for separating mixed signals, not dimension reduction
  • No noise term in model, so whitening must be applied

Implementations

  • scikit-learn: Fast ICA
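
A hedged blind-source-separation sketch with FastICA; the two toy sources and the mixing matrix are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent source signals
S += 0.05 * rng.normal(size=S.shape)               # small additive noise
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # mixing matrix
X = S @ A.T                                        # observed mixed signals

ica = FastICA(n_components=2, random_state=0)      # whitening is applied internally
S_hat = ica.fit_transform(X)                       # recovered sources (up to order and scale)
```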

Singular Value Decomposition

  • A = UDV', where D = diagonal matrix of singular values in descending order; U and V are orthogonal matrices whose columns are the left and right singular vectors, respectively
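
A minimal NumPy check of the decomposition above (matrix size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, d, Vt = np.linalg.svd(A, full_matrices=False)   # d: singular values, descending
print(np.allclose(A, U @ np.diag(d) @ Vt))         # True: A = U D V'
print(np.allclose(U.T @ U, np.eye(4)))             # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(4)))           # columns of V are orthonormal
```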

Implementations

  • scikit-learn: TruncatedSVD
  • Spark: linalg.SingularValueDecomposition, RowMatrix.computeSVD()

Projection Pursuit

  • ?

Multi-Dimensional Scaling

  • ?

Linear Discriminant Analysis

  • Supervised dimensionality reduction (i.e., takes into account the response/label)
  • A generalization of Fisher's linear discriminant
  • Finds a linear combination of features to separate 2 or more classes
  • Looks for projections having highest correlation with dummy variables encoding group labels
  • Special case of Canonical Correlation Analysis
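
A hedged scikit-learn sketch of LDA as supervised dimensionality reduction; iris is an illustrative dataset, and with 3 classes there are at most 2 discriminant directions:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # <= n_classes - 1 components
X_lda = lda.fit_transform(X, y)                    # supervised 2-D projection using the labels
print(lda.explained_variance_ratio_)
```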

Partial Least Squares Discriminant Analysis

  • ?

Mixture Discriminant Analysis

  • ?

Quadratic Discriminant Analysis

  • ?

Regularized Discriminant Analysis

  • ?

Manifold Learning

  • Non-linear mapping of features to lower-dimensional space
  • Many approaches available such as Isomap, Local Linear Embedding, Hessian Eigenmapping, Spectral Embedding, Local Tangent Space Alignment, t-distributed Stochastic Neighbor Embedding, etc.
  • These approaches construct a lower-dimensional representation of data using a cost function that retains local properties of the data
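
A hedged sketch of two such embeddings in scikit-learn; the digits dataset, neighbour count, and perplexity are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, TSNE

X, y = load_digits(return_X_y=True)    # 64-dimensional digit images

# Isomap: preserves approximate geodesic distances on the manifold
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# t-SNE: preserves local neighbourhood structure (mainly useful for visualization)
X_tsne = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
```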

Random Projections

  • ?

Feature Agglomeration

  • ?

Non-Negative Matrix Factorization

  • Implementations
      • scikit-learn: NMF
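
A hedged sketch with scikit-learn's NMF; the random non-negative matrix and the init/iteration settings are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 20)))       # NMF requires a non-negative data matrix

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                     # n x k coefficients, W >= 0
H = nmf.components_                          # k x p components, H >= 0
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # relative reconstruction error
```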

Latent Dirichlet Allocation

  • Implementations
      • Spark: LDA
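
The wiki lists Spark's LDA; to keep the examples in one language, here is a hedged sketch using scikit-learn's LatentDirichletAllocation on an illustrative toy corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats purr and sleep all day",
    "dogs bark and chase the ball",
    "cats and dogs are popular pets",
    "stocks fell on weak earnings",
    "markets rallied after the earnings report",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)   # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)      # 2 latent topics
doc_topics = lda.fit_transform(counts)     # per-document topic mixture (n_docs x 2)
print(doc_topics.round(2))
```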
