Dimensionality Reduction
Contents
- Overview
- Principal Components Analysis
- Partial Least Squares
- Factor Analysis
- Singular Value Decomposition
- Projection Pursuit
- Multi-Dimensional Scaling
- Linear Discriminant Analysis
- Partial Least Squares Discriminant Analysis
- Mixture Discriminant Analysis
- Quadratic Discriminant Analysis
- Regularized Discriminant Analysis
- Manifold Learning
- Random Projections
- Feature Agglomeration
- Non-Negative Matrix Factorization
- Independent Components Analysis
- Latent Dirichlet Allocation
- References
Overview
- Dimensionality reduction is the process of reducing the number of features under consideration to a set of principal features.
- Extract signal from raw, noisy features or compress number of features while maintaining structure of data
- The transformation can be linear (e.g., PCA) or non-linear (e.g., manifold learning)
- Advantages:
- Reduce time and storage space required
- Removal of multi-collinearity
- Easier to visualize data
- Disadvantages:
- Response variable is not always considered in feature extraction/dimensionality reduction
Principal Components Analysis
- PCA finds a low-dimensional representation of a dataset that contains as much of the variance of the dataset as possible
- It is an orthogonal linear transformation that converts a set of possibly correlated features to a smaller set of linearly uncorrelated variables (Principal Components), all orthogonal to each other. Uncorrelated => orthogonal because `Cov[X,Y]` is the dot product of the mean-centered `X` and `Y` vectors => `|X||Y|cos(theta) = 0` => `theta = 90` => the random variable vectors are perpendicular
- It re-expresses the dataset in the most meaningful basis that will filter out noise and reveal hidden structure (linear combination of original basis => change of basis to a new basis). PCs form the orthogonal basis set of the features
- PC directions are the directions in which the data is most variable
- Focus of PCA is on variance of features (common & unique) and it seeks to reproduce the total variance of the features; PCs best capture the variance of the features
- PCs can be used to reconstruct most of the variance of the original data
- Does not involve the response, if available
- PCs are unique only up to a sign flip (since direction & variance do not change with a sign flip)
- Features are always mean-centered before PCA is performed (i.e., column means of the design matrix = 0) since we are only interested in variance
- Better to normalize all features to mean = 0, stdev = 1 to prevent scaling from affecting PCA
- Sum of squares of the PCA loadings/weights for each component = 1 (to prevent arbitrarily large loadings leading to arbitrarily large variance)
- Each component has a `1 x p` loading vector (the direction of the component, which captures the highest variance in the dataset) and a `1 x n` score vector (the projections of the data along this direction)
- If the original data is projected along the direction of a loading vector, the projected values for the data are the corresponding scores
- PCs are the lines in `p`-dim space closest to the data (where close = mean squared Euclidean distance). The first `M` PCs and scores provide the best `M`-dim approximation to the `i`-th sample. When `M = p`, the approximation is exact
- The linear transformation is `PX = Y`, where the row vectors of `P` are the PCs of `X`
- Row vectors of `P` are the eigenvectors of `XX'/n` => this diagonalizes the covariance matrix of the new variables (see the sketch after this list)
- PCA's focus is on the diagonal terms of the correlation matrix
- Importance of a component is measured by the amount of variation of the component
- 1st PC accounts for as much variance as possible
- 2nd PC accounts for as much of the remaining variation as possible (subject to being orthogonal to the 1st component), and so on
- Number of PCs <= min(# samples - 1, # features)
- Kernel PCA is used to do non-linear dimensionality reduction
- Multiple correspondence analysis is an equivalent of PCA for categorical data
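A minimal NumPy sketch of the ideas above (mean-centering, eigendecomposition of the covariance matrix, loadings, and scores); the toy data, variable names, and sizes are illustrative assumptions, not from the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples x p features, with one correlated column
n, p = 200, 3
X = rng.normal(size=(n, p))
X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]    # introduce correlation

# 1. Mean-center (PCA only looks at variance)
Xc = X - X.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix
#    (the notes write XX'/n with X as p x n; here X is n x p, so use X'X/n)
cov = Xc.T @ Xc / n
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Loadings = eigenvectors; sum of squared loadings per component = 1
loadings = eigvecs                          # columns are the PCs
print((loadings ** 2).sum(axis=0))          # ~[1. 1. 1.]

# 4. Scores = projections of the centered data onto the loadings
scores = Xc @ loadings                      # n x p score matrix

# 5. Proportion of variance explained; the sign of each PC is arbitrary
print(eigvals / eigvals.sum())
```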
Applications
- Visualization of high dimensional data (during EDA)
- Discover structure / sub-groups
- PC regression - regress on the first `M` PCs instead of all `p` covariates (`M << p`); less noisy results since most of the signal is concentrated in the first few PCs (see the sketch below)
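A hedged sketch of principal components regression with scikit-learn; the pipeline, toy dataset, and choice of `M = 2` components are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))                      # p = 10 noisy covariates
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Regress on the first M principal components instead of all p covariates
M = 2
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))                              # R^2 on the training data
```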
Assumptions
- Non-parametric
- Linearity => change of basis for feature space
- Large variances => PCs with large variances represent important structure in feature space => high SNR
- Orthogonal components => Lower redundancy; easy way to calculate P
Solutions
- via Eigenvector decomposition of covariance matrix of features
- via Singular Value Decomposition of data matrix of features
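Both solution routes above should agree; a small NumPy check (on illustrative data) that the right singular vectors of the centered data matrix match the eigenvectors of its covariance matrix up to sign:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

# Route 1: eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# Route 2: right singular vectors of the centered data matrix
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Both routes give the same PCs, up to an arbitrary sign flip
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))   # True
```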
Advantages
- Minimize redundancy (based on covariance of feature matrix) by removing correlated features
- Increase signal-to-noise ratio (based on variance of feature matrix)
- In regression, if large number of possibly correlated features are present, run PCA first and then regress on the first few PCs
- If data is noisy, running PCA helps to concentrate more of the total variance of the data in the first few PCs, which is proportionally higher than the noise variance (which is the same as in the original data). This increases the signal-to-noise ratio of the first few PCs.
- Useful for visualization
Disadvantages
- Only removes 2nd order dependencies (which is sufficient for data sets with Gaussian distribution); does not remove higher-order dependencies (in non-Gaussian data); need to use kernel PCA or ICA for this
- Does not consider response while reducing dimension of features
- Sensitive to relative scaling of original features, so scale covariates to have `stdev = 1`
Implementations
- scikit-learn: Exact PCA, Incremental PCA, Kernel PCA, Sparse PCA, Mini Batch Sparse PCA (see the usage sketch below)
- R: `prcomp()`, `princomp()`, `psych::principal()`
- Spark: `RowMatrix.computePrincipalComponents()` (for matrices with `n >> p` only) & `PCA` for vectors
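A minimal usage sketch of the exact PCA implementation in scikit-learn listed above; the toy data, standardization step, and choice of 2 components are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))

# Standardize to mean = 0, stdev = 1 so scale does not dominate the PCs
Xs = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(Xs)           # n x 2 score matrix
print(pca.components_)                   # loading vectors (one per row)
print(pca.explained_variance_ratio_)     # variance explained by each PC
```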
Ref
- ISLR, Section 10.2 (PCA)
Partial Least Squares
- Supervised dimensionality reduction
- Looks for projections of the features having the highest covariance with the response (e.g., group labels)
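A hedged sketch using scikit-learn's `PLSRegression` (from its cross_decomposition module); the toy data and choice of 2 components are assumptions for illustration:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# PLS chooses projections of X with high covariance with the response
pls = PLSRegression(n_components=2)
pls.fit(X, y)
X_scores = pls.transform(X)              # supervised low-dim representation
print(X_scores.shape)                    # (100, 2)
```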
Factor Analysis
- Focus of Factor Analysis (FA) is on correlations
- FA generates new variables that are linear combinations of features and seeks to reproduce inter-correlation between variables
- Factors represent the common variance of features, excluding unique variance
- Factors are latent unmeasured variables that express themselves through their relationship with other measured variables
- FA is a measurement model of a latent feature (different from PCA which is just a linear combination of features)
- FA allows for investigation of relationship between variables
- As many factors as there are variables; select the factors with the largest eigenvalues as they explain more of the variance than the remaining factors
- Factor loadings measure strength of relationship between a measured variable and the unmeasured latent variable
- Focus is on off-diagonal terms of correlation matrix
- FA is used when the purpose is to detect data structure or for causal modeling
- Both PCA & FA estimate a Gaussian with a low-rank covariance matrix (with Gaussian priors on the latent variables). Non-Gaussian priors on the latent variables lead to Independent Component Analysis.
- Advantage over PCA: can model variance of every component independently which leads to better approximation of original data when components have different variances
- Components generated by FA have no general structure (i.e., don't have to be orthogonal like PCA)
- Assumptions:
- Multiple measured variables have similar patterns because of their correlation with an unmeasured latent variable
- Implementations
- scikit-learn: FactorAnalysis (see the sketch below)
- R: `stats::factanal()`, `psych::factor.pa()`, `FactoMineR` package, `GPArotation` package
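A minimal scikit-learn `FactorAnalysis` sketch (toy data and 2 latent factors assumed for illustration), showing the per-feature unique variances that PCA does not model:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)
# Two latent factors expressed through 6 measured variables plus unique noise
F = rng.normal(size=(300, 2))
W = rng.normal(size=(2, 6))
X = F @ W + rng.normal(scale=0.3, size=(300, 6))

fa = FactorAnalysis(n_components=2)
fa.fit(X)
print(fa.components_)         # factor loadings (2 x 6)
print(fa.noise_variance_)     # per-feature unique variance
```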
Independent Components Analysis
- Linear decomposition of a mixed signal into additive, statistically independent (non-Gaussian) components
- Typically used for separating mixed signals, not dimension reduction
- No noise term in model, so whitening must be applied
Implementations
- scikit-learn: Fast ICA
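A small `FastICA` sketch separating a linear mixture of two sources (a classic toy setup, assumed here for illustration); the whitening step is handled internally by FastICA:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)

# Two independent source signals and a linear mixture of them
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]
A = np.array([[1.0, 0.5], [0.5, 1.0]])    # mixing matrix
X = S @ A.T                               # observed mixed signals

# FastICA whitens the data internally before unmixing
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)              # recovered sources, up to order/scale
print(S_est.shape)                        # (2000, 2)
```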
Singular Value Decomposition
- `A = UDV'`, where `D` = diagonal matrix of singular values in descending order and `U`, `V` = orthonormal matrices whose columns are the left and right singular vectors, respectively
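A quick NumPy check of the decomposition on an illustrative matrix, plus a rank-2 truncation of the kind used for dimension reduction:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(6, 4))

# A = U D V' with singular values in descending order
U, d, Vt = np.linalg.svd(A, full_matrices=False)
print(d)                                           # descending singular values
print(np.allclose(A, U @ np.diag(d) @ Vt))         # True: exact reconstruction

# Keep only the two largest singular values for a low-rank approximation
A2 = U[:, :2] @ np.diag(d[:2]) @ Vt[:2]
```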
Implementations
- scikit-learn: TruncatedSVD
- Spark: `linalg.SingularValueDecomposition`, `RowMatrix.computeSVD()`
Projection Pursuit
- ?
Multi-Dimensional Scaling
- ?
Linear Discriminant Analysis
- Supervised dimensionality reduction (i.e., takes into account the response/label)
- Generalization of Fisher's linear discriminant
- Finds a linear combination of features that separates 2 or more classes (see the sketch after this list)
- Looks for projections having highest correlation with dummy variables encoding group labels
- Special case of Canonical Correlation Analysis
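A hedged scikit-learn sketch of LDA as supervised dimensionality reduction; the Iris data is an illustrative choice, and with 3 classes there are at most 2 discriminant directions:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project onto at most (n_classes - 1) = 2 discriminant directions
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)          # uses the labels, unlike PCA
print(X_lda.shape)                       # (150, 2)
```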
Partial Least Squares Discriminant Analysis
- ?
Mixture Discriminant Analysis
- ?
Quadratic Discriminant Analysis
- ?
Regularized Discriminant Analysis
- ?
Manifold Learning
- Non-linear mapping of features to lower-dimensional space
- Many approaches available such as Isomap, Local Linear Embedding, Hessian Eigenmapping, Spectral Embedding, Local Tangent Space Alignment, t-distributed Stochastic Neighbor Embedding, etc.
- These approaches construct a lower-dimensional representation of data using a cost function that retains local properties of the data
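A short scikit-learn sketch using two of the approaches listed above (Isomap and t-SNE) on the digits data; the dataset and parameter choices are illustrative assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, Isomap

X, _ = load_digits(return_X_y=True)      # 64-dim feature vectors

# Non-linear embeddings into 2-D that preserve local structure
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
print(X_iso.shape, X_tsne.shape)         # (1797, 2) (1797, 2)
```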
Random Projections
- ?
Feature Agglomeration
- ?
Non-Negative Matrix Factorization
- Implementations
- scikit-learn: NMF
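A minimal usage sketch of the scikit-learn NMF implementation listed above (non-negative toy data and parameter choices assumed for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(8)
X = rng.random((100, 20))                # NMF requires non-negative data

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                 # sample weights (100 x 5)
H = nmf.components_                      # non-negative parts (5 x 20)
print(nmf.reconstruction_err_)           # Frobenius norm of the residual
```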
Latent Dirichlet Allocation
- Implementations
- Spark:
LDA
References
- An Introduction to Statistical Learning with Applications in R
- A Tutorial on Principal Components Analysis, Jonathon Shlens, 2014
- A Tutorial on Principal Components Analysis, Lindsay I Smith, 2002
- Principal Components Analysis, Wikipedia, Jan 2017
- A Tutorial on Independent Components Analysis, Jonathon Shlens, 2014
- A Survey of Dimensionality Reduction Techniques, Imola K Fodor, LLNL technote
- Bayesian PCA, Christopher Bishop, NIPS 1998
- Quick-R: PCA & Factor Analysis in R
- Analysis Factor: A Short Introduction to Factor Analysis
- Analysis Factor: Difference between PCA & FA