Dimensionality Reduction

Overview

  • Dimensionality reduction is the process of reducing the number of features under consideration to a set of principal features.
  • Extract signal from raw, noisy features or compress number of features while maintaining structure of data
  • The transformation can be linear (e.g., PCA) or non-linear (e.g., manifold learning)
  • Advantages:
      • Reduce time and storage space required
      • Removal of multi-collinearity
      • Easier to visualize data
  • Disadvantages:
      • Response variable is not always considered in feature extraction/dimensionality reduction

Principal Components Analysis

  • PCA finds a low-dim representation of a dataset that contains as much of the variance of the dataset as possible

  • It is an orthogonal linear transformation that converts a set of possibly correlated features into a smaller set of linearly uncorrelated variables (Principal Components), all orthogonal to each other. Uncorrelated => orthogonal because Cov[X,Y] is the inner product of the (mean-centered) random variable vectors X and Y, i.e., |X||Y|cos(theta); Cov[X,Y] = 0 => cos(theta) = 0 => theta = 90 degrees => the random variable vectors are perpendicular.

  • It re-expresses the dataset in the most meaningful basis that will filter out noise and reveal hidden structure (linear combination of original basis => change of basis to a new basis). PCs form the orthogonal basis set of the features.

  • PC directions are the directions in which the data is most variable

  • Focus of PCA is on variance of features (common & unique) and seeks to reproduce total variance of features; PCs best capture the variance of the features

  • PCs can be used to re-construct most of the variance of the original data

  • Does not involve response, if available

  • PCs are unique only up to a sign flip (since neither direction nor variance changes under a sign flip)

  • Features are always mean-centered before PCA is performed (i.e., column-means of design matrix = 0) since we are only interested in variance

  • Better to normalize all features to mean=0, stdev=1 to prevent scaling from affecting PCA

  • Sum of squares of the loadings/weights for each component = 1 (to prevent arbitrarily large loadings from producing arbitrarily large variance)

  • Each component has a 1 x p loading vector (the direction in feature space along which the remaining variance is highest) and a 1 x n score vector (the projections of the samples along this direction)

  • If the original data are projected in the direction of a loading vector, the projected values are the corresponding scores.

  • PCs are the lines in p-dim space closest to the data (where close = mean squared Euclidean distance). The first M loading and score vectors provide the best M-dim approximation to each sample. When M = p, the approximation is exact (see the sketch below).
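
A minimal NumPy sketch of this reconstruction property; the data size, seed, and choices of M are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # n = 100 samples, p = 5 features
Xc = X - X.mean(axis=0)                  # mean-center the columns

# Loading vectors = right singular vectors of the centered data matrix
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

for M in (2, 5):                         # M = p = 5 gives an exact reconstruction
    scores = Xc @ Vt[:M].T               # n x M scores
    X_hat = scores @ Vt[:M]              # rank-M approximation of Xc
    print(M, np.mean((Xc - X_hat) ** 2)) # mean squared error -> 0 when M = p
```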

  • The linear transformation is PX = Y where row vectors of P are PCs of X

  • Row vectors of P are the eigenvectors of the covariance matrix XX'/n (X mean-centered) => this choice diagonalizes the covariance matrix of the new variables Y

  • PCA's focus is on diagonal terms of correlation matrix

  • Importance of a component is measured by the amount of variation of the component

  • 1st PC accounts for as much variance as possible

  • 2nd PC accounts for as much of the remaining variation as possible (subject to being orthogonal to the 1st component) and so on

  • Number of PCs <= min(samples - 1, # features)

  • Kernel PCA is used to do non-linear dimensionality reduction

  • Multiple correspondence analysis is an equivalent of PCA for categorical data
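
For the kernel PCA point above, a hedged scikit-learn sketch; the concentric-circles dataset and the RBF kernel with gamma=10 are illustrative choices, not prescribed by these notes:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: not linearly separable, so linear PCA cannot unfold them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)   # linear projection: circles stay concentric
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)
# In the kernel PCA projection the two circles become separable along the first component
```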

Applications

  • Visualization of high dimensional data (during EDA)
  • Discover structure / sub-groups
  • PC regression - regress on the first M PCs instead of all p covariates (M << p); less noisy results since most of the signal is concentrated in the first few PCs
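
A hedged sketch of PC regression as a scikit-learn pipeline; the synthetic dataset and M = 5 components are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# p = 50 noisy, possibly correlated features; regress on the first M = 5 PCs instead
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0)

pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
print(cross_val_score(pcr, X, y, cv=5).mean())   # mean cross-validated R^2
```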

Assumptions

  • Non-parametric
  • Linearity => change of basis for feature space
  • Large variances => PCs with large variances represent important structure in feature space => high SNR
  • Orthogonal components => Lower redundancy; easy way to calculate P

Solutions

  • via Eigenvector decomposition of covariance matrix of features
  • via Singular Value Decomposition of data matrix of features
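
A minimal NumPy sketch of both solutions on a random dataset (sizes and seed are illustrative), checking that they agree up to sign and that the PCs diagonalize the covariance of the transformed variables:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                 # n = 200 samples (rows), p = 4 features
Xc = X - X.mean(axis=0)                       # mean-center the features
n = Xc.shape[0]

# 1) Eigendecomposition of the feature covariance matrix
C = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(C)              # ascending eigenvalues
P_eig = evecs[:, np.argsort(evals)[::-1]].T   # rows = PCs, in descending variance order

# 2) SVD of the centered data matrix: right singular vectors are the PCs
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
P_svd = Vt

print(np.allclose(np.abs(P_eig), np.abs(P_svd)))               # True: PCs unique up to sign
Y = Xc @ P_svd.T                                               # scores
print(np.allclose(np.cov(Y.T, bias=True), np.diag(d**2 / n)))  # covariance of Y is diagonal
```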

Advantages

  • Minimize redundancy (based on covariance of feature matrix) by removing correlated features
  • Increase signal-to-noise ratio (based on variance of feature matrix)
  • In regression, if large number of possibly correlated features are present, run PCA first and then regress on the first few PCs
  • If data is noisy, running PCA concentrates more of the total variance of the data in the first few PCs, proportionally more than the noise variance (which stays the same as in the original data). This increases the signal-to-noise ratio of the first few PCs.
  • Useful for visualization

Disadvantages

  • Only removes 2nd order dependencies (which is sufficient for data sets with Gaussian distribution); does not remove higher-order dependencies (in non-Gaussian data); need to use kernel PCA or ICA for this
  • Does not consider response while reducing dimension of features
  • Sensitive to relative scaling of original features, so scale covariates to have stdev = 1

Implementations

  • scikit-learn: Exact PCA, Incremental PCA, Kernel PCA, Sparse PCA, Mini Batch Sparse PCA
  • R: prcomp(), princomp(), psych.principal()
  • Spark: RowMatrix.computePrincipalComponents() (for matrix with n >> p only) & PCA for vectors
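
A hedged usage sketch for scikit-learn's exact PCA; the iris dataset and 2 components are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
Xs = StandardScaler().fit_transform(X)        # mean = 0, stdev = 1 per feature

pca = PCA(n_components=2)
scores = pca.fit_transform(Xs)                # n x 2 score matrix
print(pca.components_)                        # 2 x p loading vectors
print(pca.explained_variance_ratio_)          # fraction of variance explained by each PC
```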

Ref

  • ISLR, Section 10.2 (PCA)

Partial Least Squares

  • Supervised dimensionality reduction
  • Looks for projections of the features having the highest covariance with the response (e.g., dummy-coded group labels in classification)
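
A hedged sketch with scikit-learn's PLSRegression; the synthetic dataset and 2 components are illustrative assumptions:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0)

pls = PLSRegression(n_components=2)
x_scores, y_scores = pls.fit_transform(X, y)   # supervised projections of X and y
print(pls.score(X, y))                         # R^2 of the fitted PLS regression
```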

Factor Analysis

  • Focus of Factor Analysis (FA) is on correlations
  • FA generates new variables that are linear combinations of features and seeks to reproduce inter-correlation between variables
  • Factors represent the common variance of features, excluding unique variance
  • Factors are latent unmeasured variables that express themselves through their relationship with other measured variables
  • FA is a measurement model of a latent feature (different from PCA which is just a linear combination of features)
  • FA allows for investigation of relationship between variables
  • As many factors as there are variables; select the factors with the largest eigenvalues as they explain more of the variance than the remaining factors
  • Factor loadings measure strength of relationship between a measured variable and the unmeasured latent variable
  • Focus is on off-diagonal terms of correlation matrix
  • FA is used when the purpose is to detect data structure or for causal modeling
  • Both PCA & FA estimate a Gaussian with a low-rank covariance matrix (with Gaussian priors on the latent variables). Non-Gaussian priors on the latent variables lead to Independent Component Analysis.
  • Advantage over PCA: can model variance of every component independently which leads to better approximation of original data when components have different variances
  • Components generated by FA have no general structure (i.e., they need not be orthogonal, unlike PCs)
  • Assumptions:
      • Multiple measured variables have similar patterns because of their correlation with an unmeasured latent variable
  • Implementations
      • scikit-learn: FactorAnalysis
      • R: stats.factanal, psych.factor.pa, FactoMineR package, GPARotation package
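
A hedged scikit-learn sketch; the iris data and 2 factors are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data
fa = FactorAnalysis(n_components=2, random_state=0)
factors = fa.fit_transform(X)     # n x 2 latent factor scores
print(fa.components_)             # factor loadings (2 x p)
print(fa.noise_variance_)         # per-feature unique variance, modelled separately from the factors
```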

Independent Components Analysis

  • Decomposes a mixed multivariate signal into additive, statistically independent components; the standard model assumes a linear mixture of non-Gaussian sources
  • Typically used for separating mixed signals, not dimension reduction
  • No noise term in model, so whitening must be applied

Implementations

  • scikit-learn: Fast ICA
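
A hedged blind-source-separation sketch with FastICA; the two toy sources and the mixing matrix are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent source signals
S += 0.05 * rng.normal(size=S.shape)               # small additive noise
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # mixing matrix
X = S @ A.T                                        # observed mixed signals

ica = FastICA(n_components=2, random_state=0)      # whitening is applied internally
S_hat = ica.fit_transform(X)                       # recovered sources (up to order and scale)
```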

Singular Value Decomposition

  • A = UDV', where D = diagonal matrix of singular values in descending order; U and V are orthogonal matrices whose columns are the left and right singular vectors, respectively
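
A minimal NumPy check of the decomposition above (matrix size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, d, Vt = np.linalg.svd(A, full_matrices=False)   # d: singular values, descending
print(np.allclose(A, U @ np.diag(d) @ Vt))         # True: A = U D V'
print(np.allclose(U.T @ U, np.eye(4)))             # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(4)))           # columns of V are orthonormal
```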

Implementations

  • scikit-learn: TruncatedSVD
  • Spark: linalg.SingularValueDecomposition, RowMatrix.computeSVD()

Projection Pursuit

  • ?

Multi-Dimensional Scaling

  • ?

Linear Discriminant Analysis

  • Supervised dimensionality reduction (i.e., takes into account the response/label)
  • A generalization of Fisher's linear discriminant
  • Finds a linear combination of features to separate 2 or more classes
  • Looks for projections having highest correlation with dummy variables encoding group labels
  • Special case of Canonical Correlation Analysis
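
A hedged scikit-learn sketch of LDA as supervised dimensionality reduction; iris is an illustrative dataset, and with 3 classes there are at most 2 discriminant directions:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # <= n_classes - 1 components
X_lda = lda.fit_transform(X, y)                    # supervised 2-D projection using the labels
print(lda.explained_variance_ratio_)
```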

Partial Least Squares Discriminant Analysis

  • ?

Mixture Discriminant Analysis

  • ?

Quadratic Discriminant Analysis

  • ?

Regularized Discriminant Analysis

  • ?

Manifold Learning

  • Non-linear mapping of features to lower-dimensional space
  • Many approaches available such as Isomap, Local Linear Embedding, Hessian Eigenmapping, Spectral Embedding, Local Tangent Space Alignment, t-distributed Stochastic Neighbor Embedding, etc.
  • These approaches construct a lower-dimensional representation of data using a cost function that retains local properties of the data
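
A hedged sketch of two such embeddings in scikit-learn; the digits dataset, neighbour count, and perplexity are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, TSNE

X, y = load_digits(return_X_y=True)    # 64-dimensional digit images

# Isomap: preserves approximate geodesic distances on the manifold
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# t-SNE: preserves local neighbourhood structure (mainly useful for visualization)
X_tsne = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
```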

Random Projections

  • ?

Feature Agglomeration

  • ?

Non-Negative Matrix Factorization

  • Implementations
      • scikit-learn: NMF
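
A hedged sketch with scikit-learn's NMF; the random non-negative matrix and the init/iteration settings are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 20)))       # NMF requires a non-negative data matrix

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                     # n x k coefficients, W >= 0
H = nmf.components_                          # k x p components, H >= 0
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # relative reconstruction error
```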

Latent Dirichlet Allocation

  • Implementations
      • Spark: LDA
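
The wiki lists Spark's LDA; to keep the examples in one language, here is a hedged sketch using scikit-learn's LatentDirichletAllocation on an illustrative toy corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats purr and sleep all day",
    "dogs bark and chase the ball",
    "cats and dogs are popular pets",
    "stocks fell on weak earnings",
    "markets rallied after the earnings report",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)   # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)      # 2 latent topics
doc_topics = lda.fit_transform(counts)     # per-document topic mixture (n_docs x 2)
print(doc_topics.round(2))
```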
