# MoMA: Modern Multivariate Analysis in R
## Background
Multivariate analysis -- the study of finding meaningful patterns in datasets -- is a key technique in any data scientist's toolbox. Beyond its use for Exploratory Data Analysis ("EDA"), multivariate analysis also allows for principled Data-Driven Discovery: finding meaningful, actionable, and reproducible structure in large data sets. Classical techniques for multivariate analysis have proven immensely successful throughout their history, but modern Data-Driven Discovery requires new techniques to account for the specific complexities of modern data. We propose a unified framework for Modern Multivariate Analysis ("MoMA"), which will provide a flexible baseline for future research in multivariate analysis. Even more importantly, we anticipate that an easy-to-use `R` package will increase adoption of these powerful new models by end users and, in conjunction with `R`'s rich graphics libraries, position `R` as the leading platform for modern exploratory data analysis and data-driven discovery.
Multivariate analysis techniques date back to the earliest days of statistics, pre-dating other foundational concepts like hypothesis testing by several decades. Classical techniques such as Principal Components Analysis ("PCA") [1, 2], Partial Least Squares ("PLS"), Canonical Correlation Analysis ("CCA") [3], and Linear Discriminant Analysis ("LDA"), have a long and distinguished history of use in statistics and are still among the most widely used methods for EDA. Their importance is reflected in the CRAN Task View dedicated to Multivariate Analysis [4], as well as the specialized implementations available for a range of application areas. Somewhat surprisingly, each of these techniques can be interpreted as a variant of the well-studied eigendecomposition problem, allowing statisticians to build upon a rich mathematical and computational literature.
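To make the eigendecomposition connection concrete, the following minimal sketch (illustrative only, not part of the proposed package) recovers the PCA loadings from an eigendecomposition of the sample covariance matrix and checks them against `stats::prcomp`:

```r
# PCA as an eigendecomposition: the loadings returned by stats::prcomp are the
# eigenvectors of the sample covariance matrix (up to column signs).
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)

eig <- eigen(cov(X), symmetric = TRUE)  # eigendecomposition of cov(X)
pca <- prcomp(X, center = TRUE)         # classical PCA via the SVD

# Compare loadings column-by-column, ignoring the arbitrary sign of each.
all.equal(abs(unname(eig$vectors)), abs(unname(pca$rotation)))
# TRUE (up to numerical tolerance)
```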
In the early 2000s, researchers noted that naive extensions of classical multivariate techniques to the high-dimensional setting produced unsatisfactory results, a finding later confirmed by advances in random matrix theory [5]. In response to these findings, multivariate analysis experienced a renaissance as researchers developed a wide array of new techniques to incorporate sparsity, smoothness, and other structure into classical techniques [6,7,8,9,10,11,12,13,14 among many others], resulting in a rich literature on "modern multivariate analysis." Around the same time, theoretical advances showed that these techniques avoided many of the pitfalls associated with naive extensions [15,16,17,18,19,20].
While this literature is vast, it relies on a single basic principle: for multivariate analysis to succeed, it is essential to adapt classical techniques to account for the known characteristics and complexities of the dataset at hand. For example, a neuroscientist investigating the brain's response to an external stimulus may expect a response which is simultaneously spatially smooth and sparse: spatially smooth because the brain processes related stimuli in well-localized areas (e.g., the visual cortex) and sparse because not all regions of the brain are used to respond to a given stimulus. Alternatively, a statistical factor model used to understand financial returns may be significantly improved by incorporating known industry sector data, motivating a form of group sparsity. A sociologist studying how pollution leads to higher levels of respiratory illness may combine spatial smoothness and sparsity (indicating "pockets" of activity) with a non-negativity constraint, since pollution can only increase, never decrease, rates of illness.
To incorporate these different forms of prior knowledge into multivariate analysis, a wide variety of algorithms and approaches have been proposed. In 2013, Allen proposed a general framework that unified existing techniques for "modern" PCA and introduced a number of novel extensions [21]. The mentors' recently developed `MoMA` algorithm [42] builds on this work, allowing more forms of regularization and structure, as well as supporting more forms of multivariate analysis.
In this project, we propose a new package to make modern multivariate analysis available to a wide audience. The proposed package will allow for fitting PCA, PLS, CCA, and LDA with all of the modern "bells and whistles": sparsity, smoothness, ordered and unordered fusion, orthogonalization with respect to arbitrary bases, and non-negativity constraints. Uniting this wide literature under a single umbrella using the `MoMA` algorithm will provide a unified and flexible platform for data-driven discovery in `R`.
## Related work
Plain PCA is implemented in base `R`'s `stats::prcomp` function. A number of packages which perform either sparse [34] or smooth [35] PCA can be found on CRAN. The `spls` [36] and `RGCCA` [37] packages implement sparse, but not smooth, PLS and CCA respectively.
The `PMA` package [38], based on [12], is the most similar in spirit to our package, providing a unified approach to PCA and CCA, as well as multi-way CCA, but it does not implement PLS or LDA. More importantly, it does not implement the smoothing options, orthogonalization, or unordered fusion (clustering) penalties that the `MoMA` algorithm provides.
A Matlab implementation of the basic SFPCA algorithm is available online [39] but does not implement the full `MoMA` algorithm.
## Coding Project Details
- **Core Algorithm** (Estimated time: 5 Weeks)

  The iterative `MoMA` algorithm forms the core of the package and will be implemented in `C++` using `Rcpp` and `RcppArmadillo` for speed [22, 23, 24]. The algorithm is based on repeated application of the ISTA/FISTA (proximal gradient) scheme of Beck and Teboulle [25]. In addition to the options discussed in [21], the package will include a generalized algorithm with options for:

  - L1 ("Lasso") [26], SCAD [27], and MCP [28] sparsity-inducing penalization
  - Group sparsity [29]
  - Ordered fusion using the "fused lasso" [30]
  - Unordered fusion using a convex clustering penalty [31, 32]
  - Smoothing with respect to an arbitrary basis
  - Orthogonalization with respect to an arbitrary basis
  - Non-negativity constraints

  Each of these options may be applied to the left singular vectors, the right singular vectors, or both. A minimal sketch of the rank-one special case appears after this list.
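  To fix ideas, here is a minimal `R` sketch of the simplest special case: a rank-one fit with L1 penalties only and no smoothing. It is illustrative only, not the package implementation; with smoothing matrices present, the closed-form block updates below would be replaced by ISTA/FISTA proximal-gradient iterations [25].

  ```r
  # Soft-thresholding: the proximal operator of the L1 penalty.
  soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

  # Rank-one sparse SVD by alternating penalized updates. With only L1
  # penalties and unit-ball constraints, each block update has a closed form;
  # the full MoMA/SFPCA algorithm generalizes this step to the other penalties
  # above via proximal-gradient iterations.
  sparse_rank1 <- function(X, lambda_u = 0, lambda_v = 0,
                           max_iter = 200L, tol = 1e-7) {
    v <- svd(X, nu = 0, nv = 1)$v[, 1]  # warm start from the leading SVD factor
    u <- numeric(nrow(X))
    for (iter in seq_len(max_iter)) {
      u_old <- u
      u <- drop(soft_threshold(X %*% v, lambda_u))
      if (any(u != 0)) u <- u / sqrt(sum(u^2))   # renormalize to the unit ball
      v <- drop(soft_threshold(crossprod(X, u), lambda_v))
      if (any(v != 0)) v <- v / sqrt(sum(v^2))
      if (sqrt(sum((u - u_old)^2)) < tol) break  # u has stabilized; stop
    }
    list(u = u, v = v, d = drop(crossprod(u, X %*% v)))
  }
  ```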
- **User Interface and Tuning Parameter Selection** (Estimated time: 5 Weeks)

  Once the `MoMA` algorithm is implemented, wrapper functions for PCA, PLS, CCA, and LDA will be implemented. Additionally, helper functions to select regularization parameters (via cross-validation, BIC, or extended BIC [33]) will be provided; a sketch of one such selector appears below.
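  As an illustration of the intended workflow, regularization parameters could be chosen by minimizing a BIC-type criterion over a grid. The function below is a hypothetical sketch built on the `sparse_rank1` prototype above, not the final package interface:

  ```r
  # Hypothetical sketch: choose a common lambda for both factors by BIC, with
  # degrees of freedom approximated by the number of nonzero loadings.
  select_lambda_bic <- function(X, lambdas) {
    n <- nrow(X); p <- ncol(X)
    bic <- vapply(lambdas, function(lam) {
      fit <- sparse_rank1(X, lambda_u = lam, lambda_v = lam)
      resid <- X - fit$d * tcrossprod(fit$u, fit$v)  # rank-one residual matrix
      df <- sum(fit$u != 0) + sum(fit$v != 0)
      n * p * log(mean(resid^2)) + log(n * p) * df   # BIC-style criterion
    }, numeric(1))
    lambdas[which.min(bic)]
  }

  # Usage: best_lambda <- select_lambda_bic(X, lambdas = seq(0, 2, by = 0.1))
  ```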
- **Visualization and Documentation** (Estimated time: 3 Weeks)

  Finally, the package requires extensive documentation and visualization tools to be helpful. We will implement some basic all-purpose visualizations and write two vignettes: one with a detailed example application to neuroscience, loosely based on Section 6 of [21], and one describing the implementation details of the package.
Future development (not in scope for GSoC-2018) may incorporate the Generalized PCA algorithm, with natural extensions to PLS and CCA [41]; Multi-Way CCA; missing value support (including imputation); or more general loss functions such as those described in [43]. Tensor (higher-order array) support is of interest to the mentors, but would require more comprehensive support for higher-order arrays in `Armadillo` or `Eigen`.
## Expected Impact
Multivariate Analysis techniques are indispensable for true Data-Driven Discovery, but a unified and easy-to-use framework has been lacking to date. Individual packages handle certain special cases covered by `MoMA`, but for advanced cases, no standard packaged solution is available. The `MoMA` package will provide the first unified toolbox for all forms of high-dimensional multivariate analysis in `R` or any other language. `MoMA` will empower statisticians and data scientists to flexibly find patterns that respect the specific structure of their data and allow for truly Modern Multivariate Analysis.
## Mentors
- Genevera Allen ([email protected]): Theory and Algorithms
  Departments of Statistics, CS, and ECE, Rice University
  Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children's Hospital
- Michael Weylandt ([email protected]): Implementation
  Department of Statistics, Rice University
## Required Skills
- Convex Optimization
- Coding with `R`, `C++`, and `Rcpp`
## Tests
Potential applicants should:

- Implement the SFPCA algorithm [21, Algorithm 1] for a rank-one approximation in `C++`, for the special case of L1 ("lasso") penalization on both the left and right singular vectors and user-defined smoothing matrices (see [40] for details of the algorithm);
- Wrap their implementation using `Rcpp`;
- Test their implementation using `testthat` (a sketch of one such test appears after this list);
- Package their implementation and pass `R CMD check` on at least two of the three major platforms: Windows, MacOS, and Linux (Debian/Ubuntu).
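For concreteness, a `testthat` check might verify that the implementation reduces to the ordinary SVD when no penalization is applied. Here `sfpca_cpp` is a hypothetical name for the applicant's `Rcpp`-wrapped routine, and the argument names are assumptions, not a prescribed interface:

```r
library(testthat)

test_that("rank-one SFPCA reduces to the SVD with no penalization", {
  set.seed(42)
  X <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)
  # sfpca_cpp() is the applicant's Rcpp wrapper (hypothetical name and
  # signature); zero penalties and identity smoothers should recover the SVD.
  fit <- sfpca_cpp(X, lambda_u = 0, lambda_v = 0,
                   Omega_u = diag(nrow(X)), Omega_v = diag(ncol(X)))
  sv <- svd(X, nu = 1, nv = 1)
  # Singular vectors are identified only up to a sign flip.
  expect_equal(abs(drop(fit$u)), abs(sv$u[, 1]), tolerance = 1e-6)
  expect_equal(abs(drop(fit$v)), abs(sv$v[, 1]), tolerance = 1e-6)
})
```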
## Solutions of Tests
Students, please add a link to your solutions below. You are encouraged to check the output of your implementation against the Matlab version [39]. Mentors will check that the package passes `R CMD check` without any `WARNING`(s) or `ERROR`(s).
- Solution - Luofeng Liao
- Solution - Ruijin Lu
## Coding Milestones
### Community Bonding Period
The student will become familiar with the GitHub workflow (including use of TravisCI and CodeCov) as well as the `MoMA` algorithm. Mentors will set up package infrastructure on GitHub.
### Coding Period
#### Phase I
Weeks 1 - 2:

- Core `MoMA` Algorithm: Sparsity (L1 + SCAD + MCP), Group Sparsity, Smoothing

Week 3:

- Core `MoMA` Algorithm: Orthogonalization, Non-Negativity

Weeks 4 - 5:

- Core `MoMA` Algorithm: Fusion Penalties
#### Phase II
Weeks 6 - 7:

- PCA, PLS, CCA, LDA wrappers

Week 8:

- Parameter Selection by EBIC for all methods & by Cross-Validation for PCA

Weeks 9 - 10:

- Parameter Selection by Cross-Validation for PLS, CCA, LDA
#### Phase III
Week 11:

- Function Level Documentation
- Algorithm + Timing Vignette

Week 12:

- Neuroscience Vignette

Week 13:

- GitHub pages + `pkgdown` documentation
- "Polishing" as needed
## References
[1] K. Pearson. "On Lines and Planes of Closest Fit to Systems of Points in Space." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, p.559-572, 1901. https://doi.org/10.1080/14786440109462720
[2] H. Hotelling. "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology 24(6), p.417-441, 1933. http://dx.doi.org/10.1037/h0071325
[3] H. Hotelling. "Relations Between Two Sets of Variates." Biometrika 28(3-4), p.321-377, 1936. https://doi.org/10.1093/biomet/28.3-4.321
[4] See the CRAN Task View: Multivariate Statistics (https://CRAN.R-project.org/view=Multivariate).
[5] I. Johnstone, A. Lu. "On Consistency and Sparsity for Principal Components Analysis in High Dimensions." Journal of the American Statistical Association: Theory and Methods 104(486), p.682-693, 2009. https://doi.org/10.1198/jasa.2009.0121
[6] B. Silverman. "Smoothed Functional Principal Components Analysis by Choice of Norm." Annals of Statistics 24(1), p.1-24, 1996. https://projecteuclid.org/euclid.aos/1033066196
[7] J. Huang, H. Shen, A. Buja. "Functional Principal Components Analysis via Penalized Rank One Approximation." Electronic Journal of Statistics 2, p.678-695, 2008. https://projecteuclid.org/euclid.ejs/1217450800
[8] I.T. Jolliffe, N.T. Trendafilov, M. Uddin. "A Modified Principal Component Technique Based on the Lasso." Journal of Computational and Graphical Statistics 12(3), p.531-547, 2003. https://doi.org/10.1198/1061860032148
[9] H. Zou, T. Hastie, R. Tibshirani. "Sparse Principal Component Analysis." Journal of Computational and Graphical Statistics 15(2), p.265-286, 2006. https://doi.org/10.1198/106186006X113430
[10] A. d'Aspremont, L. El Ghaoui, M.I. Jordan, G.R.G. Lanckriet. "A Direct Formulation for Sparse PCA Using Semidefinite Programming." SIAM Review 49(3), p.434-448, 2007. https://doi.org/10.1137/050645506
[11] A. d'Aspremont, F. Bach, L. El Ghaoui. "Optimal Solutions for Sparse Principal Component Analysis." Journal of Machine Learning Research 9, p.1269-1294, 2008. http://www.jmlr.org/papers/v9/aspremont08a.html
[12] D. Witten, R. Tibshirani, T. Hastie. "A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis." Biostatistics 10(3), p.515-534, 2009. https://doi.org/10.1093/biostatistics/kxp008
[13] R. Jenatton, G. Obozinski, F. Bach. "Structured Sparse Principal Component Analysis." Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010. http://proceedings.mlr.press/v9/jenatton10a.html
[14] G.I. Allen, M. Maletic-Savatic. "Sparse Non-Negative Generalized PCA with Applications to Metabolomics." Bioinformatics 27(21), p.3029-3035, 2011. https://doi.org/10.1093/bioinformatics/btr522
[15] A.A. Amini, M.J. Wainwright. "High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components." Annals of Statistics 37(5B), p.2877-2921, 2009. https://projecteuclid.org/euclid.aos/1247836672
[16] S. Jung, J.S. Marron. "PCA Consistency in High-Dimension, Low Sample Size Context." Annals of Statistics 37(6B), p.4104-4130, 2009. https://projecteuclid.org/euclid.aos/1256303538
[17] Z. Ma. "Sparse Principal Component Analysis and Iterative Thresholding." Annals of Statistics 41(2), p.772-801, 2013. https://projecteuclid.org/euclid.aos/1368018173
[18] T.T. Cai, Z. Ma, Y. Wu. "Sparse PCA: Optimal Rates and Adaptive Estimation." Annals of Statistics 41(6), p.3074-3110, 2013. https://projecteuclid.org/euclid.aos/1388545679
[19] V.Q. Vu, J. Lei. "Minimax Sparse Principal Subspace Estimation in High Dimensions." Annals of Statistics 41(6), p.2905-2947, 2013. https://projecteuclid.org/euclid.aos/1388545673
[20] D. Shen, H. Shen, J.S. Marron. "Consistency of Sparse PCA in High Dimension, Low Sample Size Contexts." Journal of Multivariate Analysis 115, p.317-333, 2013. https://doi.org/10.1016/j.jmva.2012.10.007
[21] G.I. Allen. "Sparse and Functional Principal Components Analysis." ArXiv Pre-Print 1309.2895 (2013). https://arxiv.org/abs/1309.2895
[22] D. Eddelbuettel, R. Francois. "`Rcpp`: Seamless `R` and `C++` Integration." Journal of Statistical Software 40(8), p.1-18, 2011. https://doi.org/10.18637/jss.v040.i08
[23] D. Eddelbuettel, C. Sanderson. "`RcppArmadillo`: Accelerating `R` with High-Performance `C++` Linear Algebra." Computational Statistics and Data Analysis 71, p.1054-1063, 2014. https://doi.org/10.1016/j.csda.2013.02.005
[24] C. Sanderson, R. Curtin. "`Armadillo`: A Template-Based `C++` Library for Linear Algebra." Journal of Open Source Software 1(2), p.26, 2016. https://doi.org/10.21105/joss.00026
[25] A. Beck, M. Teboulle. "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems." SIAM Journal on Imaging Sciences 2(1), p.183-202, 2009. https://doi.org/10.1137/080716542
[26] R. Tibshirani. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B. 58(1), p.267-288, 1996. http://www.jstor.org/stable/2346178
[27] J. Fan, R. Li. "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association 96(456), p.1348-1360, 2001. https://doi.org/10.1198/016214501753382273
[28] C.-H. Zhang. "Nearly Unbiased Variable Selection Under Minimax Concave Penalty." Annals of Statistics 38(2), p.894-942, 2010. https://projecteuclid.org/euclid.aos/1266586618
[29] M. Yuan, Y. Lin. "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society, Series B 68(1), p.49-67, 2006. https://doi.org/10.1111/j.1467-9868.2005.00532.x
[30] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, K. Knight. "Sparsity and Smoothness via the Fused Lasso." Journal of the Royal Statistical Society, Series B 67(1), p. 91-108, 2005. https://doi.org/10.1111/j.1467-9868.2005.00490.x
[31] T.D. Hocking, A. Joulin, F. Bach, J.-P. Vert. "ClusterPath: An Algorithm for Clustering using Convex Fusion Penalties." Proceedings of the 28th International Conference on Machine Learning (ICML) 2011.
[32] E.C. Chi, K. Lange. "Splitting Methods for Convex Clustering." Journal of Computational and Graphical Statistics 24(4), p.994-1013, 2015. https://doi.org/10.1080/10618600.2014.948181
[33] J. Chen, Z. Chen. "Extended Bayesian Information Criteria for Model Selection with Large Model Spaces." Biometrika 95(3), p.759-771, 2008. https://doi.org/10.1093/biomet/asn034
[34] For example, the `elasticnet`, `nsprcomp`, and `rospca` packages.
[35] For example, the `fda`, `fdapace`, and `fdasrvf` packages.
[36] Available on CRAN: `spls`
[37] Available on CRAN: `RGCCA`
[38] Available on CRAN: `PMA`
[39] http://www.stat.rice.edu/~gallen/software/sfpca/
[40] Slide 25 of http://www.stat.rice.edu/~gallen/gallen_sfpca_talk.pdf
[41] G.I. Allen, L. Grosenick, J. Taylor. "A Generalized Least-Square Matrix Decomposition." Journal of the American Statistical Association 109(505), p.145-159, 2014. https://doi.org/10.1080/01621459.2013.852978
[42] G.I. Allen, M. Weylandt. "`MoMA`: A General Algorithm for Modern Multivariate Analysis." (In Preparation)
[43] M. Udell, C. Horn, R. Zadeh, S. Boyd. "Generalized Low Rank Models." Foundations and Trends in Machine Learning 9(1), p.1-118, 2016. https://doi.org/10.1561/2200000055