# MoMA: Modern Multivariate Analysis in R
## Background
Multivariate analysis -- the study of finding meaningful patterns in datasets -- is a key technique in any data scientist's toolbox. Beyond its use for Exploratory Data Analysis ("EDA"), multivariate analysis also allows for principled Data-Driven Discovery: finding meaningful, actionable, and reproducible structure in large data sets. Classical techniques for multivariate analysis have proven immensely successful throughout their history, but modern Data-Driven Discovery requires new techniques to account for the specific complexities of modern data. We propose a unified framework for Modern Multivariate Analysis ("MoMA"), which will provide a flexible baseline for future research in multivariate analysis. Even more importantly, we anticipate that an easy-to-use `R` package will increase adoption of these powerful new models by end users and, in conjunction with `R`'s rich graphics libraries, position `R` as the leading platform for modern exploratory data analysis and data-driven discovery.
Multivariate analysis techniques date back to the earliest days of statistics, pre-dating other foundational concepts like hypothesis testing by several decades. Classical techniques such as Principal Components Analysis ("PCA") [1, 2], Partial Least Squares ("PLS"), Canonical Correlation Analysis ("CCA") [3], and Linear Discriminant Analysis ("LDA"), have a long and distinguished history of use in statistics and are still among the most widely used methods for EDA. Their importance is reflected in the CRAN Task View dedicated to Multivariate Analysis [4], as well as the specialized implementations available for a range of application areas. Somewhat surprisingly, each of these techniques can be interpreted as a variant of the well-studied eigendecomposition problem, allowing statisticians to build upon a rich mathematical and computational literature.
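To make the eigendecomposition connection concrete, the following minimal sketch (illustrative only, not part of the proposed package) recovers the PCA loadings from an eigendecomposition of the sample covariance matrix and checks them against `stats::prcomp`:

```r
# PCA as an eigendecomposition: the loadings returned by stats::prcomp are the
# eigenvectors of the sample covariance matrix (up to column signs).
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)

eig <- eigen(cov(X), symmetric = TRUE)  # eigendecomposition of cov(X)
pca <- prcomp(X, center = TRUE)         # classical PCA via the SVD

# Compare loadings column-by-column, ignoring the arbitrary sign of each.
all.equal(abs(unname(eig$vectors)), abs(unname(pca$rotation)))
# TRUE (up to numerical tolerance)
```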
In the early 2000s, researchers noted that naive extensions of classical multivariate techniques to the high-dimensional setting produced unsatisfactory results, a finding later confirmed by advances in random matrix theory [5]. In response to these findings, multivariate analysis experienced a renaissance as researchers developed a wide array of new techniques to incorporate sparsity, smoothness, and other structure into classical techniques [6,7,8,9,10,11,12,13,14 among many others], resulting in a rich literature on "modern multivariate analysis." Around the same time, theoretical advances showed that these techniques avoided many of the pitfalls associated with naive extensions [15,16,17,18,19,20].
While this literature is vast, it relies on a single basic principle: for multivariate analysis to succeed, it is essential to adapt classical techniques to account for the known characteristics and complexities of the dataset at hand. For example, a neuroscientist investigating the brain's response to an external stimulus may expect a response which is simultaneously spatially smooth and sparse: spatially smooth because the brain processes related stimuli in well-localized areas (e.g., the visual cortex) and sparse because not all regions of the brain are used to respond to a given stimulus. Alternatively, a statistical factor model used to understand financial returns may be significantly improved by incorporating known industry sector data, motivating a form of group sparsity. A sociologist studying how pollution leads to higher levels of respiratory illness may combine spatial smoothness and sparsity (indicating "pockets" of activity) with a non-negativity constraint, since pollution can only increase, never decrease, rates of illness.
To incorporate these different forms of prior knowledge into multivariate analysis, a wide variety of algorithms and approaches have been proposed. In 2013, Allen proposed a general framework that unified existing techniques for "modern" PCA and introduced a number of novel extensions [21]. The mentors' recently developed `MoMA` algorithm [42] builds on this work, allowing more forms of regularization and structure, as well as supporting more forms of multivariate analysis.
In this project, we propose a new package to make modern multivariate analysis available to a wide audience. The proposed package will allow for fitting PCA, PLS, CCA, and LDA with all of the modern "bells and whistles": sparsity, smoothness, ordered and unordered fusion, orthogonalization with respect to arbitrary bases, and non-negativity constraints. Uniting this wide literature under a single umbrella using the `MoMA` algorithm will provide a unified and flexible platform for data-driven discovery in `R`.
## Related work
Plain PCA is implemented in base `R`'s `stats::prcomp` function. A number of packages which perform either sparse [34] or smooth [35] PCA can be found on CRAN. The `spls` [36] and `RGCCA` [37] packages implement sparse, but not smooth, PLS and CCA respectively.
The `PMA` package [38], based on [12], is the most similar in spirit to our package, providing a unified approach to PCA and CCA, as well as multi-way CCA, but it does not implement PLS or LDA. More importantly, it does not implement the smoothing options, orthogonalization, or unordered fusion (clustering) penalties that the `MoMA` algorithm provides.
A Matlab implementation of the basic SFPCA algorithm is available online [39] but does not implement the full `MoMA` algorithm.
## Coding Project Details
- **Core Algorithm** (Estimated time: 5 Weeks)

  The iterative `MoMA` algorithm forms the core of the package and will be implemented in `C++` using `Rcpp` and `RcppArmadillo` for speed [22, 23, 24]. The algorithm is based on repeated application of the ISTA/FISTA (proximal gradient) scheme of Beck and Teboulle [25]. In addition to the options discussed in [21], the package will include a generalized algorithm with options for:

  - L1 ("Lasso") [26], SCAD [27], and MCP [28] sparsity-inducing penalization
  - Group sparsity [29]
  - Ordered fusion using the "fused lasso" [30]
  - Unordered fusion using a convex clustering penalty [31, 32]
  - Smoothing with respect to an arbitrary basis
  - Orthogonalization with respect to an arbitrary basis
  - Non-negativity constraints

  Each of these options may be applied to the left singular vectors, the right singular vectors, or both. A minimal sketch of the rank-one special case appears after this list.
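  To fix ideas, here is a minimal `R` sketch of the simplest special case: a rank-one fit with L1 penalties only and no smoothing. It is illustrative only, not the package implementation; with smoothing matrices present, the closed-form block updates below would be replaced by ISTA/FISTA proximal-gradient iterations [25].

  ```r
  # Soft-thresholding: the proximal operator of the L1 penalty.
  soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

  # Rank-one sparse SVD by alternating penalized updates. With only L1
  # penalties and unit-ball constraints, each block update has a closed form;
  # the full MoMA/SFPCA algorithm generalizes this step to the other penalties
  # above via proximal-gradient iterations.
  sparse_rank1 <- function(X, lambda_u = 0, lambda_v = 0,
                           max_iter = 200L, tol = 1e-7) {
    v <- svd(X, nu = 0, nv = 1)$v[, 1]  # warm start from the leading SVD factor
    u <- numeric(nrow(X))
    for (iter in seq_len(max_iter)) {
      u_old <- u
      u <- drop(soft_threshold(X %*% v, lambda_u))
      if (any(u != 0)) u <- u / sqrt(sum(u^2))   # renormalize to the unit ball
      v <- drop(soft_threshold(crossprod(X, u), lambda_v))
      if (any(v != 0)) v <- v / sqrt(sum(v^2))
      if (sqrt(sum((u - u_old)^2)) < tol) break  # u has stabilized; stop
    }
    list(u = u, v = v, d = drop(crossprod(u, X %*% v)))
  }
  ```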
- **User Interface and Tuning Parameter Selection** (Estimated time: 5 Weeks)

  Once the `MoMA` algorithm is implemented, wrapper functions for PCA, PLS, CCA, and LDA will be implemented. Additionally, helper functions to select regularization parameters (via cross-validation, BIC, or extended BIC [33]) will be provided; a sketch of one such selector appears below.
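  As an illustration of the intended workflow, regularization parameters could be chosen by minimizing a BIC-type criterion over a grid. The function below is a hypothetical sketch built on the `sparse_rank1` prototype above, not the final package interface:

  ```r
  # Hypothetical sketch: choose a common lambda for both factors by BIC, with
  # degrees of freedom approximated by the number of nonzero loadings.
  select_lambda_bic <- function(X, lambdas) {
    n <- nrow(X); p <- ncol(X)
    bic <- vapply(lambdas, function(lam) {
      fit <- sparse_rank1(X, lambda_u = lam, lambda_v = lam)
      resid <- X - fit$d * tcrossprod(fit$u, fit$v)  # rank-one residual matrix
      df <- sum(fit$u != 0) + sum(fit$v != 0)
      n * p * log(mean(resid^2)) + log(n * p) * df   # BIC-style criterion
    }, numeric(1))
    lambdas[which.min(bic)]
  }

  # Usage: best_lambda <- select_lambda_bic(X, lambdas = seq(0, 2, by = 0.1))
  ```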
- **Visualization and Documentation** (Estimated time: 3 Weeks)

  Finally, the package requires extensive documentation and visualization tools to be helpful. We will implement some basic all-purpose visualizations and write two vignettes: one with a detailed example application to neuroscience, loosely based on Section 6 of [21], and one describing the implementation details of the package.
Future development (not in scope for GSoC-2018) may incorporate the Generalized PCA algorithm, with natural extensions to PLS and CCA [41]; Multi-Way CCA; missing value support (including imputation); or more general loss functions such as those described in [43]. Tensor (higher-order array) support is of interest to the mentors, but would require more comprehensive support for higher-order arrays in `Armadillo` or `Eigen`.
## Expected Impact
Multivariate Analysis techniques are indispensable for true Data-Driven Discovery, but a unified and easy-to-use framework has been lacking to date. Individual packages handle certain special cases covered by `MoMA`, but for advanced cases, no standard packaged solution is available. The `MoMA` package will provide the first unified toolbox for all forms of high-dimensional multivariate analysis in `R` or any other language. `MoMA` will empower statisticians and data scientists to flexibly find patterns that respect the specific structure of their data and allow for truly Modern Multivariate Analysis.
## Mentors
- Genevera Allen ([email protected]): Theory and Algorithms
  Departments of Statistics, CS, and ECE, Rice University
  Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children's Hospital
- Michael Weylandt ([email protected]): Implementation
  Department of Statistics, Rice University
## Required Skills
- Convex Optimization
- Coding with `R`, `C++`, and `Rcpp`
## Tests
Potential applicants should:

- Implement the SFPCA algorithm [21, Algorithm 1] for a rank-one approximation in `C++`, for the special case of L1 ("lasso") penalization on both the left and right singular vectors and user-defined smoothing matrices (see [40] for details of the algorithm);
- Wrap their implementation using `Rcpp`;
- Test their implementation using `testthat` (a sketch of one such test appears after this list);
- Package their implementation and pass `R CMD check` on at least two of the three major platforms: Windows, MacOS, and Linux (Debian/Ubuntu).
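For concreteness, a `testthat` check might verify that the implementation reduces to the ordinary SVD when no penalization is applied. Here `sfpca_cpp` is a hypothetical name for the applicant's `Rcpp`-wrapped routine, and the argument names are assumptions, not a prescribed interface:

```r
library(testthat)

test_that("rank-one SFPCA reduces to the SVD with no penalization", {
  set.seed(42)
  X <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)
  # sfpca_cpp() is the applicant's Rcpp wrapper (hypothetical name and
  # signature); zero penalties and identity smoothers should recover the SVD.
  fit <- sfpca_cpp(X, lambda_u = 0, lambda_v = 0,
                   Omega_u = diag(nrow(X)), Omega_v = diag(ncol(X)))
  sv <- svd(X, nu = 1, nv = 1)
  # Singular vectors are identified only up to a sign flip.
  expect_equal(abs(drop(fit$u)), abs(sv$u[, 1]), tolerance = 1e-6)
  expect_equal(abs(drop(fit$v)), abs(sv$v[, 1]), tolerance = 1e-6)
})
```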
## Solutions of Tests
Students, please add a link to your solutions below. You are encouraged to check the output of your implementation against the Matlab version [39]. Mentors will check that the package passes `R CMD check` without any `WARNING`(s) or `ERROR`(s).
- Solution - Luofeng Liao
- Solution - Ruijin Lu
## Coding Milestones
### Community Bonding Period
The student will become familiar with the GitHub workflow (including use of TravisCI and CodeCov) as well as the `MoMA` algorithm. Mentors will set up package infrastructure on GitHub.
### Coding Period
#### Phase I
Weeks 1 - 2:

- Core `MoMA` Algorithm: Sparsity (L1 + SCAD + MCP), Group Sparsity, Smoothing

Week 3:

- Core `MoMA` Algorithm: Orthogonalization, Non-Negativity

Weeks 4 - 5:

- Core `MoMA` Algorithm: Fusion Penalties
#### Phase II
Weeks 6 - 7:

- PCA, PLS, CCA, LDA wrappers

Week 8:

- Parameter Selection by EBIC for all methods & by Cross-Validation for PCA

Weeks 9 - 10:

- Parameter Selection by Cross-Validation for PLS, CCA, LDA
#### Phase III
Week 11:

- Function Level Documentation
- Algorithm + Timing Vignette

Week 12:

- Neuroscience Vignette

Week 13:

- GitHub pages + `pkgdown` documentation
- "Polishing" as needed
## References
[1] K. Pearson. "On Lines and Planes of Closest Fit to Systems of Points in Space." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, p.559-572, 1901. https://doi.org/10.1080/14786440109462720
[2] H. Hotelling. "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology 24(6), p.417-441, 1933. http://dx.doi.org/10.1037/h0071325
[3] H. Hotelling. "Relations Between Two Sets of Variates." Biometrika 28(3-4), p.321-377, 1936. https://doi.org/10.1093/biomet/28.3-4.321
[4] See the CRAN Task View: Multivariate Statistics (https://CRAN.R-project.org/view=Multivariate).
[5] I. Johnstone, A. Lu. "On Consistency and Sparsity for Principal Components Analysis in High Dimensions." Journal of the American Statistical Association: Theory and Methods 104(486), p.682-693, 2009. https://doi.org/10.1198/jasa.2009.0121
[6] B. Silverman. "Smoothed Functional Principal Components Analysis by Choice of Norm." Annals of Statistics 24(1), p.1-24, 1996. https://projecteuclid.org/euclid.aos/1033066196
[7] J. Huang, H. Shen, A. Buja. "Functional Principal Components Analysis via Penalized Rank One Approximation." Electronic Journal of Statistics 2, p.678-695, 2008. https://projecteuclid.org/euclid.ejs/1217450800
[8] I.T. Jolliffe, N.T. Trendafilov, M. Uddin. "A Modified Principal Component Technique Based on the Lasso." Journal of Computational and Graphical Statistics 12(3), p.531-547, 2003. https://doi.org/10.1198/1061860032148
[9] H. Zou, T. Hastie, R. Tibshirani. "Sparse Principal Component Analysis." Journal of Computational and Graphical Statistics 15(2), p.265-286, 2006. https://doi.org/10.1198/106186006X113430
[10] A. d'Aspremont, L. El Ghaoui, M.I. Jordan, G.R.G. Lanckriet. "A Direct Formulation for Sparse PCA Using Semidefinite Programming." SIAM Review 49(3), p.434-448, 2007. https://doi.org/10.1137/050645506
[11] A. d'Aspremont, F. Bach, L. El Ghaoui. "Optimal Solutions for Sparse Principal Component Analysis." Journal of Machine Learning Research 9, p.1269-1294, 2008. http://www.jmlr.org/papers/v9/aspremont08a.html
[12] D. Witten, R. Tibshirani, T. Hastie. "A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis." Biostatistics 10(3), p.515-534, 2009. https://doi.org/10.1093/biostatistics/kxp008
[13] R. Jenatton, G. Obozinski, F. Bach. "Structured Sparse Principal Component Analysis." Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010. http://proceedings.mlr.press/v9/jenatton10a.html
[14] G.I. Allen, M. Maletic-Savatic. "Sparse Non-Negative Generalized PCA with Applications to Metabolomics." Bioinformatics 27(21), p.3029-3035, 2011. https://doi.org/10.1093/bioinformatics/btr522
[15] A.A. Amini, M.J. Wainwright. "High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components." Annals of Statistics 37(5B), p.2877-2921, 2009. https://projecteuclid.org/euclid.aos/1247836672
[16] S. Jung, J.S. Marron. "PCA Consistency in High-Dimension, Low Sample Size Context." Annals of Statistics 37(6B), p.4104-4130, 2009. https://projecteuclid.org/euclid.aos/1256303538
[17] Z. Ma. "Sparse Principal Component Analysis and Iterative Thresholding." Annals of Statistics 41(2), p.772-801, 2013. https://projecteuclid.org/euclid.aos/1368018173
[18] T.T. Cai, Z. Ma, Y. Wu. "Sparse PCA: Optimal Rates and Adaptive Estimation." Annals of Statistics 41(6), p.3074-3110, 2013. https://projecteuclid.org/euclid.aos/1388545679
[19] V.Q. Vu, J. Lei. "Minimax Sparse Principal Subspace Estimation in High Dimensions." Annals of Statistics 41(6), p.2905-2947, 2013. https://projecteuclid.org/euclid.aos/1388545673
[20] D. Shen, H. Shen, J.S. Marron. "Consistency of Sparse PCA in High Dimension, Low Sample Size Contexts." Journal of Multivariate Analysis 115, p.317-333, 2013. https://doi.org/10.1016/j.jmva.2012.10.007
[21] G.I. Allen. "Sparse and Functional Principal Components Analysis." ArXiv Pre-Print 1309.2895 (2013). https://arxiv.org/abs/1309.2895
[22] D. Eddelbuettel, R. Francois. "`Rcpp`: Seamless `R` and `C++` Integration." Journal of Statistical Software 40(8), p.1-18, 2011. https://doi.org/10.18637/jss.v040.i08
[23] D. Eddelbuettel, C. Sanderson. "`RcppArmadillo`: Accelerating `R` with High-Performance `C++` Linear Algebra." Computational Statistics and Data Analysis 71, p.1054-1063, 2014. https://doi.org/10.1016/j.csda.2013.02.005
[24] C. Sanderson, R. Curtin. "`Armadillo`: A Template-Based `C++` Library for Linear Algebra." Journal of Open Source Software 1(2), p.26, 2016. https://doi.org/10.21105/joss.00026
[25] A. Beck, M. Teboulle. "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems." SIAM Journal on Imaging Sciences 2(1), p.183-202, 2009. https://doi.org/10.1137/080716542
[26] R. Tibshirani. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B. 58(1), p.267-288, 1996. http://www.jstor.org/stable/2346178
[27] J. Fan, R. Li. "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties." Journal of the American Statistical Association 96(456), p.1348-1360, 2001. https://doi.org/10.1198/016214501753382273
[28] C.-H. Zhang. "Nearly Unbiased Variable Selection Under Minimax Concave Penalty." Annals of Statistics 38(2), p.894-942, 2010. https://projecteuclid.org/euclid.aos/1266586618
[29] M. Yuan, Y. Lin. "Model Selection and Estimation in Regression with Grouped Variables." Journal of the Royal Statistical Society, Series B 68(1), p.49-67, 2006. https://doi.org/10.1111/j.1467-9868.2005.00532.x
[30] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, K. Knight. "Sparsity and Smoothness via the Fused Lasso." Journal of the Royal Statistical Society, Series B 67(1), p. 91-108, 2005. https://doi.org/10.1111/j.1467-9868.2005.00490.x
[31] T.D. Hocking, A. Joulin, F. Bach, J.-P. Vert. "ClusterPath: An Algorithm for Clustering using Convex Fusion Penalties." Proceedings of the 28th International Conference on Machine Learning (ICML) 2011.
[32] E.C. Chi, K. Lange. "Splitting Methods for Convex Clustering." Journal of Computational and Graphical Statistics 24(4), p.994-1013, 2015. https://doi.org/10.1080/10618600.2014.948181
[33] J. Chen, Z. Chen. "Extended Bayesian Information Criteria for Model Selection with Large Model Spaces." Biometrika 95(3), p.759-771, 2008. https://doi.org/10.1093/biomet/asn034
[34] For example, the `elasticnet`, `nsprcomp`, and `rospca` packages.
[35] For example, the `fda`, `fdapace`, and `fdasrvf` packages.
[36] Available on CRAN: `spls`
[37] Available on CRAN: `RGCCA`
[38] Available on CRAN: `PMA`
[39] http://www.stat.rice.edu/~gallen/software/sfpca/
[40] Slide 25 of http://www.stat.rice.edu/~gallen/gallen_sfpca_talk.pdf
[41] G.I. Allen, L. Grosenick, J. Taylor. "A Generalized Least-Square Matrix Decomposition." Journal of the American Statistical Association 109(505), p.145-159, 2014. https://doi.org/10.1080/01621459.2013.852978
[42] G.I. Allen, M. Weylandt. "`MoMA`: A General Algorithm for Modern Multivariate Analysis." (In Preparation)
[43] M. Udell, C. Horn, R. Zadeh, S. Boyd. "Generalized Low Rank Models." Foundations and Trends in Machine Learning 9(1), p.1-118, 2016. https://doi.org/10.1561/2200000055