1. Sparsity: OLS with L1 vs. L2
Compare lasso vs. ridge regression
[simulation code]
Neuroimaging analyses are often highly underdetermined. For example, a typical fMRI dataset might contain about 100,000 voxels as features but only a few hundred stimuli. Given an m by n design matrix with n much larger than m, the linear system has infinitely many solutions (assuming the rows are linearly independent). In this case, the standard least squares estimator is not unique and suffers from large variance. Therefore, we need to introduce regularizers, such as an L1 or L2 penalty, to get a reasonable estimate. Let's run some simulations to understand the effect of the L1 and L2 penalty terms!
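For reference, the two estimators compared below minimize a penalized least squares objective and differ only in the penalty term (lambda controls the regularization strength):

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad \hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$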
In the following simulation, I will compare least squares with Lasso (L1) or Ridge (L2) regularization. I will show that Lasso is more accurate at reconstructing the true signal (i.e., the parameter vector) if the underlying signal is sparse.
First, I generate a standard normal random matrix X with 256 rows and 512 columns. The parameter vector, beta, is 512-dimensional with only 100 non-zero elements, each drawn from a standard normal distribution. Finally, I set y = X * beta. Our task is to reconstruct beta, given X and y.
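The linked simulation code contains the actual setup; here is a minimal sketch of the same data-generating process in Python with NumPy (variable names and the random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 256, 512, 100  # observations, features, non-zero weights

# Design matrix with i.i.d. standard normal entries.
X = rng.standard_normal((m, n))

# Sparse true parameter: k standard normal entries, the rest exactly zero.
beta = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
beta[support] = rng.standard_normal(k)

# Noiseless observations.
y = X @ beta
```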
Note that our system is underdetermined, so standard least squares has no unique solution. This is a common issue in MVPA analyses of fMRI data. Some people get around this issue by restricting the analysis to a few ROIs, which reduces the number of features. Here, however, we will fit a whole-brain model directly, using regularization.
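A sketch of the two regularized fits using scikit-learn, continuing from the data generated above (the penalty strengths are placeholders, not the tutorial's actual values):

```python
from sklearn.linear_model import Lasso, Ridge

# Penalty strengths here are illustrative; in practice they are tuned,
# e.g. with cross-validation (LassoCV / RidgeCV).
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)

beta_lasso = lasso.coef_
beta_ridge = ridge.coef_
```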
The plots below show the reconstruction results for the L1- and L2-regularized least squares models when 100 elements of the true beta are non-zero. The 45-degree line is drawn as a reference: if the estimated parameters perfectly matched the true parameters, all the points would sit on the line f(x) = x.
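One way to produce this kind of estimated-versus-true scatter plot with matplotlib (a sketch, assuming the estimates from the snippet above):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, est, name in zip(axes, (beta_lasso, beta_ridge), ('Lasso', 'Ridge')):
    ax.scatter(beta, est, s=10)
    lim = np.abs(beta).max()
    ax.plot([-lim, lim], [-lim, lim], 'k--')  # 45-degree reference: f(x) = x
    ax.set_xlabel('true weights')
    ax.set_ylabel('estimated weights')
    ax.set_title(name)
plt.tight_layout()
plt.show()
```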
If the true signal is even sparser, with only 50 non-zero elements, the Lasso estimate is quite accurate. The ridge estimates, by contrast, almost never set any element to exactly zero, due to the geometry of the L2 norm.
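This difference can be checked directly by counting exact zeros in each estimate (continuing the sketch above):

```python
# Lasso typically sets many coefficients to exactly zero; ridge almost never does.
print('exact zeros in Lasso estimate:', np.sum(beta_lasso == 0))
print('exact zeros in Ridge estimate:', np.sum(beta_ridge == 0))
```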
However, note that if the true signal is not sparse, Lasso misses a lot of the true signal. In the leftmost plot, many points are densely packed near zero, showing that Lasso estimates many truly non-zero weights as exactly zero.
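To reproduce this dense-signal case, one can simply make every element of beta non-zero and refit (again a sketch under the same assumptions):

```python
# Dense true parameter: all 512 elements non-zero.
beta_dense = rng.standard_normal(n)
y_dense = X @ beta_dense

lasso_dense = Lasso(alpha=0.1, fit_intercept=False).fit(X, y_dense)
# Many truly non-zero weights end up estimated as exactly zero.
print('exact zeros in Lasso estimate:', np.sum(lasso_dense.coef_ == 0))
```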
Finally, since both the Lasso and Ridge penalties shrink the magnitude of the weights, both estimators tend to underestimate the magnitude of the true weights.
[PLOTS]