Optimization 5: Numerical Algorithms

In many real-world applications of computer science, one often encounters numerical optimization problems of the form: $$ \min_{x \in \mathcal{C}} f(x) $$ where

  • $x \in \mathbb{R}^d$ is within a (closed) constraint set $\mathcal{C} \subseteq \mathbb{R}^d$
  • $f: \mathbb{R}^d \to \mathbb{R}$ is the objective function

1. Gradient-based approach

1.0 Basic Gradient method: Unconstrained optimization problems

When we say unconstrained, it means $\mathcal{C} = \mathbb{R}^d$.

Gradient Descent Method

  • Input: $f: \mathbb{R}^d \to \mathbb{R}$
  • Initialize: $x_0 \in \mathbb{R}^d$
  • Parameter: $T$, the number of iterations
  • for $t = 0$ to $T-1$:
    • Choose step size $\eta_t$
    • $x_{t+1} = x_t - \eta_t \nabla f(x_t)$
  • Output: $x_T$
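As a reference, here is a minimal NumPy sketch of the loop above with a constant step size $\eta_t = \eta$. The objective `f`, its gradient `grad_f`, the starting point `x0`, and the step size are supplied by the caller; the fixed step size is a simplification of the line-search schemes described next.

```python
import numpy as np

def gradient_descent(f, grad_f, x0, T=1000, eta=0.1):
    """Gradient descent with a constant step size.

    f      : callable, R^d -> R (accepted to mirror the pseudocode; only
             its gradient is used when the step size is fixed)
    grad_f : callable, R^d -> R^d, gradient of f
    x0     : ndarray, starting point
    T      : number of iterations
    eta    : constant step size
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)      # x_{t+1} = x_t - eta_t * grad f(x_t)
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose unique minimizer is 0.
x_star = gradient_descent(lambda x: 0.5 * x @ x, lambda x: x,
                          x0=np.array([3.0, -4.0]))
print(x_star)   # close to [0, 0]
```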

Two schemes for selecting $\eta_t$

Exact line-search

$$ \eta_t \in \arg\min_{\eta>0} f(x_t - \eta \nabla f(x_t)) $$
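In practice this one-dimensional problem is solved only approximately. A hedged sketch using `scipy.optimize.minimize_scalar` as the scalar solver; the upper bound `eta_max` is an arbitrary cap for illustration, not part of the method.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search(f, x_t, grad_t, eta_max=10.0):
    """Approximately solve min_{eta > 0} f(x_t - eta * grad_t).

    grad_t is the gradient of f at x_t; eta_max is a heuristic cap on
    the step size so that a bounded scalar minimizer can be used.
    """
    res = minimize_scalar(lambda eta: f(x_t - eta * grad_t),
                          bounds=(0.0, eta_max), method="bounded")
    return res.x
```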

Inexact line-search

  • Input: $f:\mathbb{R}^d \to \mathbb{R}, \mathbf{x}_t$
  • Parameter: $\alpha \in (0,1), \beta \in (0,1)$
  • $\eta_t = 1$
  • while $f(x_t - \eta_t \nabla f(x_t)) > f(x_t) - \alpha \eta_t \lVert \nabla f(x_t) \rVert_2^2$
    • $\eta_t = \beta \eta_t$
  • Output: $\eta_t$
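This is the backtracking (Armijo) rule. A direct translation into Python, assuming `f` and `grad_f` are the caller's objective and gradient:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x_t, alpha=0.3, beta=0.8):
    """Shrink eta by a factor beta until the sufficient-decrease condition
    f(x - eta*g) <= f(x) - alpha * eta * ||g||^2 holds."""
    g = grad_f(x_t)
    eta = 1.0
    while f(x_t - eta * g) > f(x_t) - alpha * eta * float(np.dot(g, g)):
        eta *= beta
    return eta
```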

1.1 Newton's method (牛顿法)

1.2 Quasi-Newton method (拟牛顿法)

1.3 BFGS

1.4 L-BFGS

1.5 L-BFGS-B

1.6 Nesterov

Many machine learning algorithms have an objective function of the form: $$ \min_{\mathbf{w}} J(\mathbf{w}) = L_{emp}(\mathbf{w}) + \lambda \Omega(\mathbf{w}) $$ where

  • $L_{emp}$ is the empirical loss function $$ L_{emp}(\mathbf{w}) := \frac{1}{n} \sum_{i=1}^n \ell(\mathbf{x}_i, \mathbf{y}_i; \mathbf{w}) $$
    • typically convex and non-negative, but it can be non-smooth and, for some models, non-convex
  • $\mathbf{w}$: the parameters to be learned;
  • $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^n$: the training data;
  • $\Omega(\mathbf{w})$: convex and non-negative regularization term;
  • $\lambda$: tuning parameter

(1) Examples

| Model | Cost function |
|-------|---------------|
| Linear SVM | $\frac{1}{n}\sum_{i=1}^n \max(0, 1-y_i\langle \mathbf{x}_i, \mathbf{w}\rangle) + \frac{\lambda}{2}\lVert \mathbf{w}\rVert_2^2$ |
| $\ell_1$-regularized logistic regression | $\frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-y_i\langle \mathbf{x}_i, \mathbf{w}\rangle)) + \lambda \lVert \mathbf{w}\rVert_1$ |
| $\epsilon$-insensitive regression (SVR) | $\frac{1}{n}\sum_{i=1}^n \max(0, \lvert y_i - \langle \mathbf{x}_i, \mathbf{w}\rangle \rvert - \epsilon) + \frac{\lambda}{2}\lVert \mathbf{w}\rVert_2^2$ |
| LASSO | $\lVert \mathbf{Xw} - \mathbf{y} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_1$ |

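As a concrete instance of this family, here is a sketch of the linear-SVM objective from the first row of the table, together with one valid subgradient of its (non-smooth) hinge-loss term. The function names and the data layout (rows of `X` are samples, `y` holds labels in {-1, +1}) are assumptions for illustration.

```python
import numpy as np

def linear_svm_objective(w, X, y, lam):
    """J(w) = (1/n) * sum_i max(0, 1 - y_i <x_i, w>) + (lam/2) * ||w||^2."""
    margins = 1.0 - y * (X @ w)                 # shape (n,)
    hinge = np.maximum(0.0, margins).mean()
    return hinge + 0.5 * lam * float(w @ w)

def linear_svm_subgradient(w, X, y, lam):
    """One subgradient of the hinge-loss objective above."""
    margins = 1.0 - y * (X @ w)
    active = (margins > 0).astype(float)        # samples violating the margin
    g_hinge = -(active * y) @ X / X.shape[0]    # -(1/n) * sum_{active} y_i x_i
    return g_hinge + lam * w
```

Plugging `linear_svm_subgradient` into the gradient-descent loop of Section 1.0 yields the subgradient method for the linear SVM.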

2. Direct method

2.1 Nelder-Mead

2.2

  1. Equality constraints (等式约束) can easily be handled using the method of Lagrange multipliers (拉格朗日乘子法);
  2. Inequality constraints (不等式约束) can be handled via the Karush-Kuhn-Tucker (KKT) conditions, which generalize the Lagrange multiplier approach;
  3. The alternating direction method of multipliers (ADMM, 交替方向乘子法) can be used for any convex objective, such as the $L_1$-regularized problem $\min_{\mathbf{w}} \sum_i f_i(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1$. For ADMM, the loss function need not be differentiable, but it should be convex. The advantage of ADMM is that it is easy to implement and to parallelize; however, it converges slowly (a sketch for the LASSO appears below).

We emphasize that for any particular problem, it is likely that another method will perform better than ADMM, or that some variation on ADMM will substantially improve performance.
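To make item 3 concrete, here is a minimal sketch of ADMM applied to the LASSO objective from the table above, using the conventional $\frac{1}{2}\lVert \mathbf{Xw}-\mathbf{y}\rVert_2^2$ scaling of the quadratic term. The penalty parameter `rho`, the iteration count, and the function names are illustrative choices; a practical implementation would also track primal and dual residuals for a stopping test.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(X, y, lam, rho=1.0, n_iter=200):
    """ADMM for min_w 0.5 * ||Xw - y||_2^2 + lam * ||w||_1.

    The variable is split as w (quadratic part) and z (l1 part),
    with a scaled dual variable u enforcing w = z."""
    _, d = X.shape
    w, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    A = X.T @ X + rho * np.eye(d)     # fixed matrix used in every w-update
    Xty = X.T @ y
    for _ in range(n_iter):
        w = np.linalg.solve(A, Xty + rho * (z - u))   # ridge-like subproblem
        z = soft_threshold(w + u, lam / rho)          # l1 proximal step
        u = u + w - z                                 # dual update
    return z
```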

  1. Gradient descent (GD, 梯度下降), conjugate gradient (CG, 共轭梯度), BFGS, and L-BFGS-B work only with differentiable functions. GD, CG, and BFGS are designed for unconstrained problems, whereas L-BFGS-B additionally handles simple bound (box) constraints.
  2. Gradient descent can be modified to handle constraints by projecting each iterate back onto the constraint set; the resulting algorithms are called projected gradient descent (see the sketch after this list).
  3. If some further technical conditions on the objective $f$ hold (e.g., smoothness and strong convexity), gradient descent and its variants (projected gradient descent, proximal gradient descent, etc.) have worst-case convergence bounds that beat ADMM.
  4. What ADMM does give you is the ability to perform distributed optimization (分布式计算) in a generic way (it will work for any convex problem). It lets you break a larger problem into smaller, more manageable chunks, and each chunk can be solved with your favorite solver (including GD).
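A minimal sketch of the projected-gradient idea from item 2: take a gradient step, then project back onto $\mathcal{C}$. The Euclidean-ball projection below is just one convenient example of a constraint set with a cheap projection; in general the projection operator is problem-specific.

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gradient_descent(grad_f, project, x0, T=500, eta=0.1):
    """Iterate x_{t+1} = Proj_C(x_t - eta * grad f(x_t))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = project(x - eta * grad_f(x))
    return x

# Example: minimize 0.5 * ||x - c||^2 over the unit ball.
c = np.array([3.0, 0.0])
x_star = projected_gradient_descent(lambda x: x - c, project_onto_ball,
                                    x0=np.zeros(2))
print(x_star)   # close to [1, 0], the projection of c onto the ball
```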