CS7545_Sp23_Lecture_18: Online Exponential Optimization - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Guanghui Wang

Notes for Lecture 18

March 9, 2023

Scribes: Leyan Pan, Chen Lin

Recap

Online Convex Optimization

The algorithm for online convex optimization is as follows.

For $t=1,2,\dots,T$ do:

  1. Learner picks $w_t\in\mathcal{K} \subseteq \mathbb{R}^d$. Environment chooses a loss function $f_t: \mathcal{K} \rightarrow \mathbb{R}$.
  2. Learner observes $f_t$, suffers loss $f_t(w_t)$, and updates to $w_{t+1}$.

End For

We will assume that $\mathcal{K}$ and $f_t$ are convex.

Regret

$$\text{Reg}_T^{\mathcal{A}} =\sum_{t=1}^T f_t(w_t) - \sum_{t=1}^Tf_t(w^*),\quad\text{where } w^*=\underset{w\in\mathcal{K}}{\arg\min}\sum_{t=1}^T f_t(w)$$

Assumption 1: $\forall w,w'\in \mathcal{K}$, $||w-w'||_2\leq D$.

Assumption 2: $\forall w\in \mathcal{K}, t\in[T]$, $||\nabla f_t(w)||_2\leq G$.

$$\text{OGD} \begin{cases} w_{t+1}'=w_t-\eta_t\nabla f_t(w_t)\\ w_{t+1}=\Pi_{\mathcal{K}}[w_{t+1}'] \end{cases} $$

$$\text{Reg}_T^{\text{OGD}} \begin{cases} \mathcal{O}\left(\sqrt{T}\right),&\eta_t=\frac{D}{G\sqrt{T}},\quad\text{Convex}\\ \mathcal{O}\left(\frac{\log T}{\lambda}\right),&\eta_t=\frac{1}{\lambda t},\quad\lambda\text{-SC} \end{cases} $$
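For reference, the OGD update above can be sketched in code. The quadratic loss, feasible set (unit ball), and step size below are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def ogd(grad, project, w1, eta, T):
    """Online Gradient Descent sketch.

    grad(t, w) -> gradient of f_t at w (hypothetical loss oracle)
    project(w) -> Euclidean projection of w onto the feasible set K
    eta(t)     -> step size at round t
    """
    w = np.asarray(w1, dtype=float)
    iterates = [w.copy()]
    for t in range(1, T + 1):
        w_next = w - eta(t) * grad(t, w)   # w'_{t+1} = w_t - eta_t * grad f_t(w_t)
        w = project(w_next)                # w_{t+1} = Pi_K[w'_{t+1}]
        iterates.append(w.copy())
    return iterates

# Illustration: f_t(w) = 0.5 * ||w - c||^2 with K the unit ball
c = np.array([2.0, 0.0])
project_ball = lambda w: w if np.linalg.norm(w) <= 1 else w / np.linalg.norm(w)
iterates = ogd(lambda t, w: w - c, project_ball, np.zeros(2), lambda t: 0.1, 50)
```

Since $c$ lies outside the unit ball, the iterates settle at its projection onto the boundary.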

Online Optimization for $\alpha$-Exponentially-Concave Functions

Definition 1: A function $f:\mathcal{K}\rightarrow \mathbb{R}$ is $\alpha$-exp-concave iff $\exp(-\alpha f(w))$ is concave.

Note: All $\alpha$-exponentially-concave functions are convex, i.e., $f(w)$ in the definition is convex.

Example:

  1. $$f_t(w)=\log\left(\frac{1}{\theta_t^\top w}\right)$$

It is 1-exp-concave since $\exp(-f_t(w))=\theta_t^\top w$, which is a linear function and thus concave.

This is an important function in portfolio management, as we will show.

Suppose we have $d$ stocks and initial wealth $M_1$.

In each round $t$:

  1. Learner picks $w_t\in \Delta^d$ (the $d$-dimensional probability simplex).
  2. Market returns a price-ratio vector $\theta_t\in\mathbb{R}_+^d$.
  3. Wealth after iteration $t$: $M_{t+1}=M_t\cdot(\theta_t^\top w_t)$.

Note: For (2), we have $\theta_{t,i}=\frac{\text{price of the }i^{\text{th}}\text{ stock at round }t+1}{\text{price of the }i^{\text{th}}\text{ stock at round }t}$.

For (3), by recursion, we have $\displaystyle M_{T+1}=M_1\cdot\prod_{t=1}^T \theta_t^\top w_t.$

Goal: maximize $$\frac{M_{T+1}}{M_1}=\prod_{t=1}^T\theta_t^\intercal w_t$$

In online learning we work with a sum of losses, so to cast the problem in the online learning framework, take logarithms: maximizing the product above is equivalent to maximizing

$$\log\frac{M_{T+1}}{M_1}=\sum_{t=1}^T\log\left(\theta_t^\top w_t\right)$$

i.e., minimizing $$-\log\frac{M_{T+1}}{M_1}=\sum_{t=1}^T\log\left(\frac{1}{\theta_t^\top w_t}\right)=\sum_{t=1}^T f_t(w_t)$$
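A quick numerical check of this equivalence between log-wealth and cumulative loss (the price ratios and the fixed uniform portfolio are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 3
thetas = rng.uniform(0.9, 1.1, size=(T, d))   # hypothetical price-ratio vectors
w = np.full(d, 1.0 / d)                        # a fixed uniform portfolio in the simplex

# Wealth recursion: M_{t+1} = M_t * (theta_t . w_t), with M_1 = 1
M = 1.0
for theta in thetas:
    M *= theta @ w

# Cumulative loss: sum_t f_t(w) with f_t(w) = log(1 / theta_t . w)
loss_sum = sum(np.log(1.0 / (theta @ w)) for theta in thetas)

# Minimizing the cumulative loss is exactly maximizing log-wealth
assert np.isclose(-np.log(M), loss_sum)
```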

Theorem: A twice-differentiable function is $\alpha$-exp-concave iff $\forall w\in \mathcal{K}$, $$\nabla^2 f(w)\succeq \alpha \nabla f(w)\nabla f(w)^\intercal \succeq 0,$$ i.e., $\nabla^2 f(w)$ is PSD.
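For instance, for the portfolio loss $f(w)=\log(1/\theta^\top w)$ the gradient is $\nabla f(w)=-\theta/(\theta^\top w)$ and the Hessian is $\nabla^2 f(w)=\theta\theta^\top/(\theta^\top w)^2=\nabla f(w)\nabla f(w)^\top$, so the condition holds with equality at $\alpha=1$. A numerical sanity check (the particular $\theta$ and $w$ values are arbitrary illustrations):

```python
import numpy as np

theta = np.array([1.05, 0.95, 1.02])           # illustrative price-ratio vector
w = np.array([0.3, 0.3, 0.4])                  # a point in the simplex

g = -theta / (theta @ w)                       # gradient of f(w) = log(1 / theta.w)
H = np.outer(theta, theta) / (theta @ w) ** 2  # Hessian of f at w

# Hessian equals grad grad^T, so H >= alpha * g g^T holds (with equality) for alpha = 1
assert np.allclose(H, np.outer(g, g))
```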

Comparison: A twice-differentiable function is $\lambda$-SC iff $\nabla^2 f(w)\succeq \lambda \mathbf{I}_d$, i.e., $\nabla^2 f(w)$ is PD.

A twice-differentiable function is convex iff $\nabla^2 f(w)\succeq 0$.

So here we see (given the bounded-gradient Assumption 2) that $\lambda$-SC $\subseteq\alpha$-exp-concave $\subseteq$ convex.

Corollary: If a function is $\lambda$-SC, then it is also $\frac{\lambda}{G^2}$-exp-concave.

Theorem: For an $\alpha$-exp-concave function $f$, suppose Assumptions 1 and 2 hold. Then $\forall x,y \in \mathcal{K}$, $$f(y) \geq f(x) + \nabla f(x)^\top (y-x) + \frac{\beta}{2}(y-x)^\top\nabla f(x)\nabla f(x)^\top(y-x)$$ where $\beta=\frac{1}{2}\min(\frac{1}{GD},\alpha)$.

Algorithm (Online Newton Step):

Initialize: $A_0=\epsilon\mathbf{I}_d$, $w_1\in \mathcal{K}$;

For $t=1,\dots, T$ do:

  1. $A_t=A_{t-1}+\nabla f_t(w_t)\nabla f_t(w_t)^\top$
  2. $w_{t+1}'=w_t-\eta_t A_{t}^{-1}\nabla f_t(w_t)$
  3. $w_{t+1}=\underset{w\in\mathcal{K}}{\arg\min}\,(w_{t+1}'-w)^\top A_{t}(w_{t+1}'-w)$

End For
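The ONS loop can be sketched as follows. For simplicity the projection is taken to be inactive (i.e., $\mathcal{K}$ is assumed large enough that the iterates stay inside it); the quadratic loss and parameter values are illustrative assumptions:

```python
import numpy as np

def ons(grad, T, w1, eps, eta):
    """Online Newton Step sketch with an inactive projection (K assumed large).

    grad(t, w) -> gradient of f_t at w (hypothetical loss oracle)
    """
    d = len(w1)
    A = eps * np.eye(d)                      # A_0 = eps * I_d
    w = np.asarray(w1, dtype=float)
    iterates = [w.copy()]
    for t in range(1, T + 1):
        g = grad(t, w)
        A = A + np.outer(g, g)               # A_t = A_{t-1} + g_t g_t^T
        w = w - eta * np.linalg.solve(A, g)  # w_{t+1}' = w_t - eta * A_t^{-1} g_t
        iterates.append(w.copy())            # A_t-norm projection omitted here
    return iterates

# Illustration: f_t(w) = 0.5 * ||w - c||^2, so grad f_t(w) = w - c
c = np.array([1.0])
iterates = ons(lambda t, w: w - c, T=50, w1=np.zeros(1), eps=1.0, eta=1.0)
```

With a fixed quadratic loss the iterates converge to the minimizer $c$, since $A_t$ stops growing once the gradients shrink.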

Proof (idea): Recall the proof from last lecture. For $\lambda$-SC losses, the regret decomposition contains the term $$\frac{1}{2}\Vert w_t - w^* \Vert_2^2\left( \frac{1}{\eta_t}-\frac{1}{\eta_{t-1}} \right)-\frac{\lambda}{2} \Vert w_t - w ^* \Vert_2^2,$$ and we use the negative term to cancel the positive one by setting $\eta_t=\frac{1}{\lambda t}$, so that $\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}=\lambda t-\lambda(t-1)=\lambda$.

Here the exp-concavity lemma yields the negative term $$-\frac{\beta}{2}(w_t - w^{*} )^{\top}\nabla f_t(w_t)\nabla f_t(w_t)^{\top}(w_t - w^{*} ),$$ which is used to cancel the positive term $$\frac{1}{2}(w_t-w^{*})^\top\left(\frac{1}{\eta_t}A_t-\frac{1}{\eta_{t-1}}A_{t-1}\right)(w_t-w^{*})$$

with $\displaystyle A_t=\sum_{i=1}^t\nabla f_i(w_i)\nabla f_i(w_i)^\top$ and set $\eta_t=\frac{1}{\beta}$.

Theorem: Suppose all loss functions are $\alpha$-exp-concave, and Assumptions 1 and 2 hold. Let $\epsilon=\frac{1}{\beta^2D^2}$ and $\eta_t=\frac{1}{\beta}$, then the regret of the Online Newton Step algorithm is bounded by $$\text{Reg}_T^{\text{ONS}}\leq 2\left(\frac{1}{\alpha}+GD\right)d\log T=\mathcal{O}\left(\frac{d\log T}{\alpha}\right)$$

To summarize the online convex optimization so far:

| | Algorithm | Upper Bound | Lower Bound |
|---|---|---|---|
| Convex | OGD with $\eta_t=\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$ | $\mathcal{O}(\sqrt{T})$ | $\Omega(\sqrt{T})$ |
| $\alpha$-exp-concave | ONS | $\mathcal{O}\left(\frac{d\log{T}}{\alpha}\right)$ | $\Omega\left(\frac{d\log{T}}{\alpha}\right)$ |
| $\lambda$-SC | OGD with $\eta_t=\mathcal{O}\left(\frac{1}{\lambda t}\right)$ | $\mathcal{O}\left(\frac{\log T}{\lambda}\right)$ | $\Omega\left(\frac{\log T}{\lambda}\right)$ |

Better Bounds

Achieving data-dependent bounds

Example 1: Gradient-dependent bound: $$\mathcal{O}\left(D\sqrt{\sum_{t=1}^T \Vert \nabla f_t(w_t)\Vert _2^2}\right)$$

$$ \begin{cases} \text{Worst case:} & \mathcal{O}\left(DG\sqrt{T}\right), \text{ since } \displaystyle\sum_{t=1}^T \Vert \nabla f_t(w_t)\Vert_2^2 \leq G^2T\\ \text{Tighter:} & \text{whenever } \displaystyle\sum_{t=1}^T \Vert \nabla f_t(w_t)\Vert_2^2 \ll G^2T \end{cases} $$

Example 2: Small-loss bound: $$\mathcal{O}\left(\sqrt{\sum_{t=1}^T f_t(w^*)}\right)$$

Achieved by $$\text{OGD}\left( \eta_t=\frac{1}{\sqrt{\displaystyle\sum_{i=1}^t \Vert \nabla f_i(w_i)\Vert_2^2}} \right)$$
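This self-tuning step size can be sketched as follows (the quadratic loss and unit-ball feasible set are illustrative assumptions):

```python
import numpy as np

def adaptive_ogd(grad, project, w1, T):
    """OGD sketch with self-tuning step eta_t = 1 / sqrt(sum_{i<=t} ||g_i||_2^2)."""
    w = np.asarray(w1, dtype=float)
    sq_sum = 0.0
    for t in range(1, T + 1):
        g = grad(t, w)
        sq_sum += float(g @ g)               # running sum of squared gradient norms
        eta = 1.0 / np.sqrt(sq_sum) if sq_sum > 0 else 0.0
        w = project(w - eta * g)
    return w

# Illustration: f_t(w) = 0.5 * ||w - c||^2 on the unit ball, with c inside the ball
c = np.array([0.5, 0.0])
project_ball = lambda w: w if np.linalg.norm(w) <= 1 else w / np.linalg.norm(w)
w_final = adaptive_ogd(lambda t, w: w - c, project_ball, np.zeros(2), 100)
```

Note the step size needs no knowledge of $G$ or $T$ in advance; it adapts to the gradients actually observed.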

If the loss functions are smooth and nonnegative, the gradient-dependent bound implies the small-loss bound, via the self-bounding property $\Vert\nabla f_t(w)\Vert_2^2 = \mathcal{O}(f_t(w))$.

Other methods:

  • First-order methods: OGD, ONS, OMD

  • Learning perspective: $$w_{t+1}=\underset{w\in\mathcal{K}}{\arg\min}\sum_{i=1}^t f_i(w)$$

  • Follow the Leader (FTL): At each step pick $w_{t+1}=\displaystyle\underset{w\in\mathcal{K}}{\arg\min}\sum_{i=1}^t f_i(w)$ (assumes oracle access to all previous functions). Can suffer $\Omega(T)$ regret in the worst case.

  • Follow the Regularized Leader (FTRL): At each step pick $w_{t+1}=\displaystyle\underset{w\in\mathcal{K}}{\arg\min}\sum_{i=1}^t f_i(w)+R(w)$ with, e.g., $R(w)=\lambda\Vert w\Vert_2^2$. Achieves $\mathcal{O}(\sqrt{T})$ regret.

  • Follow the Perturbed Leader
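For linear losses $f_i(w)=g_i^\top w$ and the quadratic regularizer $R(w)=\lambda\Vert w\Vert_2^2$, the FTRL iterate has a closed form: the Euclidean projection of $-\frac{1}{2\lambda}\sum_{i\le t} g_i$ onto $\mathcal{K}$. A minimal sketch (the gradient sequence, $\lambda$, and unit-ball feasible set are illustrative assumptions):

```python
import numpy as np

def ftrl_quadratic(gs, lam, project):
    """FTRL sketch for linear losses f_i(w) = g_i . w with R(w) = lam * ||w||_2^2.

    The regularized leader is argmin_w sum_i g_i . w + lam * ||w||^2, which is
    the Euclidean projection of -(sum_i g_i) / (2 * lam) onto K.
    """
    cum = np.zeros(gs.shape[1])
    iterates = [project(cum.copy())]           # w_1 minimizes R alone
    for g in gs:
        cum += g
        iterates.append(project(-cum / (2.0 * lam)))
    return iterates

# Illustration: the same gradient g = (1, 0) every round, K = unit ball, lam = 1
project_ball = lambda w: w if np.linalg.norm(w) <= 1 else w / np.linalg.norm(w)
gs = np.tile(np.array([1.0, 0.0]), (4, 1))
iterates = ftrl_quadratic(gs, 1.0, project_ball)
```

The regularizer keeps the iterates stable: each $w_{t+1}$ moves only as far as the averaged gradients push it, which is what removes FTL's worst-case oscillation.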
