If $\eta = \dfrac{D}{G\sqrt{T}}$, then $R_{T}^{P\text{-}OGD} = O(DG\sqrt{T})$, where $D$ is an upper bound on the diameter of the decision set and $G$ is an upper bound on the norm of the gradients.
In the worst case, the FTL algorithm suffers $R_T^{FTL}=\Theta(T)$.
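To see the linear regret concretely, here is a small Python sketch (the instance is the standard worst-case construction: linear losses on $W=[-1,1]$ whose sign alternates, so FTL keeps jumping to the wrong endpoint; the function name is ours):

```python
import numpy as np

def ftl_alternating(T):
    """FTL on W = [-1, 1] with linear losses f_t(w) = z_t * w,
    where z = (0.5, -1, 1, -1, 1, ...): the classic Theta(T) example."""
    z = np.empty(T)
    z[0] = 0.5
    z[1:] = [(-1) ** t for t in range(1, T)]   # -1, 1, -1, ...
    cum, ftl_loss = 0.0, 0.0
    w = 0.0                                     # arbitrary first play (empty history)
    for t in range(T):
        ftl_loss += z[t] * w
        cum += z[t]
        w = -np.sign(cum) if cum != 0 else 0.0  # argmin over [-1,1] of cum * w
    best_fixed = -abs(z.sum())                  # loss of best fixed w in hindsight
    return ftl_loss - best_fixed                # regret

print(ftl_alternating(1000))  # -> 999.5, i.e. regret linear in T
```

After the first round, FTL incurs loss $1$ every round while the best fixed point loses almost nothing, so the regret is about $T$.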
This lecture
Algorithm 3. Follow the Regularized Leader (FTRL)
$$w_{t+1} = argmin_{w \in W} \sum_{i=1}^{t}f_i(w) + \eta\cdot R(w)$$
where $R(w)$ is the regularization term, in many cases $R(w) = \frac{1}{2}||w||_2^2$.
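As a minimal sketch of the update (our own illustration, not from the lecture): with $R(w) = \frac{1}{2}\|w\|_2^2$, linear losses $f_t(w) = \langle g_t, w\rangle$, and the unconstrained case $W = \mathbb{R}^d$, the argmin has a closed form, so FTRL only needs the running gradient sum:

```python
import numpy as np

def ftrl_quadratic(grads, eta):
    """FTRL with R(w) = 0.5 * ||w||^2 on linear losses f_t(w) = <g_t, w>.
    Unconstrained (W = R^d), so argmin_w sum_i <g_i, w> + eta * R(w)
    has the closed form w_{t+1} = -(1/eta) * sum_i g_i."""
    g_sum = np.zeros_like(grads[0])
    plays = [np.zeros_like(grads[0])]   # w_1 minimizes eta * R(w) alone
    for g in grads:
        g_sum = g_sum + g
        plays.append(-g_sum / eta)      # w_{t+1}
    return plays

# Example: two rounds of linear losses in R^2
ws = ftrl_quadratic([np.array([1.0, 0.0]), np.array([0.0, -2.0])], eta=2.0)
print(ws[-1])  # -> [-0.5  1. ]
```

For a general constrained $W$ or non-quadratic $R$, the argmin is a genuine optimization problem, which is exactly the efficiency concern raised in the remark at the end of this lecture.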
We would like to analyze the algorithm by bounding its regret. First we need to do some preparation.
Assume without loss of generality that $R(w)$ is 1-strongly convex w.r.t. some norm $||\cdot||$.
Recall that for any $\lambda>0$, we say a differentiable function $l: W → \mathbb{R}$ is $\lambda$-strongly convex w.r.t. $||\cdot||$ if
$$\forall w, w^{'} \in W, l(w^{'})\geq l(w) + \langle \nabla l(w), w^{'} - w \rangle + \dfrac{\lambda}{2}||w^{'} - w||^2$$
Here $||\cdot||_*$ denotes the dual norm of $||\cdot||$; it is the norm in which we measure gradients.
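A quick numerical sanity check (our own, not from the lecture) that the common choice $R(w) = \frac{1}{2}\|w\|_2^2$ satisfies the 1-strong-convexity inequality w.r.t. $\|\cdot\|_2$; for this quadratic it in fact holds with equality:

```python
import numpy as np

rng = np.random.default_rng(0)

def R(w):
    return 0.5 * np.dot(w, w)   # R(w) = ||w||_2^2 / 2

def gradR(w):
    return w

# Check l(w') >= l(w) + <grad l(w), w' - w> + (lambda/2) * ||w' - w||^2
# with lambda = 1; for this quadratic R it holds with equality.
for _ in range(1000):
    w, wp = rng.normal(size=3), rng.normal(size=3)
    lhs = R(wp)
    rhs = R(w) + gradR(w) @ (wp - w) + 0.5 * np.dot(wp - w, wp - w)
    assert abs(lhs - rhs) < 1e-9
print("1-strong convexity of ||w||^2/2 holds (with equality)")
```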
**Remark:** $D_R^2$ is a constant depending on $R(w)$ (it bounds $R$ over $W$), and $G$ is a constant bounding the gradient norms $||\nabla f_t||_*$.
Theorem:
Suppose Assumptions 1 and 2 hold. Then FTRL with $\eta = \dfrac{G\sqrt{T}}{D_R}$ achieves:
$$R_{T}^{FTRL} = O\left(D_RG\sqrt{T}\right)$$
Proof: We first define a new OCO problem, which features:
$f_0(w) = \eta\cdot R(w)$, while $f_1, f_2, \ldots, f_T$ remain the same. We call this problem $P'$ and the original OCO problem $P$.
Apply FTL on $P'$:
$$w_{t+1} = argmin_{w \in W} \sum_{i=0}^{t}f_i(w)$$
$$= argmin_{w \in W} \sum_{i=1}^{t}f_i(w) + \eta\cdot R(w)$$
which shows us that applying FTL on $P'$ is equivalent to applying FTRL on $P$.
Now let $$F_t(w) = \sum_{i=1}^tf_i(w)+ \eta R(w),$$
it is easy to check that $F_t(w)$ is $\eta$-strongly convex, since each $f_i$ is convex and $\eta R$ is $\eta$-strongly convex.
Note that by definition $w_{t+1} = argmin_{w \in W}F_{t}(w)$. Since $F_t$ is $\eta$-strongly convex and $w_{t+1}$ is its minimizer, $\forall w' \in W$ we have
$$F_t(w') \geq F_t(w_{t+1}) + \dfrac{\eta}{2}||w' - w_{t+1}||^2.$$
Combining these facts, the standard FTRL analysis yields $R_{T}^{FTRL, P} \leq \eta D_R^2 + O\!\left(\frac{TG^2}{\eta}\right)$. Plugging in $\eta = \frac{G\sqrt{T}}{D_R}$ shows that $R_{T}^{FTRL, P} = O\Big(D_R G \sqrt{T}\Big)$, which completes the proof.
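The theorem can be checked empirically. Below is a hypothetical instance of ours: linear losses with $\pm 1$ coordinates on the unit Euclidean ball, $R(w) = \frac{1}{2}\|w\|_2^2$ (so $D_R^2 = \frac{1}{2}$, $G = \sqrt{d}$), and $\eta$ tuned as in the theorem; the realized regret stays within a constant factor of $D_R G \sqrt{T}$:

```python
import numpy as np

def ftrl_ball(grads, eta):
    """FTRL with R(w) = ||w||^2 / 2 over the unit ball, linear losses <g_t, w>.
    For this R and W, the argmin is the projection of -(sum_i g_i) / eta."""
    g_sum = np.zeros_like(grads[0])
    w, total = np.zeros_like(grads[0]), 0.0
    for g in grads:
        total += g @ w               # loss f_t(w_t) = <g_t, w_t>
        g_sum = g_sum + g
        w = -g_sum / eta
        n = np.linalg.norm(w)
        if n > 1.0:
            w = w / n                # project back onto the unit ball
    best = -np.linalg.norm(g_sum)    # loss of the best fixed w in hindsight
    return total - best              # regret

rng = np.random.default_rng(1)
T, d = 2000, 5
grads = list(rng.choice([-1.0, 1.0], size=(T, d)))  # ||g_t||_2 = sqrt(d) =: G
G, D_R = np.sqrt(d), np.sqrt(0.5)   # D_R^2 = max_W R(w) = 1/2 on the unit ball
eta = G * np.sqrt(T) / D_R          # tuned as in the theorem
print(ftrl_ball(grads, eta) / (D_R * G * np.sqrt(T)))  # ratio stays O(1)
```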
Remark:
In the FTRL algorithm, finding $w_{t+1}$ may not be easy, since it requires solving a potentially hard optimization problem in every round. To make the algorithm more efficient, we can replace each $f_t$ by its first-order approximation
$$f_t(w) \approx f_t(w_t) + \langle \nabla f_t(w_t), w - w_t \rangle,$$
so that the update becomes
$$w_{t+1} = argmin_{w \in W} \sum_{i=1}^{t}\langle \nabla f_i(w_i), w \rangle + \eta\cdot R(w),$$
which is much easier since the first term is just a linear function in $w$. In the next lecture, we will show that we actually don't lose much using this approximation.
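A small sketch of linearized FTRL on a hypothetical instance of ours (1-D quadratic losses $f_t(w) = \frac{1}{2}(w - z_t)^2$, unconstrained, $R(w) = \frac{1}{2}w^2$): after linearization, only the running gradient sum is needed, and the iterates still approach the best fixed point:

```python
import numpy as np

def linearized_ftrl(z, eta):
    """Linearized FTRL (unconstrained, R(w) = w^2 / 2) on the 1-D quadratic
    losses f_t(w) = 0.5 * (w - z_t)^2. Replacing f_t by its first-order
    approximation <f_t'(w_t), w> means only the running gradient sum matters:
    w_{t+1} = -(sum_i f_i'(w_i)) / eta."""
    g_sum, w, loss = 0.0, 0.0, 0.0
    for zt in z:
        loss += 0.5 * (w - zt) ** 2
        g_sum += w - zt              # gradient of f_t at the current play w_t
        w = -g_sum / eta             # closed-form linearized-FTRL update
    return loss, w

loss, w_final = linearized_ftrl(np.ones(200), eta=np.sqrt(200))
print(w_final)  # the iterates approach the best fixed point w = 1
```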
Homework: Tuning Parameters
We are going to imagine we have some algorithm $\mathcal{A}$ with a performance bound that depends on some input values (which cannot be adjusted) and some tuning parameters (which can be optimized). We will use Greek letters ($\alpha, \eta, \zeta,$ etc.) for the tuning parameters and capital letters ($T, D, N$, etc.) for inputs. We would like the bound to be the tightest possible, up to multiplicative constants. For each of the following, tune the parameters to obtain the optimal bound. Using big-O notation is fine to hide constants, but you must not ignore the dependence on the input parameters. For example, assuming $M,T > 0$, imagine we have a performance guarantee of the form:
$$\text{Performance}(\mathcal{A}; M,T, \epsilon) \leq M \epsilon + \frac{T}{\epsilon},$$
and we know $\epsilon > 0$. Then, optimizing the above expression with respect to the free parameter, we can set $\epsilon = \sqrt{\frac{T}{M}}$. With this value we obtain $\text{Performance}(\mathcal{A}; M,T, \epsilon) \leq 2\sqrt{MT} = O(\sqrt{MT})$.
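The worked example above can be verified numerically (a sanity check of ours, with arbitrary values for the inputs $M$ and $T$): a brute-force search over $\epsilon$ confirms that $\epsilon = \sqrt{T/M}$ minimizes the bound, where it equals $2\sqrt{MT}$.

```python
import numpy as np

# Sanity check of the worked example: the bound M * eps + T / eps
# is minimized at eps = sqrt(T / M), where it equals 2 * sqrt(M * T).
M, T = 7.0, 1000.0                              # arbitrary input values
bound = lambda eps: M * eps + T / eps
eps_star = np.sqrt(T / M)
grid = np.linspace(0.01, 100.0, 100_000)        # brute-force search over eps
assert bound(eps_star) <= bound(grid).min() + 1e-9
print(bound(eps_star), 2.0 * np.sqrt(M * T))    # both equal 2 * sqrt(M * T)
```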
NOTE: We didn't have to make up this problem; we actually pulled all the bounds below from different papers!
1. $\textnormal{Performance}(\mathcal{A}; T, \eta) \leq \frac T \eta + \exp(\eta)$. (Note: you needn't obtain the optimal choice of $\eta$ here or the tightest possible bound, but try to tune in order to get a bound that is $o(T)$ -- i.e. the bound should grow strictly slower than linear in $T$.)
2. $\textnormal{Performance}(\mathcal{A}; N, T, \eta, \epsilon) \leq \frac{ T \epsilon }{\eta} + \frac{N}{\epsilon} + T \eta$