生物医学工程 | 机器学习 - 凸优化

目的：学习optimize objective的有效方法——gradient descent。该方法基于convex function

Convex Optimization

本节目标是求解下面optimization problem

其中 $\mathscr{C}\subset \mathbb{R}^d$ , $F:\mathbb{R}^d \rightarrow \mathbb{R}$ 。C是convex set，F是convex function

Convex Function和Convex set的定义

Convex Function定义

Convex Set定义

Convex Function的例子

$l_2$ -norm

F(w) = \|x\|_2^2=x^\top x

Logistic

F(w;x,y)=\log(1+\exp(-yw^\top x))

Mean of Convex Functions

F(w) = \frac{1}{m}\sum_{i=1}^mF_i(w)

Convex Function的first and second order

可以用这俩来判断是不是convex function，其中second order最常用

First Order

for all$ \omega, \omega’ \in \mathbb{R}^d, $

F(\omega')\geq F(\omega)+\nabla F(\omega)^\top (\omega'-\omega)

解释：该函数在任何一点上都位于切线之上

Second Order（常用于判断是不是convex function）

for all $\omega \in \mathbb{R}^d$ ,

\nabla^2F(\omega)\succeq0

或者说Hessian矩阵 $\nabla^2F(\omega)$ 是 semi-definite (PSD)的

解释：函数的曲率总是非负的

Theorm定理

当w满足 $\nabla F(\omega)=0$ 的时候，w是F的global minimum

证明：

F(w')\geq F(w)+\nabla F(w)^\top(w'-w)\Rightarrow F(w')\geq F(w)

Convex Function的三条bound

L-smooth Function (upper bound)

For all $w, w' \in \mathbb{R}^d$
$F(w') \leq F(w)+\nabla F(w)^\top(w'-w)+\frac{L}{2}\|w-w'\|^2_2$
Convex Function (lower bound)

For all $w, w' \in \mathbb{R}^d$
$F(w') \geq F(w)+\nabla F(w)^\top(w'-w)$
$\mu$ - Strong Convex Function (lower bound)

For all $w, w' \in \mathbb{R}^d$
$F(w') \geq F(w)+\nabla F(w)^\top(w'-w)+\frac{\mu}{2}\|w-w'\|^2_2$

Gradient Descent

算法本身（GD）

其中 $\eta_t$ 叫做learning rate，表示每次移动多元。其中stopping condition我们一般设置为 $F(\omega_t)\leq \epsilon$ 或者 $\|\nabla F(\omega_t)\|_2\leq \epsilon$

选择Learning Rate / Step Size

过大的learning rate会使得结果diverge，甚至会产生negative progress

过小的learning rate会会需要花费很多时间才能converge

Gradient Descent for Smooth Functions

L-Smoothness Function

for all$ \omega, \omega’ \in \mathbb{R}^d $,

F(w')\leq F(w)+\nabla F(w)^\top (w'-w)+\frac{L}{2}\|w-w'\|^2_2

等价于说 $\nabla F$ 是L-Lipschitz

\|\nabla F(w)-\nabla F(w')\|_2\leq L\|w-w'\|_2

用smoothness对gradient descent进行解释

为了 $\min F$ at $w$ ，我们选择min upper bound（L-smoothness），也就是下面这个式子

\min_{w'} F(w)+\nabla F(w)^\top (w'-w)+\frac{L}{2}\|w'-w\|_2^2

通过令导数等于0，我们可以求出下一步的w即w’

w' = w-\frac{1}{L}\nabla F(w)

通过对比算法GD，我们可以看出来这就是gradient step with learning rate$ η = 1/L $的时候

即根据upper bound，我们求upper bound min的时候算出来的w’所对应的是upper bound的最低点，其对应的F(w’)一定比F(w)小，从而实现gradient descent

Converge的证明

定理

解释

假设开始的时候 $w_1=0$ 且 $\|\omega_*\|_2=1$ ，则在T iterations之后可以得到

F(w_{T+1})-F(\omega_*)\leq \frac{L}{2T}

这表示了在 $T = \frac{L}{2\epsilon}=O(1/\epsilon)$ step之后，我们可以得到 $F(w_{T+1})-F(w_*)\leq \epsilon$ 。所以我们说gradient descent has a convergence rate of $O(1/T)$

证明

具体证明见pdf（Lec10-2）

Gradient Descent on Smooth and Strongly Convex Functions

$\mu$ - Strongly Convex Function

for all$ \omega, \omega’ \in \mathbb{R}^d $,

F(w)的lower bound

F(w')\geq F(w)+\nabla F(w)^\top (w'-w)+\frac{\mu}{2}\|w-w'\|^2_2

Converge证明

定理

解释

假设开始的时候 $w_1=0$ 且 $\|\omega_*\|_2=1$ ，则在T iterations之后可以得到

\|w_{T+1}-w_*\|_2^2\leq (1-\frac{\mu}{L})^\top

这表示了在 $T = \frac{L}{\mu}\log(1/\epsilon)=O(log(1/\epsilon))$ step之后，我们可以得到 $\|w_{T+1}-w_*\|_2\leq \epsilon$ 。所以我们说gradient descent has a convergence rate of $O(\exp(-T))$

附录

GD converge with L smooth和F的证明

Step1

根据smoothness的式子1，我们可以知道在t时刻

F(w_{t+1})-F(w_t)\leq \nabla F(w_t)^\top(w_{t+1}-w_t)+\frac{L}{2}\|w_{t+1}-w_t\|^2_2

将 $\nabla F(w_t) = L(w_t-w_{t+1})$ 带入式（4）

\begin{aligned} F(w_{t+1})-F(w_t) &\leq L(w_{t}-w_{t+1})^\top(w_{t+1}-w_t)+\frac{L}{2}\|w_{t+1}-w_t\|^2_2\\ &= -L(w_{t+1}-w_{t})^\top(w_{t+1}-w_t)+\frac{L}{2}\|w_{t+1}-w_t\|^2_2\\ & =-L \|w_{t+1}-w_t\|_2^2+\frac{L}{2}\|w_{t+1}-w_t\|^2_2\\ & =-\frac{L}{2}\|w_{t+1}-w_t\|^2_2 \end{aligned}

Step2

根据convex function的式子2和 $\nabla F(w_t) = L(w_t-w_{t+1})$ ，我们可以得到

\begin{aligned} F(w_*) &\geq F(w)+\nabla F(w)^\top(w_*-w)\\ \Rightarrow F(w_t)-F(w_*)&\leq L(w_t-w_{t-1})^\top(w_t-w_*) \end{aligned}

又因为

\begin{aligned} \|w_{t+1}-w_*\|_2^2&=\|w_{t+1}-w_t+w_t-w_*\|_2^2\\ & = \|w_{t+1}-w_t\|_2^2+\|w_{t}-w_*\|_2^2+2(w_{t+1}-w_t)^\top(w_t-w_*) \end{aligned}

将式子（7）带入式子（6），我们可以得到

\begin{aligned} F(w_t)-F(w_*)&\leq \frac{L}{2}(\|w_{t+1}-w_t\|_2^2+\|w_t-w_*\|_2^2-\|w_{t+1}-w_*\|_2^2) \end{aligned}

将式子（5）带入式子（8）

\begin{aligned} F(w_{t+1})-F(w_t)+F(w_t)-F(w_*)&\leq \frac{L}{2}(\|w_{t+1}-w_t\|_2^2 + \|w_t-w_*\|_2^2 - \|w_{t+1}-w_*\|_2^2)\\ &+(-\frac{L}{2}\|w_{t+1}-w_t\|_2^2)\\ F(w_{t+1})-F(w_*)&\leq\frac{L}{2}(\|w_t-w_*\|_2^2 - \|w_{t+1}-w_*\|_2^2) \end{aligned}

Step3

根据式9，从tau+1累加到tau可以得到下式

\begin{aligned} &F(w_{\tau+1})-F(w_*)+F(w_{\tau+1})-F(w_*)+...+F(w_{\tau+1})-F(w_*) = \tau(F(w_{\tau+1})-F(w_*))\\ &\leq F(w_{\tau+1})-F(w_*)+F(w_\tau)-F(w_*)+...+F(w_1)-F(w_*)=\sum_{t=1}^\tau(F(w_{t+1})-F(w_*))\\ &\leq \frac{L}{2}(\|w_\tau-w_*\|^2_2-\|w_{\tau+1}-w_*\|_2^2+\|w_{\tau-1}-w_*\|^2_2-\|w_{\tau}-w_*\|_2^2)+...+\|w_1-w_*\|_2^2-\|w_2-w_*\|_2^2)\\ &\leq \frac{L}{2}(\|w_1-w_*\|_2^2-\|w_{\tau+1}-w_*\|_2^2) \\ &\leq\frac{L}{2}\|w_1-w_*\|_2^2 \end{aligned}

带入 $w_1=0$ 和 $\|w_*\|_2^2=1$ ，可以求出来经过T次迭代之后（即 $\tau=T$ ），

F(w_{T+1})-F(w_*)\leq \frac{L}{2T}

这表示经过 $T=\frac{L}{2\epsilon}$ 次迭代，函数F收敛，即 $F(w_{T+1})-F(w_*)\leq \epsilon$

不同梯度下降算法的比较（非课程要求）

批量梯度下降｜标准梯度下降｜Batch GD

每次使用所有的样本进行计算梯度（i=1～n）

f(x) = \frac{1}{n}\sum_{i=1}^nf_i(x)\\ \nabla f(x)=\frac{1}{n}\sum_{i=1}^n \nabla f_i(x)

优点：梯度更新准确

缺点：只有convex function才可以保证最后得到的是全剧最优；收敛速度很慢

随机梯度下降｜Stochastic GD

每次下降只用一个样本点进行计算（i是1-n的任意值，随机选取的样本点）

f(x) = f_i(x)\\ \nabla f(x)=\nabla f_i(x)

优点：训练速度快

缺点：容易在最优的地方反复横跳而到不了最优点，只能在最优点周围

小批量梯度下降｜Mini-Batch GD

每次下降选用一批样本点进行计算（从n个样本点里面随机选取k个样本）

f(x) = \frac{1}{k}\sum_{i=1}^kf_i(x)\\ \nabla f(x)=\frac{1}{k}\sum_{i=1}^k \nabla f_i(x)

优点：训练速度快

缺点：选择平均梯度最小的方向

声明：此blog内容为上课笔记，仅为分享使用。部分图片和内容取材于课本、老师课件、网络。如果有侵权，请联系aursus.blog@gmail.com删除。