Linear Regression

Input Space

\mathcal{X}=\mathbb{R}^d

Label Space / Output Space ($y$ takes continuous real values)

\mathcal{Y}=\mathbb{R}

Hypothesis Class $\mathcal{F}$

\mathcal{F}=\{x \mapsto w^\top x+b \mid w\in\mathbb{R}^d,\ b\in\mathbb{R}\}

Loss Function ($\ell_2$-loss / Square Loss)

\ell(f(x),y)=(f(x)-y)^2

Loss Function (Absolute Loss)

\ell(f(x),y)=|f(x)-y|
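
As a quick illustration, here is a minimal NumPy sketch evaluating both losses on a few hypothetical predictions (the numbers are made up purely for the example):

```python
import numpy as np

# Hypothetical predictions f(x) and labels y, purely for illustration
f_x = np.array([2.5, 0.0, 1.8])
y = np.array([3.0, -0.5, 2.0])

square_loss = (f_x - y) ** 2      # l2 / square loss per example
absolute_loss = np.abs(f_x - y)   # absolute loss per example
print(square_loss, absolute_loss)
```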

Empirical Risk Minimization (convex objective)

\hat R(w)=\frac{1}{m}\sum_{i=1}^m(y_i-w^\top x_i)^2

Since the empirical risk is convex, we can find its minimum by taking the gradient and setting it to zero. The ERM solution $\hat w$ is (provided $X^\top X$ is invertible):

\hat w=(X^\top X)^{-1}X^\top Y
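
For reference, the derivation behind this formula, absorbing the bias $b$ into $w$ via a constant feature and writing $X\in\mathbb{R}^{m\times d}$ for the matrix whose rows are $x_i^\top$ and $Y$ for the vector of labels:

\nabla\hat R(w)=-\frac{2}{m}X^\top(Y-Xw)=0 \;\Longrightarrow\; X^\top X\,w=X^\top Y \;\Longrightarrow\; \hat w=(X^\top X)^{-1}X^\top Y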

Regularization

Regularization: when $X^\top X$ is singular or nearly singular, the ERM solution is not unique or can have very large weights. To address this, we add a penalty term to the objective.

\hat G(w)=\hat R(w)+\lambda\,\psi(w)

Here, $\psi(w)$ is commonly chosen to be $\|w\|_1$ or $\|w\|_2^2$.

Ridge Regression | $\psi(w)=\|w\|_2^2$

\hat G(w)=\frac{1}{m}\sum_{i=1}^m(y_i-w^\top x_i)^2+\lambda\|w\|_2^2

\hat w_\lambda=(X^\top X+\lambda m I)^{-1}X^\top Y
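
A minimal NumPy sketch of the ridge closed form above (the data is randomly generated just for illustration; with lam = 0 it reduces to the ordinary least-squares solution, provided $X^\top X$ is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))                # design matrix (bias absorbed into a constant feature if needed)
w_true = rng.normal(size=d)
Y = X @ w_true + 0.1 * rng.normal(size=m)  # noisy labels

lam = 0.1  # regularization strength lambda

# Ridge closed form: w_lambda = (X^T X + lambda * m * I)^{-1} X^T Y
# (np.linalg.solve is preferred over forming the inverse explicitly)
w_ridge = np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ Y)
```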

LASSO Regression | $\psi(w)=\|w\|_1$ | used when we require sparse solutions

\hat G(w)=\frac{1}{m}\sum_{i=1}^m(y_i-w^\top x_i)^2+\lambda\|w\|_1
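
Unlike ridge, the $\ell_1$ penalty has no closed-form minimizer, so $\hat G$ is minimized iteratively. Below is a minimal proximal-gradient (ISTA) sketch for the objective above; the step size and iteration count are arbitrary illustrative choices, and in practice one would usually reach for a library solver (e.g. scikit-learn's Lasso, whose objective is scaled slightly differently).

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iters=500):
    """Minimize (1/m) * ||Y - X w||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    m, d = X.shape
    # Step size 1/L, where L = (2/m) * lambda_max(X^T X) is the gradient's Lipschitz constant
    L = 2.0 / m * np.linalg.eigvalsh(X.T @ X).max()
    eta = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / m) * X.T @ (X @ w - Y)   # gradient of the smooth squared-error term
        w = soft_threshold(w - eta * grad, eta * lam)
    return w
```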


Disclaimer: This blog content is intended as class notes and is solely for sharing purposes. Some images and content are sourced from textbooks, teacher presentations, and the internet. If there are any copyright infringements, please contact aursus.blog@gmail.com for removal.