Linear SVM

Set up

Given $m$ data points $(x_1, y_1), \dots, (x_m, y_m)$, where $y_i$ is either 1 or -1, indicating which class $x_i$ belongs to. Each $x_i$ is a $d$-dimensional real vector.

The goal of SVM is to find a maximum-margin hyperplane that separates the points $x_i$ with $y_i=1$ from those with $y_i=-1$, maximizing the distance from the hyperplane to the nearest points of both groups.

SVM Objective: Find a hyperplane that maximizes the margin.

Hyperplane Definition:

$$\omega^\top x + b = 0$$

Margin Definition: The distance of the nearest point to the hyperplane.

$$\gamma(w,b) = \min_{i\in[m]}\frac{|w^\top x_i + b|}{\|w\|_2}$$
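For concreteness, here is a minimal sketch of computing $\gamma(w,b)$ with NumPy; the toy points and the candidate $(w, b)$ are made-up values for illustration only.

```python
import numpy as np

def margin(w, b, X):
    """Geometric margin: min_i |w^T x_i + b| / ||w||_2."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Toy points and an arbitrary candidate hyperplane (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(w, b, X))   # distance from the hyperplane to its closest point
```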

Hard Margin SVMs

Assumption: the data can be completely separated by a hyperplane $\omega_*^\top x + b_* = 0$.

Objective: Find a classifier that obtains the maximum margin.

Approach 1: Geometric interpretation (wiki approach)

First, two hyperplanes are chosen for the two groups of $x_i$ according to $y_i=1$ and $y_i=-1$:

For $y_i=1$:

Hyperplane 1: $\omega^\top x - b = 1$; all data points with $y_i = 1$ lie on or above this hyperplane.

For $y_i=-1$:

Hyperplane 2: $\omega^\top x - b = -1$; all data points with $y_i = -1$ lie on or below this hyperplane.

These two hyperplanes are parallel, and the maximum-margin hyperplane $\omega^\top x - b = 0$ lies halfway between them.

Finally, all data points satisfy the following:

For $y_i=1$: $\omega^\top x_i - b \geq 1$

For $y_i=-1$: $\omega^\top x_i - b \leq -1$

Combining both, we get $y_i(\omega^\top x_i - b) \geq 1$ for all $1 \leq i \leq m$.

The optimization problem and its constraints then follow:

$$\min_{\omega,\,b}\ \|\omega\| \quad \text{subject to} \quad y_i(\omega^\top x_i - b) \geq 1 \ \text{ for } i = 1,\dots,m$$

The resulting classifier can be written with the sign function: $\operatorname{sgn}(\omega^\top x - b)$.

Approach 2: Deriving from formulas (classroom approach)

The objective can be formulated mathematically as maximizing the margin over all separating hyperplanes:

$$\max_{\omega,b}\ \gamma(\omega,b) \;=\; \max_{\omega,b}\ \min_{i\in[m]}\frac{|\omega^\top x_i+b|}{\|\omega\|_2}$$

Since separability implies $y_i(\omega^\top x_i+b) > 0$, this simplifies to:

$$\max_{\omega,b}\ \frac{1}{\|\omega\|_2}\min_{i\in[m]} y_i(\omega^\top x_i+b)$$

Rescaling $(\omega, b)$ by any positive constant describes the same hyperplane, so the solution is not unique; uniqueness is ensured by adding the normalization:

$$\min_{i\in [m]}|\omega^\top x_i + b| = 1$$

Substituting this condition into the objective gives:

$$\max_{\omega,b}\ \frac{1}{\|\omega\|_2} \quad \text{s.t.}\quad \min_{i\in[m]} y_i(\omega^\top x_i+b) = 1$$

This is equivalent to a convex quadratic optimization problem, where both the objective function and the constraints are convex:

$$\min_{\omega,b}\ \frac{1}{2}\|\omega\|_2^2 \quad \text{s.t.}\quad y_i(\omega^\top x_i+b)\geq 1 \ \text{ for } i = 1,\dots,m$$
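As a sanity check, this primal problem can be handed directly to a general-purpose solver. Below is a minimal sketch using SciPy's SLSQP on a tiny, made-up separable data set; the variable packing and the data are my own illustrative choices, not part of the derivation above.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

# Pack the decision variables as z = [w_1, ..., w_d, b]
objective = lambda z: 0.5 * np.dot(z[:d], z[:d])              # (1/2)||w||^2
constraints = [{"type": "ineq",                                # y_i (w^T x_i + b) - 1 >= 0
                "fun": lambda z, i=i: y[i] * (X[i] @ z[:d] + z[d]) - 1.0}
               for i in range(m)]

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP", constraints=constraints)
w, b = res.x[:d], res.x[d]
print("w =", w.round(3), " b =", round(b, 3), " margin =", 1.0 / np.linalg.norm(w))
```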

Finding $\omega$ and $b$

To solve the optimization problem above and find the $\omega$ and $b$ that minimize the objective, we first obtain $\omega$ through the dual formulation and then derive $b$ using the support vectors.

Obtaining $\omega$ through the Dual Formulation

Starting point: the primal problem

$$\min_{\omega,b}\ \frac{1}{2}\|\omega\|_2^2 \quad \text{s.t.}\quad y_i(\omega^\top x_i+b)\geq 1,\ i=1,\dots,m$$

which we attack through Lagrangian duality.

Step 1: write the Lagrangian (with multipliers $\alpha_i \geq 0$):

$$L(\omega,b,\alpha)=\frac{1}{2}\|\omega\|_2^2+\sum_{i=1}^m\alpha_i\bigl(1-y_i(\omega^\top x_i+b)\bigr)$$

$$\text{Dual problem:}\quad \max_{\alpha\geq 0}\ \min_{\omega,b}\ L(\omega,b,\alpha)$$

Step 2: take the partial derivatives of $L$ with respect to $\omega$ and $b$ and set them to zero:

$$\frac{\partial L}{\partial \omega}=\omega-\sum_{i=1}^m\alpha_iy_ix_i=0 \;\Rightarrow\; \omega=\sum_{i=1}^m\alpha_iy_ix_i$$

$$\frac{\partial L}{\partial b}=-\sum_{i=1}^m\alpha_iy_i=0 \;\Rightarrow\; \sum_{i=1}^m\alpha_iy_i=0$$

Step 3: substitute these back into the Lagrangian and simplify:

$$L(\alpha)=\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j$$

subject to:

$$\sum_{i=1}^m\alpha_iy_i=0,\qquad \alpha_i\geq 0$$

Step 4: express the dual problem:

$$\max_{\alpha}\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^m\alpha_iy_i=0,\ \alpha_i\geq 0$$

Step 5: solve the problem from Step 4 to obtain $\alpha$; substituting $\alpha$ back then gives $\omega$:

$$\omega=\sum_{i=1}^m\alpha_iy_ix_i$$

Obtaining $b$ through Support Vectors

Definition of a support vector: any data point with $\alpha_i > 0$; these points fall exactly on the margin boundary and satisfy $y_i(w^\top x_i+b)=1$.

Let $SV = \{i\in [m]: \alpha_i > 0\}$, the set of all $i$ for which $\alpha_i$ is greater than 0.

Then $\omega$ can be written as:

$$\omega=\sum_{i\in SV}\alpha_i y_i x_i$$

Using the complementary slackness condition from the KKT conditions, we arrive at the following equation:

$$\alpha_i\bigl(1-y_i(\omega^\top x_i+b)\bigr)=0 \quad \text{for all } i \;\Rightarrow\; y_i(\omega^\top x_i+b)=1 \ \text{ for } i\in SV$$

Hence, we obtain $b$:

$$b=\frac{1}{|SV|}\sum_{i\in SV}\bigl(y_i-\omega^\top x_i\bigr)$$

In simpler terms: each $i$ with $\alpha_i > 0$ yields a value of $b$ (their corresponding $x_i$ and $y_i$ differ), and averaging these values reduces numerical noise, giving a more reliable $b$.

Result of the DUAL:

$$\max_{\alpha\geq 0,\ \sum_i\alpha_iy_i=0}\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_j\,x_i^\top x_j,\qquad \omega=\sum_{i=1}^m\alpha_iy_ix_i,\qquad h(x)=\operatorname{sign}(\omega^\top x+b)$$
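A minimal sketch of this pipeline on made-up toy data: solve the dual numerically (here with SciPy's SLSQP, an arbitrary solver choice), recover $\omega$ from $\alpha$, and average over the support vectors to get $b$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T          # G_ij = y_i y_j x_i^T x_j

# Dual: maximize sum(alpha) - 1/2 alpha^T G alpha  <=>  minimize the negative
dual_obj = lambda a: 0.5 * a @ G @ a - a.sum()
res = minimize(dual_obj, x0=np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                                 # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                   # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                      # b averaged over the support vectors
print("alpha =", alpha.round(3), " w =", w.round(3), " b =", round(b, 3))
```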

Soft Margin SVMs

Assumption: Data cannot be entirely separated by a hyperplane.

Due to non-separability, the constraint $y_i(\omega^\top x_i+b)\geq 1$ is infeasible. We can make it feasible by adding slack variables $\xi_i$ ($\xi_i\geq 0$) that relax the constraint:

$$y_i(\omega^\top x_i+b)\geq 1-\xi_i,\qquad \xi_i\geq 0$$

Where:

  • $\xi_i=0$ signifies point $i$ is correctly classified and satisfies the large-margin constraint.
  • $0<\xi_i<1$ signifies point $i$ is correctly classified but does not meet the large-margin constraint.
  • $\xi_i>1$ signifies point $i$ is misclassified.

To keep the $\xi_i$ from growing too large, we add a penalty to the original objective:

$$\min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|_2^2+C\sum_{i=1}^m\xi_i \quad \text{s.t.}\quad y_i(\omega^\top x_i+b)\geq 1-\xi_i,\ \xi_i\geq 0$$

Here $C$ controls the size of the penalty; as $C$ grows large, the problem reduces to the hard-margin SVM.

The above equation is still a convex quadratic optimization problem. We can rewrite it as the dual:

$$\max_\alpha\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^m\alpha_iy_i=0,\ 0\leq\alpha_i\leq C$$

Following steps similar to the hard margin SVM, we can solve using the dual.
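In practice one rarely solves this QP by hand. A short sketch using scikit-learn's SVC with a linear kernel (the data and the grid of $C$ values are arbitrary choices of mine) shows how the penalty $C$ trades margin width against violations:

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, not perfectly separable toy data (illustrative values only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.5, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.5, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):                        # larger C -> closer to hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"w={clf.coef_[0].round(2)}, b={clf.intercept_[0]:.2f}")
```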

Viewing from a Loss Minimization Perspective

Soft-SVM can also be written in the form of minimizing hinge loss:

$$\ell_{\text{hinge}}(y,\hat{y})=\max(0,1-y\hat{y})=\max\bigl(0,1-y_i(\omega^\top x_i-b)\bigr)$$

If $y_i(w^\top x_i-b)\geq 1$, i.e., $x_i$ is on the correct side of the margin, the loss is 0; otherwise it is $1-y_i(\omega^\top x_i-b)$, which grows with the point's distance from the margin.

Thus, SVM can be written as (where $\lambda>0$):

$$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \max\bigl(0,1-y_i(w^\top x_i+b)\bigr)+\lambda \|w\|^2$$

Upon introducing slack variables, it becomes:

$$\min_{w,b,\xi}\ \frac{1}{m}\sum_{i=1}^m\xi_i+\lambda\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i+b)\geq 1-\xi_i,\ \xi_i\geq 0$$

Now, if we set $C=\frac{1}{2\lambda m}$, this equation becomes the soft-SVM.
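The regularized hinge-loss form also suggests a simple (sub)gradient descent solver. The sketch below is my own minimal illustration of that idea (the step size, epoch count, and data are arbitrary choices), not the dual-based method described earlier:

```python
import numpy as np

def hinge_svm_gd(X, y, lam=0.01, lr=0.1, epochs=500):
    """(Sub)gradient descent on (1/m) sum_i max(0, 1 - y_i(w^T x_i + b)) + lam * ||w||^2."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                        # points that violate the margin
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / m
        grad_b = -y[active].sum() / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hinge_svm_gd(X, y)
print(np.sign(X @ w + b))    # should match y on this toy set
```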

Summary

Hard-Margin SVM

Optimization function

$$\max_\alpha\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^m\alpha_iy_i=0,\ \alpha_i\geq 0$$

Classification vector $\omega$

$$\omega=\sum_{i=1}^m\alpha_iy_ix_i$$

Support vector

$$SV=\{i\in [m]: 0<\alpha_i\}$$

Calculation of $b$

$$b=y_i-\omega^\top x_i$$

Prediction function

$$\operatorname{sign}(\omega^\top x+b) = \operatorname{sign}\Bigl(\sum_{i=1}^m \alpha_iy_i\,x_i^\top x+b\Bigr)$$

Soft-Margin SVM

Optimization function

$$\max_\alpha\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^m\alpha_iy_i=0,\ 0\leq\alpha_i\leq C$$

Classification vector $\omega$

$$\omega=\sum_{i=1}^m\alpha_iy_ix_i$$

Support vector

$$SV=\{i\in [m]: 0<\alpha_i<C=\tfrac{1}{2m\lambda}\}$$

Calculation of $b$

$$b=y_i-\omega^\top x_i$$

Prediction function

$$\operatorname{sign}(\omega^\top x+b) = \operatorname{sign}\Bigl(\sum_{i=1}^m \alpha_iy_i\,x_i^\top x+b\Bigr)$$

Non-Linear Kernel SVM

Concept

Purpose: address non-linearly separable data that cannot be handled simply by adding slack, such as two concentric circles.

Approach: map the data to a higher-dimensional space using a feature map $\phi$, i.e., $x\mapsto \phi(x)$, where it becomes linearly separable.

Method: replace the dot product with a non-linear kernel.

Example

Linearizing data through $\phi$ to obtain a new predictor function

Using a feature map $\phi$, map the training data from the non-linear input space to a linear feature space:

$$\{(x_1,y_1),\dots,(x_m,y_m)\} \mapsto \{(\phi(x_1),y_1),\dots,(\phi(x_m),y_m)\}$$

Thus, we obtain a linear predictor function:

$$\omega^\top \phi(x)+b$$

Where $\phi(x)$ is as follows, and $d$ represents the dimensionality of $x$:

$$\phi(x) = [1,\,x_1,\dots,x_d,\,x_1^2,\,x_1x_2,\dots,x_d^2]^\top$$

Solving the optimization function using Soft-SVM’s method

Since solving directly in the feature space is too high-dimensional, we use the Soft-SVM dual, which only requires inner products:

Equivalent to working with the inner product $\phi(x)^\top \phi(x')$.

We can simplify it to:

$$\phi(x)^\top \phi(x') = 1+x^\top x'+(x^\top x')^2$$
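A quick numerical check of this identity, assuming the quadratic part of $\phi$ runs over all ordered index pairs $(i,j)$ (with other conventions the cross terms would need $\sqrt{2}$ factors):

```python
import numpy as np

def phi(x):
    """Feature map [1, x_1, ..., x_d, x_i * x_j for all ordered pairs (i, j)]."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)       # two arbitrary points in R^3
lhs = phi(x) @ phi(xp)
rhs = 1 + x @ xp + (x @ xp) ** 2
print(np.isclose(lhs, rhs))                          # True: inner product in feature space
```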

Constructing Kernels

To simplify the solving process, define the kernel $k(x_i,x_j)\in\mathbb{R}$ as:

$$k(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle=\phi(x_i)\cdot \phi(x_j)$$

For $x_1,x_2,\dots,x_m$, the kernel matrix $K\in \mathbb{R}^{m\times m}$ built from $k$ is:

$$K_{ij}=k(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$$

Here $K$ is a symmetric, positive semi-definite matrix, satisfying these three conditions: 1) $K=K^\top$, 2) $x^\top K x\geq 0$ for all $x\in\mathbb{R}^m$, 3) all eigenvalues of $K$ are non-negative.
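As a small sketch, these properties can be checked numerically; here I build $K$ from the quadratic kernel above on random points (the data and tolerance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                         # 20 arbitrary points in R^3

k = lambda x, xp: 1 + x @ xp + (x @ xp) ** 2         # the quadratic kernel from above
K = np.array([[k(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                           # 1) K is symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)         # 3) eigenvalues non-negative (up to round-off)
```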

Common Kernels

Linear

$$k(x_i,x_j)=x_i^\top x_j=x_i\cdot x_j$$

Polynomial (homogeneous)

$$k(x_i,x_j)=(x_i^\top x_j)^r=(x_i\cdot x_j)^r$$

Polynomial (inhomogeneous)

$$k(x_i,x_j)=(x_i^\top x_j+d)^r=(x_i\cdot x_j+d)^r$$

Gaussian / Radial Basis Function (RBF): for $\sigma>0$

$$k(x,x')=\exp\left(-\frac{\|x-x'\|^2}{2\sigma^2}\right)$$
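For reference, these kernels are one-liners in code (a sketch; the parameter names mirror the formulas above, and the default values are arbitrary):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def poly_kernel(x, xp, r=2, d=0.0):
    # homogeneous when d == 0, inhomogeneous otherwise
    return (x @ xp + d) ** r

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), poly_kernel(x, xp, r=2, d=1.0), rbf_kernel(x, xp))
```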

Kernelization: expressing the original equations using $k$

  1. Show that the solution necessarily lies in the span of the training points: $\omega=\sum_{i=1}^{m}\alpha_ix_i$
  2. Rewrite the algorithm and the predictor in terms of inner products $x_i^\top x_j$
  3. Replace each inner product with the kernel: $x_i^\top x_j\Rightarrow \phi(x_i)^\top \phi(x_j)=k(x_i,x_j)$, and replace $x_i$ with $\phi(x_i)$

Iterative Solution (Comparing SVM and Kernel)

Soft-SVM

Optimization function

$$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \max\bigl(0,1-y_i(w^\top x_i+b)\bigr)+\lambda \|w\|^2$$

Classification vector $\omega$

$$\omega=\sum_{i=1}^m\alpha_iy_ix_i$$

Support vector

$$SV=\{i\in [m]: 0<\alpha_i<\tfrac{1}{2m\lambda}\}$$

Calculation of $b$

$$b=y_i-\omega^\top x_i$$

Prediction function

$$\operatorname{sign}(\omega^\top x+b) = \operatorname{sign}\Bigl(\sum_{i=1}^m \alpha_iy_i\,x_i^\top x+b\Bigr)$$

Perceptron Algorithm

Kernel

Optimization function

$$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \max\bigl(0,1-y_i(w^\top \phi(x_i)+b)\bigr)+\lambda \|w\|^2$$

Classification vector $\omega$

$$\omega=\sum_{i=1}^m\alpha_iy_i\phi(x_i)$$

Support vector

$$SV=\{i\in [m]: 0<\alpha_i<\tfrac{1}{2m\lambda}\}$$

Calculation of $b$ (for a support vector $j\in SV$)

$$b=y_j-\omega^\top \phi(x_j) = y_j-\sum_{i=1}^m\alpha_iy_i\,\phi(x_i)\cdot \phi(x_j) = y_j-\sum_{i=1}^m\alpha_iy_i\,\phi(x_i)^\top \phi(x_j) = y_j-\sum_{i=1}^m\alpha_iy_i\,k(x_i,x_j)$$

Prediction function

$$\operatorname{sign}(\omega^\top \phi(x)+b) = \operatorname{sign}\Bigl(\sum_{i=1}^m \alpha_iy_i\, k(x_i,x)+b\Bigr)$$

Perceptron Algorithm

Replace the inner product $x_i^\top x_j$ inside it with the kernel $k(x_i,x_j)$.
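To tie the kernel column above to a concrete computation, the sketch below fits scikit-learn's SVC with an RBF kernel on "two circles"-style toy data and reproduces its prediction from the dual coefficients $\alpha_iy_i$, i.e. $\operatorname{sign}(\sum_i \alpha_iy_i\,k(x_i,x)+b)$; the data and the choice $\gamma=\frac{1}{2\sigma^2}=0.5$ are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# "Two circles"-style toy data: inner disc vs. outer ring (illustrative values only)
rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
t = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[r * np.cos(t), r * np.sin(t)]
y = np.array([1] * 100 + [-1] * 100)

gamma = 0.5                                          # gamma = 1 / (2 sigma^2)
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# Reproduce the prediction: sign( sum_i (alpha_i y_i) k(x_i, x) + b ) over the support vectors
x_new = np.array([0.2, -0.1])
k_vals = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
score = clf.dual_coef_[0] @ k_vals + clf.intercept_[0]
print(np.sign(score), clf.predict(x_new[None])[0])   # the two agree
```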

Kernel Representation of Ridge Regression

Linear Ridge Regression

Original expression for ridge regression:

$$\min_w \quad \frac{1}{m}\sum_{i=1}^m(y_i-w^\top x_i)^2+\lambda\|w\|^2_2$$

Substituting $w = \sum_{i=1}^m\alpha_ix_i=X^\top \alpha$, where every entry of $XX^\top$ is an inner product $x_i^\top x_j$:

$$\min_\alpha \quad \frac{1}{m}\|Y-XX^\top \alpha\|^2+\lambda\alpha^\top X X^\top \alpha$$

The prediction is $w^\top x=\sum_{i=1}^m\alpha_ix_i^\top x=\alpha^\top X x$.

Kernel Ridge Regression

Replacing $XX^\top$ with $K$, where $K_{ij}=k(x_i,x_j)$:

$$\min_\alpha \quad \frac{1}{m}\|Y-K \alpha\|^2+\lambda\alpha^\top K \alpha$$

The prediction is $w^\top \phi(x)=\sum_{i=1}^m\alpha_ik(x_i,x)=\alpha^\top k_x$,

where $k_x = [k(x,x_1), \dots, k(x,x_m)]^\top$.

The optimal solution is $\alpha = (K+\lambda m I)^{-1}Y$.
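A minimal sketch of kernel ridge regression using this closed form, with the RBF kernel on a made-up 1-D regression problem ($\sigma$, $\lambda$, and the data are arbitrary choices):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# 1-D toy regression data (illustrative values only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

lam, m = 0.1, len(X)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(K + lam * m * np.eye(m), Y)   # alpha = (K + lambda*m*I)^{-1} Y

def predict(x_new):
    k_x = np.array([rbf(xi, x_new) for xi in X])      # k_x = [k(x, x_1), ..., k(x, x_m)]
    return alpha @ k_x                                # prediction alpha^T k_x

print(predict(np.array([1.0])), np.sin(1.0))          # the fit should be close to sin(1)
```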


Note: The content in this blog is class notes shared for educational purposes only. Some images and content are sourced from textbooks, teacher materials, and the internet. If there is any infringement, please contact aursus.blog@gmail.com for removal.