Linear SVM

Set up

Given $m$ data points $(x_1, y_1), \dots, (x_m, y_m)$, where $y_i$ is either 1 or -1, indicating which class $x_i$ belongs to. Each $x_i$ is a $d$-dimensional real vector.

The goal of SVM is to find a maximum-margin hyperplane that separates the points $x_i$ with $y_i=1$ from those with $y_i=-1$, maximizing the distance from the hyperplane to the nearest points of both groups.

SVM Objective: Find a hyperplane that maximizes the margin.

Hyperplane Definition:

$$\omega^\top x + b = 0$$

Margin Definition: The distance of the nearest point to the hyperplane.

$$\gamma(w,b) = \min_{i\in[m]}\frac{|w^\top x_i + b|}{\|w\|_2}$$
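For concreteness, here is a minimal sketch of computing $\gamma(w,b)$ with NumPy; the toy points and the candidate $(w, b)$ are made-up values for illustration only.

```python
import numpy as np

def margin(w, b, X):
    """Geometric margin: min_i |w^T x_i + b| / ||w||_2."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Toy points and an arbitrary candidate hyperplane (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(w, b, X))   # distance from the hyperplane to its closest point
```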

Hard Margin SVMs

Assumption: the data can be completely separated by a hyperplane $\omega_*^\top x + b_* = 0$.

Objective: Find a classifier that obtains the maximum margin.

Approach 1: Geometric interpretation (wiki approach)

First, two hyperplanes are chosen for the two groups of $x_i$ according to $y_i=1$ and $y_i=-1$:

For $y_i=1$:

Hyperplane 1: $\omega^\top x - b = 1$; all data points with $y_i = 1$ lie on or above this hyperplane.

For $y_i=-1$:

Hyperplane 2: $\omega^\top x - b = -1$; all data points with $y_i = -1$ lie on or below this hyperplane.

These two hyperplanes are parallel, and the maximum-margin hyperplane $\omega^\top x - b = 0$ lies halfway between them.

Finally, all data points satisfy the following:

For $y_i=1$: $\omega^\top x_i - b \geq 1$

For $y_i=-1$: $\omega^\top x_i - b \leq -1$

Combining both, we get $y_i(\omega^\top x_i - b) \geq 1$ for all $1 \leq i \leq m$.

The optimization problem and its constraints then follow:

$$\min_{\omega,\,b}\ \|\omega\| \quad \text{subject to} \quad y_i(\omega^\top x_i - b) \geq 1 \ \text{ for } i = 1,\dots,m$$

The resulting classifier can be written with the sign function: $\operatorname{sgn}(\omega^\top x - b)$.

Approach 2: Deriving from formulas (classroom approach)

The objective can be formulated mathematically as maximizing the margin over all separating hyperplanes:

$$\max_{\omega,b}\ \gamma(\omega,b) \;=\; \max_{\omega,b}\ \min_{i\in[m]}\frac{|\omega^\top x_i+b|}{\|\omega\|_2}$$

Since separability implies $y_i(\omega^\top x_i+b) > 0$, this simplifies to:

$$\max_{\omega,b}\ \frac{1}{\|\omega\|_2}\min_{i\in[m]} y_i(\omega^\top x_i+b)$$

Rescaling $(\omega, b)$ by any positive constant describes the same hyperplane, so the solution is not unique; uniqueness is ensured by adding the normalization:

$$\min_{i\in [m]}|\omega^\top x_i + b| = 1$$

Substituting this condition into the objective gives:

$$\max_{\omega,b}\ \frac{1}{\|\omega\|_2} \quad \text{s.t.}\quad \min_{i\in[m]} y_i(\omega^\top x_i+b) = 1$$

This is equivalent to a convex quadratic optimization problem, where both the objective function and the constraints are convex:

$$\min_{\omega,b}\ \frac{1}{2}\|\omega\|_2^2 \quad \text{s.t.}\quad y_i(\omega^\top x_i+b)\geq 1 \ \text{ for } i = 1,\dots,m$$
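As a sanity check, this primal problem can be handed directly to a general-purpose solver. Below is a minimal sketch using SciPy's SLSQP on a tiny, made-up separable data set; the variable packing and the data are my own illustrative choices, not part of the derivation above.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

# Pack the decision variables as z = [w_1, ..., w_d, b]
objective = lambda z: 0.5 * np.dot(z[:d], z[:d])              # (1/2)||w||^2
constraints = [{"type": "ineq",                                # y_i (w^T x_i + b) - 1 >= 0
                "fun": lambda z, i=i: y[i] * (X[i] @ z[:d] + z[d]) - 1.0}
               for i in range(m)]

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP", constraints=constraints)
w, b = res.x[:d], res.x[d]
print("w =", w.round(3), " b =", round(b, 3), " margin =", 1.0 / np.linalg.norm(w))
```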

Finding $\omega$ and $b$

To solve the optimization problem above and find the $\omega$ and $b$ that minimize the objective, we first obtain $\omega$ through the dual formulation and then derive $b$ using the support vectors.

Obtaining $\omega$ through the Dual Formulation

Starting point: the primal problem

$$\min_{\omega,b}\ \frac{1}{2}\|\omega\|_2^2 \quad \text{s.t.}\quad y_i(\omega^\top x_i+b)\geq 1,\ i=1,\dots,m$$

which we attack through Lagrangian duality.

Step 1: write the Lagrangian (with multipliers $\alpha_i \geq 0$):

$$L(\omega,b,\alpha)=\frac{1}{2}\|\omega\|_2^2+\sum_{i=1}^m\alpha_i\bigl(1-y_i(\omega^\top x_i+b)\bigr)$$

$$\text{Dual problem:}\quad \max_{\alpha\geq 0}\ \min_{\omega,b}\ L(\omega,b,\alpha)$$

Step 2: take the partial derivatives of $L$ with respect to $\omega$ and $b$ and set them to zero:

$$\frac{\partial L}{\partial \omega}=\omega-\sum_{i=1}^m\alpha_iy_ix_i=0 \;\Rightarrow\; \omega=\sum_{i=1}^m\alpha_iy_ix_i$$

$$\frac{\partial L}{\partial b}=-\sum_{i=1}^m\alpha_iy_i=0 \;\Rightarrow\; \sum_{i=1}^m\alpha_iy_i=0$$

Step 3: substitute these back into the Lagrangian and simplify:

$$L(\alpha)=\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j$$

subject to:

$$\sum_{i=1}^m\alpha_iy_i=0,\qquad \alpha_i\geq 0$$

Step 4: express the dual problem:

$$\max_{\alpha}\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^m\alpha_iy_i=0,\ \alpha_i\geq 0$$

Step 5: solve the problem from Step 4 to obtain $\alpha$; substituting $\alpha$ back then gives $\omega$:

$$\omega=\sum_{i=1}^m\alpha_iy_ix_i$$

Obtaining $b$ through Support Vectors

Definition of a support vector: any data point with $\alpha_i > 0$; these points fall exactly on the margin boundary and satisfy $y_i(w^\top x_i+b)=1$.

Let $SV = \{i\in [m]: \alpha_i > 0\}$, the set of all $i$ for which $\alpha_i$ is greater than 0.

Then $\omega$ can be written as:

$$\omega=\sum_{i\in SV}\alpha_i y_i x_i$$

Using the complementary slackness condition from the KKT conditions, we arrive at the following equation:

$$\alpha_i\bigl(1-y_i(\omega^\top x_i+b)\bigr)=0 \quad \text{for all } i \;\Rightarrow\; y_i(\omega^\top x_i+b)=1 \ \text{ for } i\in SV$$

Hence, we obtain $b$:

$$b=\frac{1}{|SV|}\sum_{i\in SV}\bigl(y_i-\omega^\top x_i\bigr)$$

In simpler terms: each $i$ with $\alpha_i > 0$ yields a value of $b$ (their corresponding $x_i$ and $y_i$ differ), and averaging these values reduces numerical noise, giving a more reliable $b$.

Result of the DUAL:

$$\max_{\alpha\geq 0,\ \sum_i\alpha_iy_i=0}\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_j\,x_i^\top x_j,\qquad \omega=\sum_{i=1}^m\alpha_iy_ix_i,\qquad h(x)=\operatorname{sign}(\omega^\top x+b)$$
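A minimal sketch of this pipeline on made-up toy data: solve the dual numerically (here with SciPy's SLSQP, an arbitrary solver choice), recover $\omega$ from $\alpha$, and average over the support vectors to get $b$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T          # G_ij = y_i y_j x_i^T x_j

# Dual: maximize sum(alpha) - 1/2 alpha^T G alpha  <=>  minimize the negative
dual_obj = lambda a: 0.5 * a @ G @ a - a.sum()
res = minimize(dual_obj, x0=np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                                 # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                   # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                      # b averaged over the support vectors
print("alpha =", alpha.round(3), " w =", w.round(3), " b =", round(b, 3))
```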

Soft Margin SVMs

Assumption: Data cannot be entirely separated by a hyperplane.

Due to non-separability, the constraint $y_i(\omega^\top x_i+b)\geq 1$ is infeasible. We can make it feasible by adding slack variables $\xi_i$ ($\xi_i\geq 0$) that relax the constraint:

$$y_i(\omega^\top x_i+b)\geq 1-\xi_i,\qquad \xi_i\geq 0$$

Where:

  • $\xi_i=0$ signifies point $i$ is correctly classified and satisfies the large-margin constraint.
  • $0<\xi_i<1$ signifies point $i$ is correctly classified but does not meet the large-margin constraint.
  • $\xi_i>1$ signifies point $i$ is misclassified.

To keep the $\xi_i$ from growing too large, we add a penalty to the original objective:

$$\min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|_2^2+C\sum_{i=1}^m\xi_i \quad \text{s.t.}\quad y_i(\omega^\top x_i+b)\geq 1-\xi_i,\ \xi_i\geq 0$$

Here $C$ controls the size of the penalty; as $C$ grows large, the problem reduces to the hard-margin SVM.

The above equation is still a convex quadratic optimization problem. We can rewrite it as the dual:

$$\max_\alpha\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^m\alpha_iy_i=0,\ 0\leq\alpha_i\leq C$$

Following steps similar to the hard margin SVM, we can solve using the dual.
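In practice one rarely solves this QP by hand. A short sketch using scikit-learn's SVC with a linear kernel (the data and the grid of $C$ values are arbitrary choices of mine) shows how the penalty $C$ trades margin width against violations:

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, not perfectly separable toy data (illustrative values only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.5, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.5, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):                        # larger C -> closer to hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"w={clf.coef_[0].round(2)}, b={clf.intercept_[0]:.2f}")
```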

Viewing from a Loss Minimization Perspective

Soft-SVM can also be written in the form of minimizing hinge loss:

$$\ell_{\text{hinge}}(y,\hat{y})=\max(0,1-y\hat{y})=\max\bigl(0,1-y_i(\omega^\top x_i-b)\bigr)$$

If $y_i(w^\top x_i-b)\geq 1$, i.e., $x_i$ is on the correct side of the margin, the loss is 0; otherwise it is $1-y_i(\omega^\top x_i-b)$, which grows with the point's distance from the margin.

Thus, SVM can be written as (where $\lambda>0$):

$$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \max\bigl(0,1-y_i(w^\top x_i+b)\bigr)+\lambda \|w\|^2$$

Upon introducing slack variables, it becomes:

$$\min_{w,b,\xi}\ \frac{1}{m}\sum_{i=1}^m\xi_i+\lambda\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i+b)\geq 1-\xi_i,\ \xi_i\geq 0$$

Now, if we set $C=\frac{1}{2\lambda m}$, this equation becomes the soft-SVM.
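The regularized hinge-loss form also suggests a simple (sub)gradient descent solver. The sketch below is my own minimal illustration of that idea (the step size, epoch count, and data are arbitrary choices), not the dual-based method described earlier:

```python
import numpy as np

def hinge_svm_gd(X, y, lam=0.01, lr=0.1, epochs=500):
    """(Sub)gradient descent on (1/m) sum_i max(0, 1 - y_i(w^T x_i + b)) + lam * ||w||^2."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                        # points that violate the margin
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / m
        grad_b = -y[active].sum() / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hinge_svm_gd(X, y)
print(np.sign(X @ w + b))    # should match y on this toy set
```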

Summary

Hard-Margin SVM

Optimization function

$$\max_\alpha\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^m\alpha_iy_i=0,\ \alpha_i\geq 0$$

Classification vector $\omega$

$$\omega=\sum_{i=1}^m\alpha_iy_ix_i$$

Support vector

$$SV=\{i\in [m]: 0<\alpha_i\}$$

Calculation of $b$

$$b=y_i-\omega^\top x_i$$

Prediction function

$$\operatorname{sign}(\omega^\top x+b) = \operatorname{sign}\Bigl(\sum_{i=1}^m \alpha_iy_i\,x_i^\top x+b\Bigr)$$

Soft-Margin SVM

Optimization function

$$\max_\alpha\ \sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\,x_i^\top x_j \quad \text{s.t.}\quad \sum_{i=1}^m\alpha_iy_i=0,\ 0\leq\alpha_i\leq C$$

Classification vector $\omega$

$$\omega=\sum_{i=1}^m\alpha_iy_ix_i$$

Support vector

$$SV=\{i\in [m]: 0<\alpha_i<C=\tfrac{1}{2m\lambda}\}$$

Calculation of $b$

$$b=y_i-\omega^\top x_i$$

Prediction function

$$\operatorname{sign}(\omega^\top x+b) = \operatorname{sign}\Bigl(\sum_{i=1}^m \alpha_iy_i\,x_i^\top x+b\Bigr)$$

Non-Linear Kernel SVM

Concept

Purpose: address non-linearly separable data that cannot be handled simply by adding slack, such as two concentric circles.

Approach: map the data to a higher-dimensional space using a feature map $\phi$, i.e., $x\mapsto \phi(x)$, where it becomes linearly separable.

Method: replace the dot product with a non-linear kernel.

Example

Linearizing data through $\phi$ to obtain a new predictor function

Using a feature map $\phi$, map the training data from the non-linear input space to a linear feature space:

$$\{(x_1,y_1),\dots,(x_m,y_m)\} \mapsto \{(\phi(x_1),y_1),\dots,(\phi(x_m),y_m)\}$$

Thus, we obtain a linear predictor function:

$$\omega^\top \phi(x)+b$$

Where $\phi(x)$ is as follows, and $d$ represents the dimensionality of $x$:

$$\phi(x) = [1,\,x_1,\dots,x_d,\,x_1^2,\,x_1x_2,\dots,x_d^2]^\top$$

Solving the optimization function using Soft-SVM’s method

Since solving directly in the feature space is too high-dimensional, we use the Soft-SVM dual, which only requires inner products:

Equivalent to working with the inner product $\phi(x)^\top \phi(x')$.

We can simplify it to:

$$\phi(x)^\top \phi(x') = 1+x^\top x'+(x^\top x')^2$$
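A quick numerical check of this identity, assuming the quadratic part of $\phi$ runs over all ordered index pairs $(i,j)$ (with other conventions the cross terms would need $\sqrt{2}$ factors):

```python
import numpy as np

def phi(x):
    """Feature map [1, x_1, ..., x_d, x_i * x_j for all ordered pairs (i, j)]."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)       # two arbitrary points in R^3
lhs = phi(x) @ phi(xp)
rhs = 1 + x @ xp + (x @ xp) ** 2
print(np.isclose(lhs, rhs))                          # True: inner product in feature space
```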

Constructing Kernels

To simplify the solving process, define the kernel $k(x_i,x_j)\in\mathbb{R}$ as:

$$k(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle=\phi(x_i)\cdot \phi(x_j)$$

For $x_1,x_2,\dots,x_m$, the kernel matrix $K\in \mathbb{R}^{m\times m}$ built from $k$ is:

$$K_{ij}=k(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$$

Here $K$ is a symmetric, positive semi-definite matrix, satisfying these three conditions: 1) $K=K^\top$, 2) $x^\top K x\geq 0$ for all $x\in\mathbb{R}^m$, 3) all eigenvalues of $K$ are non-negative.
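As a small sketch, these properties can be checked numerically; here I build $K$ from the quadratic kernel above on random points (the data and tolerance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                         # 20 arbitrary points in R^3

k = lambda x, xp: 1 + x @ xp + (x @ xp) ** 2         # the quadratic kernel from above
K = np.array([[k(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                           # 1) K is symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)         # 3) eigenvalues non-negative (up to round-off)
```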

Common Kernels

Linear

$$k(x_i,x_j)=x_i^\top x_j=x_i\cdot x_j$$

Polynomial (homogeneous)

$$k(x_i,x_j)=(x_i^\top x_j)^r=(x_i\cdot x_j)^r$$

Polynomial (inhomogeneous)

$$k(x_i,x_j)=(x_i^\top x_j+d)^r=(x_i\cdot x_j+d)^r$$

Gaussian / Radial Basis Function (RBF): for $\sigma>0$

$$k(x,x')=\exp\left(-\frac{\|x-x'\|^2}{2\sigma^2}\right)$$
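For reference, these kernels are one-liners in code (a sketch; the parameter names mirror the formulas above, and the default values are arbitrary):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def poly_kernel(x, xp, r=2, d=0.0):
    # homogeneous when d == 0, inhomogeneous otherwise
    return (x @ xp + d) ** r

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), poly_kernel(x, xp, r=2, d=1.0), rbf_kernel(x, xp))
```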

Kernelization: expressing the original equations using $k$

  1. Show that the solution necessarily lies in the span of the training points: $\omega=\sum_{i=1}^{m}\alpha_ix_i$
  2. Rewrite the algorithm and the predictor in terms of inner products $x_i^\top x_j$
  3. Replace each inner product with the kernel: $x_i^\top x_j\Rightarrow \phi(x_i)^\top \phi(x_j)=k(x_i,x_j)$, and replace $x_i$ with $\phi(x_i)$

Iterative Solution (Comparing SVM and Kernel)

Soft-SVM

Optimization function

$$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \max\bigl(0,1-y_i(w^\top x_i+b)\bigr)+\lambda \|w\|^2$$

Classification vector $\omega$

$$\omega=\sum_{i=1}^m\alpha_iy_ix_i$$

Support vector

$$SV=\{i\in [m]: 0<\alpha_i<\tfrac{1}{2m\lambda}\}$$

Calculation of $b$

$$b=y_i-\omega^\top x_i$$

Prediction function

$$\operatorname{sign}(\omega^\top x+b) = \operatorname{sign}\Bigl(\sum_{i=1}^m \alpha_iy_i\,x_i^\top x+b\Bigr)$$

Perceptron Algorithm

Kernel

Optimization function

$$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \max\bigl(0,1-y_i(w^\top \phi(x_i)+b)\bigr)+\lambda \|w\|^2$$

Classification vector $\omega$

$$\omega=\sum_{i=1}^m\alpha_iy_i\phi(x_i)$$

Support vector

$$SV=\{i\in [m]: 0<\alpha_i<\tfrac{1}{2m\lambda}\}$$

Calculation of $b$ (for a support vector $j\in SV$)

$$b=y_j-\omega^\top \phi(x_j) = y_j-\sum_{i=1}^m\alpha_iy_i\,\phi(x_i)\cdot \phi(x_j) = y_j-\sum_{i=1}^m\alpha_iy_i\,\phi(x_i)^\top \phi(x_j) = y_j-\sum_{i=1}^m\alpha_iy_i\,k(x_i,x_j)$$

Prediction function

$$\operatorname{sign}(\omega^\top \phi(x)+b) = \operatorname{sign}\Bigl(\sum_{i=1}^m \alpha_iy_i\, k(x_i,x)+b\Bigr)$$

Perceptron Algorithm

Replace the inner product $x_i^\top x_j$ inside it with the kernel $k(x_i,x_j)$.
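To tie the kernel column above to a concrete computation, the sketch below fits scikit-learn's SVC with an RBF kernel on "two circles"-style toy data and reproduces its prediction from the dual coefficients $\alpha_iy_i$, i.e. $\operatorname{sign}(\sum_i \alpha_iy_i\,k(x_i,x)+b)$; the data and the choice $\gamma=\frac{1}{2\sigma^2}=0.5$ are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# "Two circles"-style toy data: inner disc vs. outer ring (illustrative values only)
rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
t = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[r * np.cos(t), r * np.sin(t)]
y = np.array([1] * 100 + [-1] * 100)

gamma = 0.5                                          # gamma = 1 / (2 sigma^2)
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# Reproduce the prediction: sign( sum_i (alpha_i y_i) k(x_i, x) + b ) over the support vectors
x_new = np.array([0.2, -0.1])
k_vals = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
score = clf.dual_coef_[0] @ k_vals + clf.intercept_[0]
print(np.sign(score), clf.predict(x_new[None])[0])   # the two agree
```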

Kernel Representation of Ridge Regression

Linear Ridge Regression

Original expression for ridge regression:

$$\min_w \quad \frac{1}{m}\sum_{i=1}^m(y_i-w^\top x_i)^2+\lambda\|w\|^2_2$$

Substituting $w = \sum_{i=1}^m\alpha_ix_i=X^\top \alpha$, where every entry of $XX^\top$ is an inner product $x_i^\top x_j$:

$$\min_\alpha \quad \frac{1}{m}\|Y-XX^\top \alpha\|^2+\lambda\alpha^\top X X^\top \alpha$$

The prediction is $w^\top x=\sum_{i=1}^m\alpha_ix_i^\top x=\alpha^\top X x$.

Kernel Ridge Regression

Replacing $XX^\top$ with $K$, where $K_{ij}=k(x_i,x_j)$:

$$\min_\alpha \quad \frac{1}{m}\|Y-K \alpha\|^2+\lambda\alpha^\top K \alpha$$

The prediction is $w^\top \phi(x)=\sum_{i=1}^m\alpha_ik(x_i,x)=\alpha^\top k_x$,

where $k_x = [k(x,x_1), \dots, k(x,x_m)]^\top$.

The optimal solution is $\alpha = (K+\lambda m I)^{-1}Y$.
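A minimal sketch of kernel ridge regression using this closed form, with the RBF kernel on a made-up 1-D regression problem ($\sigma$, $\lambda$, and the data are arbitrary choices):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# 1-D toy regression data (illustrative values only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

lam, m = 0.1, len(X)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(K + lam * m * np.eye(m), Y)   # alpha = (K + lambda*m*I)^{-1} Y

def predict(x_new):
    k_x = np.array([rbf(xi, x_new) for xi in X])      # k_x = [k(x, x_1), ..., k(x, x_m)]
    return alpha @ k_x                                # prediction alpha^T k_x

print(predict(np.array([1.0])), np.sin(1.0))          # the fit should be close to sin(1)
```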


Note: The content in this blog is class notes shared for educational purposes only. Some images and content are sourced from textbooks, teacher materials, and the internet. If there is any infringement, please contact aursus.blog@gmail.com for removal.