Binary Classification

Assumptions

  1. Linearly separable data.
  2. Binary classification, meaning the data can be divided into two classes.

Objective

To categorize the data into two classes based on labels.

Setup

Input space

$\mathcal{X} = \mathbb{R}^d$

Label space / Output space

$\mathcal{Y} = \{-1, 1\}$

Classifier | Hypothesis class $\mathcal{F}$

$$\mathcal{F} := \{x \mapsto \text{sign}(\mathbf{w}^\top \mathbf{x} + b)\ |\ \mathbf{w}\in\mathbb{R}^d,\ b\in\mathbb{R}\}$$

where $\mathbf{w}$ is the weight vector, $b$ is the bias, and $\text{sign}$ is the sign function defined as:

$$\text{sign}(a)=\begin{cases} 1 & \text{if } a\geq 0\\ -1 & \text{otherwise} \end{cases}$$

To simplify subsequent calculations, we want the classifier to depend only on $w$, without a separate bias $b$. To do this, we append a constant 1 as an extra coordinate of $x$ and absorb $b$ into the corresponding coordinate of $w$:

$$x \mapsto \begin{bmatrix}x\\1\end{bmatrix}, \qquad w \mapsto \begin{bmatrix}w\\b\end{bmatrix}$$

The original classifier simplifies to

$$\mathcal{F} := \{x \mapsto \text{sign}(\mathbf{w}^\top \mathbf{x})\ |\ \mathbf{w}\in\mathbb{R}^{d+1}\}$$
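As a minimal sketch in NumPy (the helper names `sign`, `augment`, and `predict` are illustrative, not from the notes), the augmented linear classifier can be written as:

```python
import numpy as np

def sign(a):
    # Sign convention from the notes: sign(a) = 1 if a >= 0, else -1.
    return np.where(a >= 0, 1, -1)

def augment(X):
    # Append a constant 1 to every row of X so the bias b can be absorbed into w.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def predict(w, X):
    # f(x) = sign(w^T x) on the augmented inputs; w has d+1 entries (the last one plays the role of b).
    return sign(augment(X) @ w)
```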

Empirical Risk Minimizer

We use the following loss function to define the empirical risk minimizer (ERM):

$$\ell(f(x),y)=\begin{cases} 0 & \text{if } f(x)=y\\ 1 & \text{otherwise} \end{cases}$$

This is known as the 0-1 loss: 1 indicates a prediction error and 0 a correct prediction. It can also be written using the indicator function $\mathbb{I}$:

$$\mathbb{I}[f(x)\neq y] = \begin{cases} 1 & \text{if } f(x)\neq y\\ 0 & \text{if } f(x)= y \end{cases}$$

Substituting the classifier into the loss function, the ERM problem becomes:

$$\begin{aligned} \min_w \hat{\text{err}}(w)&=\min_w \frac{1}{m}\sum_{i=1}^m \mathbb{I}[\text{sign}(w^\top x_i)\neq y_i]\\ &= \min_w \frac{1}{m}\sum_{i=1}^m \mathbb{I}[y_i w^\top x_i<0] \end{aligned}$$

The $w$ that minimizes the empirical risk is denoted $\hat w$:

$$\begin{aligned} \hat w&=\arg\min_w \frac{1}{m}\sum_{i=1}^m \mathbb{I}[\text{sign}(w^\top x_i)\neq y_i]\\ &=\arg\min_w \frac{1}{m}\sum_{i=1}^m \mathbb{I}[y_i w^\top x_i<0] \end{aligned}$$
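A small sketch of the empirical risk $\hat{\text{err}}(w)$ in NumPy (assuming already-augmented inputs `X` and labels `y` in $\{-1, 1\}$; the function name is illustrative):

```python
import numpy as np

def empirical_risk(w, X, y):
    # 0-1 empirical risk: the fraction of points with sign(w^T x_i) != y_i.
    # X is assumed to be already augmented; y contains labels in {-1, +1}.
    preds = np.where(X @ w >= 0, 1, -1)   # sign(w^T x_i), with sign(0) = +1
    return np.mean(preds != y)
```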

For linearly separable data, the minimizer $\hat w$ achieves zero empirical risk and is denoted $w_*$. In that case, $w_*$ satisfies $y_i=\text{sign}(w_*^\top x_i)$ for every point, and the hyperplane $w_*^\top x = 0$ is the boundary between the two classes.

Perceptron Algorithm

Objective

Update $w$ until all data points are correctly classified.

Applicability

  1. Data points must be completely linearly separable.
  2. The perceptron cannot be used on data that is not linearly separable, such as XOR.
    1. Such non-linearly separable data can be separated using a kernel SVM.

Algorithm

Update Rule

If a point is misclassified, i.e., $\text{sign}(w_t^\top x_i)\neq y_i$ (equivalently $y_i w_t^\top x_i<0$), update according to the rule $w_{t+1}=w_t+y_ix_i$.

If $y_i=1$, then $w_{t+1}^\top x_i = w_t^\top x_i +\|x_i\|^2 > w_t^\top x_i$: the score of $x_i$ moves in the positive direction.

If $y_i=-1$, then $w_{t+1}^\top x_i = w_t^\top x_i -\|x_i\|^2 < w_t^\top x_i$: the score of $x_i$ moves in the negative direction.
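Putting the update rule into a full loop, here is a minimal sketch of the perceptron (assuming NumPy, already-augmented inputs, and labels in $\{-1, 1\}$; `max_epochs` is an illustrative safeguard for non-separable data, which the convergence result below does not need):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    # X: (m, d) already-augmented inputs; y: (m,) labels in {-1, +1}.
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            # Update whenever y_i * w^T x_i <= 0. Treating a zero score as a
            # mistake is the usual convention (and matters since w starts at 0).
            if y[i] * (w @ X[i]) <= 0:
                w = w + y[i] * X[i]     # update rule: w_{t+1} = w_t + y_i x_i
                mistakes += 1
        if mistakes == 0:               # every point is correctly classified
            break
    return w
```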

Margin $\gamma$ and Convergence of the Algorithm

Assumptions: all points lie within the unit ball, $\|x_i\|_2\leq 1$, and the optimal weight vector is normalized, $\|w_*\|=1$.

Conclusion: the perceptron algorithm converges after at most $1/\gamma^2$ updates, returning a classifier with $\text{sign}(w^\top x_i)=y_i$ for all $i$; at that point, all points are correctly classified.

Proof: Refer to https://machine-learning-upenn.github.io/assets/notes/Lec2.pdf

Margin $\gamma$: the minimum distance of all points to the hyperplane.

$$\gamma=\min_{i\in[m]}\frac{|w_*^\top x_i|}{\|w_*\|}=\min_{i\in[m]}|w_*^\top x_i|$$

where the second equality uses the normalization $\|w_*\|=1$.
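A short sketch of computing the margin from a separating $w_*$ and the (augmented) data matrix `X` (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def margin(w_star, X):
    # gamma = min_i |w_*^T x_i| / ||w_*||; equal to min_i |w_*^T x_i| when ||w_*|| = 1.
    return np.min(np.abs(X @ w_star)) / np.linalg.norm(w_star)
```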


References:

https://www.cs.cornell.edu/courses/cs4780/2022sp/notes/LectureNotes06.html

https://machine-learning-upenn.github.io/calendar/


Declaration: The content of this blog is class notes and is for sharing purposes only. Some images and content are sourced from textbooks, lecture materials, and the internet. If there is any infringement, please contact aursus.blog@gmail.com for removal.