BME | Machine Learning - K means Clustering

K means Clustering

Assumptions

There are $k$ subsets $C_1, C_2, C_3, ..., C_k$ , and $1, ..., n$ data points, and all data points satisfy the following conditions:

Each point belongs to a cluster.
Each cluster is non-overlapping.

Defining Distance

Z ( C _ { 1 } , \cdots , C _ { k } ) = \sum _ {l = 1 } ^ { k } \frac { 1 } { 2 | C _ { l } | } \sum _ {i , j \in C _ { l }} { \| x _ { i } - x _ j\|_2 ^ { 2 }}

where $|C_k|$ denotes the number of observations in the kth cluster.

The goal is to minimize the pairwise squared distance for each cluster.

Z ( C _ { 1 } , \cdots , C _ { k } ) = \sum _ { l = 1 } ^ { k } \sum _ { i \in C_l } ^ { k } | | x _ { i } - \mu_l | | _ { 2 } ^ { 2 }

where

\mu _ { l } = \frac { 1 } { | C _ { l } | } \sum _ { i \in C _ { l } } x _ { i }

Iterative Objective

\operatorname { m i n } _ { c _ { 1 }, \cdots , c _ { k } } Z ( C _ { 1 } , \cdots , C _ { k } )

Algorithm | Lloyd’s algorithm

Not necessarily optimal because if the initialized points are not good, it can converge to a suboptimal clustering.

input: $x_1, x_2, ..., x_k$
initial: $C_1, C_2, ..., C_k$
loss function

Z ( C _ { 1 } , \cdots , C _ { k } ) = \sum _ {l = 1 } ^ { k } \frac { 1 } { 2 | C _ { l } | } \sum _ {i , j \in C _ { l }} { \| x _ { i } - x _ j\|_2 ^ { 2 }}

If not yet converged,
1. Update centroids.
2. Update clusters.

Choosing k: Determining Size

If $Z_{k+1}<<Z_k$ , it is necessary to increase k, i.e., $k\rightarrow k+1$ .

Finding the position of the knee in the loss function-k function

Choosing initial cluster assignment

Everyone tries several random initial cluster assignments and then selects the one with the lowest result using Lloyd’s algorithm.
Or, through the k-means++ method: randomly select the first data point as a centroid, then choose subsequent centroids based on a distance-weighted distribution from the already selected centroids.

Decision Boundaries

The region $\{x\in \mathbb{R}^d\ \ |\ \ ||x-\mu_l||_2 = ||x-\mu_{l'}||_2\}$ can be computed using two cluster centroids $\mu_l$ , $\mu_{l'}$ .

Reference:

https://www.cs.cornell.edu/courses/cs4780/2022sp/notes/LectureNotes04.html

Note: The content in this blog is class notes shared for educational purposes only. Some images and content are sourced from textbooks, teacher materials, and the internet. If there is any infringement, please contact aursus.blog@gmail.com for removal.