In this lesson we are going discuss clustering under the following topic:

Clustering is a class of unsupervised learning concept or machine learning whose goal is to identify groups or clusters within datapoints in a multi-dimensional space.

Suppose we have data sets

We can define a cluster as a group of data points whose inter-point distances are minimal compared with the distance with points that are outside the cluster.

Let's assume a set of D-dimensional vectors μ

Our goal is to find the assignment of data points to clusters as well as a set of vectors { μ

The 1-of-k Coding Scheme is a coding scheme applied to a K-dimensional vector x in which one of the elements x

In this case, for each data point x, we introduce a corresponding set of binary indicator variables

If x

We can then define an objective function J know as distortion measure

This represents the sum of the squares of the distances of each of the data points from its assigned vector μ

The goal is to choose values for {r

The algorithm is an iterative algorithm where for each iteration, two steps are taken that optimizes the variables

First, choose an initial value for μ

In the first phase: we minimize J with respect to

In the second phase, we minimize J with respect to μ

Repeat the two steps until convergence.

- Introduction to Clustering
- Formalized Definition of k-Means Clustering
- 1-of-K Coding Scheme
- The Expectation and Maximization Algorithm

**Introduction to Clustering**Clustering is a class of unsupervised learning concept or machine learning whose goal is to identify groups or clusters within datapoints in a multi-dimensional space.

Suppose we have data sets

*{x*consisting of N observations of a random D-dimensional Euclidean variable x. The goal is to partition the data into K clusters when the value of K is preknown._{1}, x_{2}, ..., x_{n}}We can define a cluster as a group of data points whose inter-point distances are minimal compared with the distance with points that are outside the cluster.

**Formalization of k-Means Clustering**Let's assume a set of D-dimensional vectors μ

_{k}where k = 1, 2, ..., K and μ_{k}represents the centroid of the cluster k. Take centroid to be the mean of data points in the cluster.Our goal is to find the assignment of data points to clusters as well as a set of vectors { μ

_{k}} such that the sum of squares of distances of each data point to its closest vector μ_{k}is at a minimum.**1-of-K Coding Scheme**The 1-of-k Coding Scheme is a coding scheme applied to a K-dimensional vector x in which one of the elements x

_{k}equals 1 and all the remaining elements 0. This means that a data point cannot belong to two clusters at the same time.In this case, for each data point x, we introduce a corresponding set of binary indicator variables

*r*_{nk}∈ {0,1}.If x

_{n}is assigned to μ_{k}then*r*and r_{nk}= 1_{nj}= 0 for j ≠ kWe can then define an objective function J know as distortion measure

This represents the sum of the squares of the distances of each of the data points from its assigned vector μ

_{k}The goal is to choose values for {r

_{nk}} and {μ_{k}} so as to minimize J. This brings us to the next section where we would examine the algorithm.**The Expectation and Maximization Algorithm(k-Means)**The algorithm is an iterative algorithm where for each iteration, two steps are taken that optimizes the variables

*r*and μ_{nk}_{k}*The algorithm goes this way.*First, choose an initial value for μ

_{k}In the first phase: we minimize J with respect to

*r*while keeping_{nk}*μ*fixed._{k}In the second phase, we minimize J with respect to μ

_{k}while keeping*r*fixed._{nk}Repeat the two steps until convergence.