In this article, we explain the basic concept of k-means clustering and walk through the k-means clustering algorithm.


**What is k-Means Clustering**

k-Means is a clustering method that partitions your data into groups called clusters. If you have N data points, the k-means algorithm assigns each data point to the cluster with the nearest mean.

Note: the k in k-means represents the number of clusters, that is, k clusters.

**How it works**

Suppose we have a set of observations {x_1, x_2, ..., x_N}, consisting of N random variables, where each x is a D-dimensional real vector. The goal is to partition the data set into some number K of clusters, where the value of K is known. A cluster is a group of data points whose inter-point distances are small compared with the distances to points outside the cluster.

The first step is to find the means m_k, for k = 1, ..., K, where m_k is the mean associated with the k-th cluster. We then assign each data point to a cluster such that the sum of squares of the distances from each data point to its closest mean m_k is minimized.

**Steps of the k-Means Algorithm**

**Step 1:** For each point x in the input space, place it in the cluster whose current centroid is nearest to it.

**Step 2:** After all points (x_1, ..., x_N) have been assigned, adjust the locations of the centroids of the K clusters.

**Step 3:** Reassign all the points to their nearest centroid.

**Repeat steps 2 and 3 until convergence.**

Convergence occurs when points no longer move between clusters and the centroids have stabilized.
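The steps above can be sketched in Python. This is a minimal illustration using NumPy; the function and variable names are my own, not from the article:

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of the nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Convergence: centroids stabilized, so assignments no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Note that this sketch does not handle the edge case of a cluster losing all its points; production implementations typically re-seed such a centroid.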

The algorithm will become clearer when we consider an example later in this article.

Let's now look in a little more detail at how the clusters are assigned.

**Assignment of Clusters**

(based on "Pattern Recognition and Machine Learning" by Christopher M. Bishop)

For each data point x_n, we use a corresponding set of binary indicator variables r_nk ∈ {0, 1}, where k = 1, ..., K. r_nk describes which of the K clusters the data point x_n is assigned to: if x_n is assigned to cluster k, then r_nk = 1 and r_nj = 0 for j ≠ k. Note that the two subscripts indicate that there will be two loops:
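The indicator variables r_nk form an N × K matrix whose rows are one-hot. A small sketch (NumPy; the helper name is my own):

```python
import numpy as np

def indicator_matrix(labels, k):
    """Build the N x K matrix r, where r[n, j] = 1 if point n is
    assigned to cluster j and 0 otherwise (each row is one-hot)."""
    r = np.zeros((len(labels), k), dtype=int)
    r[np.arange(len(labels)), labels] = 1
    return r
```

For example, assignments [0, 1, 1] with K = 2 produce the rows [1, 0], [0, 1], [0, 1].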

**Loop 1 (1 to N):** iterate through all the data points

**Loop 2 (1 to K):** iterate through all the clusters

Now, for each iteration, you need to calculate a value J, called the distortion measure, which is given by the sum-of-squares function:

J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖x_n − m_k‖²

**Let's try to understand this formula**

First note that this formula is sometimes called the distortion measure, since it represents how far each data point x_n is separated from the mean m_k of its cluster. In other words, J represents the sum of squares of the distances from each data point to its assigned mean vector m_k.

- N is the total number of data points
- K is the number of clusters
- x_n is the vector of measurement n
- m_k is the mean for cluster k
- r_nk is an indicator variable that indicates whether x_n is assigned to cluster k
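As a sketch of the two loops described above (NumPy; the function name is my own), the distortion measure J can be computed directly from its definition:

```python
import numpy as np

def distortion(points, centroids, labels):
    """Distortion measure J: the sum over all points n and clusters k of
    r_nk * ||x_n - m_k||^2, where r_nk selects the assigned cluster."""
    n_points, k = len(points), len(centroids)
    j = 0.0
    for n in range(n_points):      # Loop 1: over the N data points
        for c in range(k):         # Loop 2: over the K clusters
            r_nc = 1 if labels[n] == c else 0   # indicator variable r_nk
            j += r_nc * np.sum((points[n] - centroids[c]) ** 2)
    return j
```

When every point sits exactly on its assigned centroid, J is 0; moving points away from their centroids increases J.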

The goal is to find the values of the {r_nk} and the m_k that give the least value of J. This is achieved by the use of the k-means algorithm.

**More Explanation of the k-Means Algorithm**

This algorithm is an iterative procedure involving the following steps:

Input: a set of points x_1, ..., x_N.

Choose a value for K.

Place m_1, ..., m_K (let's call these the centroids) at random locations within your data space.

Repeat the following steps until convergence:

For each point x_i:

- find the nearest centroid c_j (compute the distance between x_i and m_j for every centroid m_j, and pick the cluster with the minimum distance)
- assign the point x_i to the cluster of that nearest centroid

For each cluster j = 1, ..., K:

- take all the data points assigned to cluster j in the previous step
- set the new centroid c_j to the mean of those points
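The "find the nearest centroid" step can be sketched on its own (NumPy; the helper name is an assumption of mine, not from the article):

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Return the index j of the centroid m_j closest to point x,
    by comparing the distance from x to every centroid."""
    dists = np.linalg.norm(centroids - x, axis=1)
    return int(dists.argmin())
```

For example, with centroids at (0, 0) and (10, 10), the point (1, 1) maps to the first centroid and (9, 9) to the second.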

**Example of k-Means Clustering**

The example below is based on K = 2 and N = 14.

Take some time to examine the progression through the steps and make sure you understand how the algorithm works.
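The article's original step-by-step figures are not reproduced here. As a stand-in, the sketch below runs k-means with K = 2 on 14 illustrative 2-D points (the data values are my own, chosen for the demo, not the article's) and prints the cluster assignment at each iteration so you can watch the progression:

```python
import numpy as np

# 14 illustrative 2-D points (invented for this demo), forming two loose groups
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [1.0, 2.5], [2.5, 2.0],
    [2.0, 3.0], [1.5, 1.5],                        # group around (1.5, 2)
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0], [8.0, 9.5],
    [9.5, 9.0], [9.0, 10.0], [8.5, 8.5],           # group around (8.6, 9)
])

K = 2
centroids = points[[0, 1]]          # deliberately poor initial centroids
for step in range(10):
    # Assignment step: nearest centroid for every point
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    print(f"step {step}: assignments {labels.tolist()}")
    # Update step: move centroids to the mean of their assigned points
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(K)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

Even starting from two centroids inside the same group, the assignments settle into the two natural clusters within a few iterations.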