How does cluster analysis work? In simple terms

This article is a simple description of the process of k-means cluster analysis, prepared for students who are brand new to the concept. If you require a more technical description, please review how the calculation in the Excel template works.

Cluster analysis – in simple terms

To best understand the concept of cluster analysis, let’s use a simple example involving ten consumers. To keep it really simple, let’s assume that we are using only one marketing variable to segment these consumers: their customer satisfaction scores. These scores were obtained from a survey in which consumers were asked, “On a scale of 1 to 9, with 9 being extremely satisfied and 1 being extremely dissatisfied, how satisfied are you with the brand?”

The results of this question for each of the ten consumers were:

  • Consumer A = 2
  • Consumer B = 4
  • Consumer C = 6
  • Consumer D = 6
  • Consumer E = 4
  • Consumer F = 2
  • Consumer G = 8
  • Consumer H = 8
  • Consumer I = 3
  • Consumer J = 8

Cluster analysis will then use these scores to group the consumers into market segments. This example uses two segments, but you can have more than two if required.

Step one – Randomly select two starting points to represent the two segments

The first step is to randomly select any two consumers as a “guess” at what the typical (average) consumer might be in each segment. These will be the initial centers (or centroids) of the market segments. There is no logic to this “guess” under this method (see this external link for other methods); it is just used as a starting point (or initial seed).

For this example, let’s assume that Consumer D (with a score of 6) and Consumer G (with a score of 8) were randomly selected to be the typical representatives of the two segments.

Step two – Classify the other consumers to the closest segment

Given we now have our random guess for each segment – with the first segment having a customer satisfaction score of 6 on average and the second segment having a score of 8 – we can then allocate the remaining consumers to the closest segment. In this example, scores below 7 will be allocated to segment one (with a random starting point of 6), and scores above 7 will be allocated to segment two (with its starting point of 8). A score of exactly 7 would be equally close to both starting points, but no consumer in this example gave that score.

We then end up with this initial “guess” at which consumers fit into which segment:

Segment One

  • Consumer A = 2
  • Consumer B = 4
  • Consumer C = 6
  • Consumer D = 6
  • Consumer E = 4
  • Consumer F = 2
  • Consumer I = 3
  • Segment average = 3.9

Segment Two

  • Consumer G = 8
  • Consumer H = 8
  • Consumer J = 8
  • Segment average = 8.0
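
The allocation in step two can be sketched in a few lines of Python. This is only an illustration of the idea, not the calculation used in the Excel template; the names `scores`, `centers` and `closest_segment` are made up for this example, and a tie (a score equally close to both starting points) is assumed to go to segment one.

```python
# Satisfaction scores from the survey in this example.
scores = {"A": 2, "B": 4, "C": 6, "D": 6, "E": 4,
          "F": 2, "G": 8, "H": 8, "I": 3, "J": 8}

# Starting points: Consumer D (6) for segment one, Consumer G (8) for segment two.
centers = [scores["D"], scores["G"]]


def closest_segment(score, centers):
    """Return the index of the nearest center (a tie goes to the first one)."""
    return min(range(len(centers)), key=lambda i: abs(score - centers[i]))


# Allocate every consumer to the segment whose starting point is closest.
segments = [[], []]
for name, score in scores.items():
    segments[closest_segment(score, centers)].append(name)

# Average satisfaction score of each segment.
averages = [sum(scores[n] for n in seg) / len(seg) for seg in segments]

print(segments)                          # [['A', 'B', 'C', 'D', 'E', 'F', 'I'], ['G', 'H', 'J']]
print([round(a, 1) for a in averages])   # [3.9, 8.0]
```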

Step three – Reallocate consumers according to the segment averages

We now have seven consumers initially allocated to segment one and three consumers initially allocated to segment two. If we average the customer satisfaction scores of the consumers currently in segment one we find that the average is 3.9. And the average of the customer satisfaction scores for segment two is 8.0.

Now, instead of using a randomly selected consumer/respondent to be the representative of the segment, we use the average score as determined above. This will mean that consumers scoring between 1.0 and 5.9 will now be allocated to segment one, and consumers scoring between 6.0 and 9.0 will now be allocated to segment two. These consumers are simply being allocated to the closest average. For example, a consumer with a score of 6 is slightly closer to segment two (a distance of 2.0 from its average of 8.0) than to segment one (a distance of 2.1 from its average of 3.9).

Once we have undertaken this reallocation process, we end up with the following groupings or segments:

Segment One

  • Consumer A = 2
  • Consumer B = 4
  • Consumer E = 4
  • Consumer F = 2
  • Consumer I = 3
  • Segment average = 3.0

Segment Two

  • Consumer C = 6
  • Consumer D = 6
  • Consumer G = 8
  • Consumer H = 8
  • Consumer J = 8
  • Segment average = 7.2
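
As a rough check of step three, the reallocation can be reproduced in Python. This sketch starts from the step-two averages (27/7, which rounds to 3.9, and 8.0) rather than from the random starting points; the variable names are illustrative only.

```python
# Satisfaction scores from the survey in this example.
scores = {"A": 2, "B": 4, "C": 6, "D": 6, "E": 4,
          "F": 2, "G": 8, "H": 8, "I": 3, "J": 8}

# Segment averages found in step two: 27/7 (about 3.9) and 8.0.
centers = [27 / 7, 8.0]

# Reallocate each consumer to the closest segment average.
segments = [[], []]
for name, score in scores.items():
    distance_one = abs(score - centers[0])
    distance_two = abs(score - centers[1])
    nearest = 0 if distance_one <= distance_two else 1
    segments[nearest].append(name)

# Recalculate the average of each revised segment.
new_averages = [sum(scores[n] for n in seg) / len(seg) for seg in segments]

print(segments)                              # [['A', 'B', 'E', 'F', 'I'], ['C', 'D', 'G', 'H', 'J']]
print([round(a, 1) for a in new_averages])   # [3.0, 7.2]
```

Consumers C and D move to segment two because a score of 6 is 2.0 away from 8.0 but about 2.1 away from 3.9.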

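Step four – Repeat several times (iterations) until no more improvements can be made

This process of finding the average (mean) of each segment and then reallocating consumers is repeated until no consumer changes segment from one round to the next. In this example, one more round changes nothing: with averages of 3.0 and 7.2, every consumer is already in the closest segment, so the process stops.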

You’ll notice that in this simple example, despite starting with two random points for the segments, the process was very quickly able to separate high and low scores into two distinct segments.
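
Putting the steps together, the whole process for a single variable might look like the sketch below. It is a minimal illustration under stated assumptions, not the Excel template's calculation: the function name `k_means_1d` and the `max_iterations` safety limit are invented for this example, ties go to the lower-numbered segment, and an empty segment simply keeps its previous center.

```python
def k_means_1d(scores, centers, max_iterations=100):
    """Repeat 'allocate, then average' until the segment averages stop changing."""
    centers = list(centers)
    for _ in range(max_iterations):
        # Allocate each consumer to the closest current center.
        segments = [[] for _ in centers]
        for name, score in scores.items():
            nearest = min(range(len(centers)),
                          key=lambda i: abs(score - centers[i]))
            segments[nearest].append(name)

        # Recalculate each segment's average (keep the old center if empty).
        new_centers = [
            sum(scores[n] for n in seg) / len(seg) if seg else centers[i]
            for i, seg in enumerate(segments)
        ]

        # Stop once a round produces no change.
        if new_centers == centers:
            break
        centers = new_centers
    return segments, centers


scores = {"A": 2, "B": 4, "C": 6, "D": 6, "E": 4,
          "F": 2, "G": 8, "H": 8, "I": 3, "J": 8}

# Start from Consumer D (6) and Consumer G (8), as in step one.
segments, centers = k_means_1d(scores, [6, 8])
print(segments)   # [['A', 'B', 'E', 'F', 'I'], ['C', 'D', 'G', 'H', 'J']]
print(centers)    # [3.0, 7.2]
```

With these starting points the loop settles on the same two segments as the worked example, stopping on the third round because nothing changes.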

In summary

Conceptually, cluster analysis is generally as simple as it has been described above. It is a statistical process that refines, step by step, which segment a consumer belongs to based on scores and other data, to form related sets of consumers (market segments with similar characteristics).

Essentially, while the Excel spreadsheet template available on this site handles up to eight variables (not just one as in this example) and allocates consumers to up to five segments (not just two as in this case), the general principle of allocating consumers to the most relevant segment through a series of repeated statistical steps is the same.
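
To give a feel for how the same idea extends to more than one variable, here is a small sketch. It assumes that "closest" is measured by straight-line (Euclidean) distance, which is the usual choice for k-means; this article does not say how the Excel template measures distance. The two-variable figures below are hypothetical values chosen only to show the arithmetic.

```python
import math


def closest_segment(consumer, centers):
    """Index of the center with the smallest straight-line distance."""
    return min(range(len(centers)),
               key=lambda i: math.dist(consumer, centers[i]))


def segment_average(members):
    """Average each variable separately across the segment's members."""
    return [sum(values) / len(values) for values in zip(*members)]


# Hypothetical segment centers on two variables, each scored from 1 to 9.
centers = [(3.0, 2.0), (7.0, 8.0)]

# A hypothetical consumer scoring 6 on the first variable and 7 on the second.
consumer = (6, 7)

print(closest_segment(consumer, centers))   # 1, i.e. the second segment
print(segment_average([(2, 4), (4, 6)]))    # [3.0, 5.0]
```

The only changes from the one-variable case are the distance measure and the fact that each segment's average is now a set of averages, one per variable.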

Therefore, if you can understand this simple example, then you should have a good idea of how cluster analysis works for marketing analysis purposes.

Related topics

How the cluster analysis calculation works in the template