Why does cluster analysis sometimes give different results?

Contents

1 Cluster analysis = different results
- 1.1 Random starting points will influence results = which is good!
- 1.2 So which approach should I use?
  - 1.2.1 Why not have fixed seeds in the clustering template?

Cluster analysis = different results

Cluster analysis is a purely statistical approach to the data. It makes no assumptions about, and has no knowledge of, what the data means in marketing terms; it simply groups (clusters) data points based on proximity. In other words, the formula looks only at the numbers, not at what they represent.

The approach to cluster analysis used in the free template on this website is known as k-means clustering. While there are many variations (algorithms) that can be used in cluster analysis, k-means generally requires the initial center of each proposed market segment to be randomly selected. For more information, please review the articles “a simple guide to cluster analysis” and “how the template calculation works”.

Random starting points will influence results = which is good!

Because the process is kick-started from randomly chosen centers, different market segment centers (averages/means) are quite likely to form from one run to the next. This can occur even when the same data set is re-run in the same statistical package with the same number of market segments.
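The effect of random starting points can be seen in a minimal k-means sketch. This is an illustration only, not the template's actual calculation: the `kmeans` function, the `seed` parameter, and the small two-variable `data` set are all hypothetical, and the starting centers are assumed to be drawn from the data points themselves.

```python
import random

def kmeans(points, k, seed=None, max_iter=100):
    """Plain k-means with randomly chosen starting centers (illustrative only)."""
    rng = random.Random(seed)
    # Random start: pick k distinct data points as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so this run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Hypothetical data: 11 respondents scored on two variables.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
        (9.0, 1.0), (9.1, 1.2), (8.9, 0.9), (5.0, 1.0), (5.1, 3.0)]

# Same data, same number of segments, different random starts.
for seed in (1, 2, 3):
    centers, labels = kmeans(data, k=3, seed=seed)
    print(seed, sorted(tuple(round(v, 2) for v in c) for c in centers))
```

Depending on which points are drawn as starting centers, the printed centers may or may not match between seeds; the two points that sit between groups are the ones most likely to move.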

In the approach used by the templates available on this website, the Excel Solver has been instructed to seek different seeds (starting points) each time it attempts to minimize SSE (the sum of squared errors) and find the best fit of the data provided to clusters (or segments).
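Written out, the SSE that Solver minimizes usually takes the following form in k-means. This is a sketch that assumes the template follows the standard definition; the symbols are notation introduced here for illustration, not taken from the template.

```latex
\mathrm{SSE} = \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Here k is the number of clusters, C_j is the set of cases allocated to cluster j, and \mu_j is that cluster's center (mean). A lower SSE means the cases sit closer to their cluster centers.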

This means that each time the macro calculation runs in Excel on the template, a different configuration of clusters is likely to be produced. This does not mean that one configuration is superior to another; each is simply a different way of looking at the same data set. This is good! By looking at the same data differently from others, including your competitors, you are more likely to find key market insights and a more effective way of viewing the market.

So which approach should I use?

Cluster analysis is not a simple “one-run” type of analysis. Markets, customers, and other data sets can be split up (segmented or clustered) in many different ways. So when using cluster analysis, we would typically run the program multiple times and look at the various results.

Each run is likely to deliver a different configuration, that is, different central means and different allocations of cases/respondents to clusters. We would therefore look at multiple segmentation outcomes from the same data set and choose between them using the following criteria:

- The overall SSE score: not necessarily the lowest, but certainly one of the lowest.
- The SSE score per segment: is the total SSE shared relatively evenly across the segments/clusters? We want to avoid outputs where one segment/cluster accounts for a relatively high proportion of the total SSE, so that all segments/clusters are relatively homogeneous.
- The number of cases/respondents allocated to each segment/cluster: we want to avoid situations where there is one overly large cluster or one very small cluster, as this suggests that the data does not suit this number of clusters and that we may need to look at more or fewer clusters.
- The logic of the clusters: can we make sense of each of them, based on their central means? Can we give them a simple description and paint a picture of them in words?
- And finally, and probably most importantly, whether the clusters/segments are actionable. Remember that the goal is to work with these clusters/segments in business decisions.
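The first three criteria are numeric, so they can be tabulated across several runs. The following is a minimal sketch in Python rather than the Excel template; the `kmeans` and `summarize` functions, the seeds, and the `data` set are hypothetical names and values used only for illustration. The last two criteria (logic and actionability) still need human judgment.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed, max_iter=100):
    """Plain k-means started from k randomly chosen data points."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # converged: no case changed cluster
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

def summarize(points, centers, labels):
    """Total SSE, each cluster's share of it, and the cluster sizes."""
    k = len(centers)
    sse = [0.0] * k
    sizes = [0] * k
    for p, lab in zip(points, labels):
        sse[lab] += dist2(p, centers[lab])
        sizes[lab] += 1
    total = sum(sse)
    shares = [s / total if total else 0.0 for s in sse]
    return {"total_sse": total, "sse_share": shares, "sizes": sizes}

# Hypothetical data: 11 respondents scored on two variables.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
        (9.0, 1.0), (9.1, 1.2), (8.9, 0.9), (5.0, 1.0), (5.1, 3.0)]

# Run the analysis ten times with different random starts.
runs = []
for seed in range(10):
    centers, labels = kmeans(data, k=3, seed=seed)
    runs.append((seed, summarize(data, centers, labels)))

# List the runs from lowest to highest total SSE for side-by-side review.
runs.sort(key=lambda r: r[1]["total_sse"])
for seed, s in runs:
    print(f"seed={seed} total_sse={s['total_sse']:.2f} "
          f"share={[round(x, 2) for x in s['sse_share']]} sizes={s['sizes']}")
```

A run near the top of the list with reasonably even SSE shares and no very small or very large cluster would be a candidate to examine further against the remaining criteria.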

Why not have fixed seeds in the clustering template?

It is indeed possible to configure the Excel template to select the same starting point for its optimization analysis. While this would produce the same cluster configuration on every run, it would unfortunately only ever look at one solution, which may or may not be a suitable one.
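The trade-off can be illustrated with a short sketch of seeded random selection. This is Python, not the template's Excel configuration, and the `pick_starting_centers` function and `data` set are hypothetical.

```python
import random

# Hypothetical data: 11 respondents scored on two variables.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
        (9.0, 1.0), (9.1, 1.2), (8.9, 0.9), (5.0, 1.0), (5.1, 3.0)]

def pick_starting_centers(points, k, seed=None):
    """Choose k starting centers; a fixed seed makes the choice repeatable."""
    return random.Random(seed).sample(points, k)

# A fixed seed always yields the same starting centers, hence the same solution.
fixed_a = pick_starting_centers(data, 3, seed=42)
fixed_b = pick_starting_centers(data, 3, seed=42)
print(fixed_a == fixed_b)  # True

# Varying the seed explores different starting points and so different solutions.
varied = {tuple(pick_starting_centers(data, 3, seed=s)) for s in range(20)}
print(len(varied))  # number of distinct starting sets seen across 20 seeds
```

Repeatability is useful for auditing a chosen solution, but only varied starts let us compare alternatives.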

As the list of questions above shows, clusters need to be assessed against several criteria. If the Excel template only ever provided one outcome, chances are it would not be the best solution for us to use.
