Join 1,000s of professionals who are building real-world skills to truly harness the power of AI technologies like Microsoft's Copilot in Excel.

I won't send you spam. Unsubscribe at any time.

Issue #23 - Machine Learning with Python in Excel Part 3: Customer Segmentation

This Week’s Tutorial

This week's issue is the third tutorial of a series demonstrating how Microsoft is positioning Excel as the DIY data science platform of the future.

Please read Parts 1 & 2 of this tutorial series if you haven't already.

If you want to follow along by writing code (highly recommended), you can get the PythonInExcelML.xlsx workbook from the newsletter GitHub repository.

On to the tutorial...

When you segment your data, you attempt to craft groupings that provide valuable insights to change business processes/behaviors.

A classic example comes from marketing - customer segmentation.

Customer segmentation is used by marketing teams to:

Improve existing advertising campaigns.
Inspire new marketing strategies.
Optimize ad copy.

However, you would be making a big mistake if you think segmentation is only useful in marketing. Here some real-world segmentation examples:

Patients in healthcare.
Claims in insurance.
Security in IT.

Skills with segmentation are wildly useful for any professional - especially in the age of AI.

For example, Copilot in Excel will regularly suggest performing segmentation when you are using Copilot's advanced analysis mode.

In machine learning, segmentation falls into the class of unsupervised learning algorithms. More specifically, clustering algorithms.

To continue the Copilot in Excel example, I've spent hours testing it to see how useful it might be for real-world DIY data science.

In all of my testing, when Copilot in Excel performs a segmentation on your behalf, it defaults to using the k-means clustering algorithm.

The reasons for this aren't surprising. K-means is:

Easy to understand, requiring only middle school math.
It's surprisingly effective across industries and business domains.

Because k-means is so wildly useful, it's why I teach it in my Cluster Analysis with Python online course.

To see k-means clustering in action using Python in Excel, the first step is to load the dataset:

NOTE - While this tutorial uses a customer segmentation example, everything covered applies to any segmentation you might perform (e.g., healthcare).

The info() method from the DataFrame class provides useful summary information about the dataset:

Two things in the above output are critical for performing segmentation using k-means:

All the columns (i.e., features) are numeric.
There is no missing data.

I'll address the second bullet first. The k-means algorithm requires a complete (i.e., no missing data) dataset. Even if a single cell in dataset is missing, k-means will error out.

Regarding the first bullet, k-means relies on calculations that only work with numeric data. This is a common requirement of clustering algorithms (e.g., DBSCAN).

However, it's common in real-world DIY data science to have both numeric and categorical features. This is known as mixed data.

If you need to segment mixed data, you need to use specialized techniques for algorithms like k-means to work correctly.

My Cluster Analysis with Python online course will teach you these techniques.

Next, we should check the data to see if anything jumps out:

The above Python formula is set to output to the Excel worksheet. Here's a subset of the DataFrame output:

Compare the min and max rows across the columns. Notice how much larger Income values are compared to the other two features?

The technical terms for this is scale. The Income feature is on a different scale compared to the other features.

Numeric features on different scales are usually a problem for clustering algorithms like k-means. The solution is to process the data so that all the features are on a similar scale.

The most common way to process the data is to use the StandardScaler class from the scikit-learn library:

The StandardScaler class calculates the Z-score for each of the numeric features.

If you're unfamiliar, here's a summary of the Z-score:

Calculate the mean (average) value for each feature.
Calculate the standard deviation for each feature.
Subtract the mean from each feature value.
Divide by the standard deviation.

The StandardScaler class transforms the data to improve the results of k-means clustering. The transformed data will appear to lose its business meaning:

Clicking the card in the Excel worksheet displays the above preview of the transformed DataFrame.

Notice how the values are all similar, but won't make sense to a business stakeholder?

Don't worry, you never show the transformed data to your stakeholders, only the results of the clustering using the transformed data.

With the data transformed (i.e., standardized), it can now be segmented using k-means clustering:

To use k-means effectively, it needs to be optimized (i.e., tuned) for your dataset.

In the case of k-means, tuning takes the form of choosing an optimal number of clusters for k-means to mine from your dataset.

The above code uses the techniques taught in my Cluster Analysis with Python online course to select 4 as the number of clusters to be mined.

The last line of code accesses the labels_ attribute of the k_means object. This attribute contains the cluster assignment for each row of the dataset.

By adding the cluster assignments to transformed DataFrame, we can see how many rows are assigned to each cluster:

Notice how the cluster assignments are just numbers (e.g., 0 and 1)?

Clustering algorithms like k-means only find the clusters. They do not understand the business context of the data or the mined clusters.

Therefore, clustering algorithms cannot interpret the clusters for insights that will resonate with your business stakeholders.

This is where you, as the DIY data scientist, are critical to the process - whether you use an AI like Copilot in Excel or not.

With the data assigned to clusters, the next step in segmentation is interpretation.

This is where techniques like visual data analysis come into play to help you understand the clusters in business terms and communicate insights to your stakeholders.

This Week’s Book

If you're looking for a book covering unsupervised learning techniques using Python, this book is a good choice:

One feature of this book is that it not only covers the algorithms you use 99% of the time (e.g., k-means and DBSCAN), but it also covers more advanced techniques (e.g., t-SNE).

That's it for this week.

Next week's newsletter will demonstrate using Python in Excel to conduct powerful visual data analyses to interpret the clusters found in today's tutorial.

Stay healthy and happy data sleuthing!

Dave Langer

Whenever you're ready, there are 4 ways I can help you:

1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required, and Copilot in Excel AI prompts are included!

2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. This is where my self-paced online course teaches you how to extract insights from your unlabeled data. Copilot in Excel AI prompts are included!

3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.

4 - Is machine learning right for your business, but don't know where to start? Check out my Machine Learning Accelerator.