Join 1,000s of professionals who are building real-world skills to truly harness the power of AI technologies like Microsoft's Copilot in Excel.

Issue #25 - Machine Learning with Python in Excel Part 5: Decision Trees

This Week’s Tutorial

This week's issue is the fourth tutorial of a series demonstrating how Microsoft is positioning Excel as the DIY data science platform of the future.

This week's tutorial builds upon Part 3 and Part 4 of the series - be sure to check them out if you haven't already read them.

If you want to follow along by writing code (highly recommended), you can get the PythonInExcelML.xlsx workbook from the newsletter GitHub repository.

On to the tutorial...

Part 3 of this series demonstrated one of the most useful of all DIY data science skills - cluster analysis.

The value of cluster analysis skills are two-fold:

  1. Cluster analysis is useful for any professional working in non-profit, government, healthcare, or for-profit organizations. It'a truly a universal skill.

  2. Interpreting cluster analyses for business stakeholders is still a human-centric skill: even in the age of AI.

While in the past this sort of work was firmly in the domain of technical roles like Data Analysts/Scientists, Python in Excel has changed everything.

For example, the Copilot in Excel AI will often generate cluster analysis Python code in response to a prompt.

However, Copilot isn't very good at interpreting the clusters. This is an opportunity for you to stand out at work.

Today's tutorial will focus on using a decision tree machine learning predictive model to interpret cluster analyses.

The first step is to remove the cluster assignments added in Part 4:

One of the best things about machine learning algorithms based on decision trees is that you don't have to scale your numeric features.

So, the decision tree can be trained using the original customer_behavior dataset. This allows for easy interpretation of the decision tree model for your business stakeholders.

The following code trains the decision tree model:

The above code demonstrates three important ideas covered in previous newsletter tutorials:

  • Humans prefer explanations using up to 3 factors/criteria. That's why the above code sets the depth of the decision tree model to 3.

  • To ensure reproducibility if you're following along, the random_state hyperparameter is set (BTW - 12345 is an arbitrary value).

  • The model is trained to predict the cluster assignments.

This last bullet deserves a bit of additional explanation.

If you can train a machine learning model that is both interpretable and accurately predicts cluster assignments, then it can be useful for communicating cluster findings with your business stakeholders.

Let's update the above code to include scoring the model's predictions:

The output above shows that the model is 84.35% accurate at predicting cluster labels.

This output shows the tradeoff of using ML models for cluster interpretation:

  • More complex models (e.g., a tree 4 levels deep) will usually be more accurate.

  • More complex models lead to more complex interpretations - which might not resonate so well with your business stakeholders.

My experience has been that a decision tree model 3 levels deep with 84% accuracy is a reasonable tradeoff and helps to provide useful cluster interpretations.

Decision trees can be very useful for interpreting a clusters because, at 3 levels deep, their visualizations are easily understood.

The following code creates a visualization of the decision tree model:

Clicking PngImageFile > in Excel's Python Editor displays the decision tree model:

Unfortunately, this image is too small to be useful. We can ask Python in Excel to output the image directly to the worksheet so we can enlarge it:

Selecting Excel Value from the Output drop-down in the Python Editor will then output the image in the worksheet cell:

Luckily, Excel allows us to enlarge the image quite easily by right-clicking the output cell:

The image is now enlarged and easy to read:

While the image looks small here, it's quite large in the Excel worksheet. 😁

The best way to use a decision tree to interpret clusters is to examine the leaf nodes at the bottom of the tree. What you're looking for are leaf nodes that are predominantly one of the clusters.

The leaf node at the lower left of the decision tree is a perfect example:

As covered in Part 3 of the tutorial, there were 4 clusters mined from the data labeled 0, 1, 2, and 3.

Here's how you read the decision tree visualization to determine the cluster counts in the leaf nodes:

The lower left leaf node has the following counts:

  • 931 instances of Cluster 0.

  • 0 instances of Cluster 1.

  • 5 instances of Cluster 2.

  • 0 instances of Cluster 3.

This leaf node can provide us with a powerful interpretation of which customers where assigned to Cluster 0.

To do this, we need to look at all the decision nodes that lead to this leaf node:

The vast majority of customers assigned to Cluster 0 had the following behaviors:

  • TotalFoodPurchases less than or equal to 231.

  • NumDealPurchases of 5 or less.

  • NumWebPurchases of 4 or less.

In natural language, you could explain Cluster 0 like this:

"Customers in belonging to Cluster 0 typically have lower food purchases and don't take advantage of our deals or website when they do purchase."

DIY data scientists use cluster analysis to uncover insights like this for:

  • Employees

  • Customers

  • Products

  • Patients

  • Claims

  • Users

The list is endless!

This Week’s Book

If you're unfamiliar with cluster analysis and decision trees, check out this free online book. Don't let the "Statistical Learning" part of the title fool you, it's what statisticians call machine learning 🤣:

This book is a great introduction to many ML techniques including k-means clustering, decision trees, and the mighty random forest.

That's it for this week.

Next week's newsletter will start a new tutorial series on one of my favorite data analysis techniques you probably never heard of - market basket analysis.

Stay healthy and happy data sleuthing!

Dave Langer

Whenever you're ready, there are 4 ways I can help you:

1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required, and Copilot in Excel AI prompts are included!

2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. This is where my self-paced online course teaches you how to extract insights from your unlabeled data. Copilot in Excel AI prompts are included!

3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.

4 - Is machine learning right for your business, but don't know where to start? Check out my Machine Learning Accelerator.