Lecture 8: Clustering
In a previous notebook we looked at the classifier problem. We knew which species a flower could be, and we built a model to predict the species of an observation based on its characteristics. This is what's known as a supervised machine learning problem.
In this notebook we look at the clustering problem: we do not know how many species there are, and we want to group observations into clusters based on how similar the observations are. This is what is known as an unsupervised machine learning problem. In unsupervised learning, our job is to collapse high-dimensional data into something more manageable.
I've found clustering to be useful for segmenting consumer markets (both in characteristics and geography) to simplify demand estimation. Others find clustering useful to identify bots on Twitter, find blocks of voters that might be on the fence, match people on dating websites, or group incoming freshmen into dorms based on preferences.
We'll start our discussion using something called "K-Means" to identify clusters in the data since it's pretty simple and intuitive. At the end, we'll enter the Wild, Wild West and expand our analysis to different clustering ideas. The agenda for today's lecture is as follows:
Class Announcements
PS3 is on eLC and is due Sunday, November 13, 2022. Usual office hours this week and, of course, by appointment (ie, send me an email and we'll find a time to talk).
1. Why Clusters? An Example ¶
Clustering is useful because it enables us to reduce the dimensionality of large data sets; we can make Big Data potentially small. We'll start with our iris data set because we know that the data are indeed clustered around species. What if someone gave us similar data that didn't include the species variable? Could we still identify these groupings?
We can cluster on many characteristics, but if we use only two, we can visualize them in a scatter plot. Let's use the petal measurements.
We are going to pretend that we do not know the species of each observation. We will use the species information later to see how well the cluster method worked. Note that our approach here is similar to when we introduced econometrics (OLS) and started by generating the true Data Generating Process (DGP). This approach enabled us to know "the truth" so we could evaluate how well our approach worked absent specification error. Same idea here.
Let's load the data and take a look to remind ourselves what we have.
import pandas as pd
import sklearn.cluster
import matplotlib.pyplot as plt
import seaborn as sns
iris = pd.read_csv('./Data/iris.csv')
iris.head(4)
| | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
Notice that I made all the dots the same color. I am pretending that I do not know which species each dot belongs to. In fact, I am pretending that I do not even know how many species there are.
I see two distinct blobs. Define $k$ as the number of clusters we observe in the data. From the picture, it seems reasonable to guess there are two clusters (ie, $k=2$) in this data but of course we know there are three. Let's fit the model based on our (bad) guess.
Like we saw in the classifier problem, we need numpy arrays to work with sklearn. Below, I convert the two characteristics to numpy arrays.
X = iris[['petal width', 'petal length']].to_numpy()
K-Means Clustering¶
K-means clustering generates clusters by looking for the centers of circles of data. The algorithm varies the location of the cluster centers until it has minimized the average distance from the center of each cluster to its members. The recipe is therefore:
- Select a number of classes/groups to use and randomly initialize their respective center points. To figure out the number of classes to use, it's good to take a quick look at the data and try to identify any distinct groupings (ie, what we did above).
- Each data point is classified by computing the distance between that point and each group center. Each point is then assigned to the group whose center is closest to it.
- We then recompute the group center by taking the mean of all the vectors in the group.
- Repeat these steps for a set number of iterations or until the group centers don’t change much between iterations. You can also opt to randomly initialize the group centers a few times, and then select the run that looks like it provided the best results.
K-Means has the advantage that it's fast (and intuitive), since we're only really computing the distances between points and group centers. Of course, there is no free lunch, and we'll see that K-Means has disadvantages. BTW, there is also a K-Medians algorithm, which is less sensitive to outliers but also slower.
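To make the recipe concrete, here is a minimal from-scratch sketch of the algorithm. It is for illustration only -- we'll use sklearn's implementation below -- and the function name kmeans_naive, the fixed iteration count, and the lack of empty-cluster handling are my own simplifications, not part of any library.
import numpy as np

def kmeans_naive(X, k, n_iter=20, seed=0):
    """Bare-bones K-means: pick random centers, assign points, re-average, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k observations as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each center as the mean of its members
        # (an empty cluster would produce a NaN center; not handled in this sketch)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Example usage on the petal data defined above:
# labels, centers = kmeans_naive(X, k=2)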
# The random_state option ensures we all get the same answer.
kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0).fit(X)
The object returned from the model contains (among other things) the coordinates of the centers. They are numpy arrays. I will convert them to a dataframe.
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['y', 'x'])
Let's graph the results: the raw data are blue dots as before, and I'll put a red star on each of the centers found above.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
This is about what I expected from 'eyeballing' it.
We know that there are three types of irises in the data set. Let's try 3 clusters and graph the results.
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['y', 'x'])
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
Do these three centers make sense? Let's grab the predicted values (the `labels_`) for each data point and plot them. Then I will merge them onto the original DataFrame.
pred = pd.DataFrame(kmeans.labels_, columns=['predicted'])
iris = pd.merge(left=pred, right=iris, left_index=True, right_index=True)
Let's take a look to make sure things merged correctly:
iris.sample(10)
| | predicted | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|---|
| 52 | 2 | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
| 124 | 1 | 6.7 | 3.3 | 5.7 | 2.1 | Iris-virginica |
| 47 | 0 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
| 121 | 1 | 5.6 | 2.8 | 4.9 | 2.0 | Iris-virginica |
| 20 | 0 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa |
| 91 | 2 | 6.1 | 3.0 | 4.6 | 1.4 | Iris-versicolor |
| 120 | 1 | 6.9 | 3.2 | 5.7 | 2.3 | Iris-virginica |
| 61 | 2 | 5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor |
| 46 | 0 | 5.1 | 3.8 | 1.6 | 0.2 | Iris-setosa |
| 134 | 1 | 6.1 | 2.6 | 5.6 | 1.4 | Iris-virginica |
So what happened here? The clustering algorithm identified three clusters of data and labeled them 0, 1, and 2. We know the data are grouped by species, so let's take a look at how the categories line up with species.
iris.groupby('species')['predicted'].mean()
species
Iris-setosa        0.00
Iris-versicolor    1.96
Iris-virginica     1.08
Name: predicted, dtype: float64
#iris['predicted'] = iris['predicted'].replace({0:'setosa', 1:'versicolor', 2:'virginica'})
fig, ax = plt.subplots(1,2,figsize=(15,6))
sns.scatterplot(ax=ax[0], x='petal length', y='petal width', hue='predicted', data=iris,palette=['#377eb8', '#4daf4a', '#ff7f00'])
sns.scatterplot(ax=ax[1], x='petal length', y='petal width', hue='species', data=iris)
ax[0].scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax[0].set_title('Predicted Clusters',size=20)
ax[1].set_title('Actual Clusters',size=20)
ax[0].legend(frameon=False)
ax[1].legend(frameon=False)
sns.despine(ax=ax[0])
sns.despine(ax=ax[1])
plt.show()
We observe in the groupby data and the pictures that the algorithm does a good job. The setosa cluster is perfect — which is not surprising. The split between the other two types is decent but not perfect since these data overlap more.
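If you want an exact tally rather than an eyeball comparison, a cross-tabulation of the actual species against the predicted labels works nicely. A quick sketch using the merged DataFrame from above:
# Rows are the true species, columns are the K-Means labels (0, 1, 2)
pd.crosstab(iris['species'], iris['predicted'])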
2. Choosing the Number of Clusters: The Elbow Graph ¶
When I looked at the data and pretended that I did not know how many species there were, I thought there were two clusters. When I added a third, it did a pretty good job matching the data.
But the whole point of these models is that we do not know how many clusters there are. So how do we pick the number of clusters?
The short answer: try a bunch and see what happens.
We can loop over the number of clusters and keep track of the sum of squared distances between the cluster centers and the members of the clusters. If there are gaps in the data (ie, clusters), adding an additional cluster will reduce these distances dramatically. This sum-of-squared-distances measure is stored in the `inertia_` attribute. Let's see how this works:
y=[] # Start with an empty list to collect the inertia values
for n in range(1,8): # loop from 1 to 7 clusters
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
y
[550.6434666666667, 86.40394533571005, 31.387758974358977, 19.499400899685114, 13.933308757908756, 11.107174889156013, 9.225808730158729]
These numbers change a lot. Let's look at them in a picture:
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,8), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
We see that moving from 1 to 2 clusters made a big improvement. ML people call this the "elbow" of the graph because the figure looks like an arm, I guess. Moving from 2 to 3 clusters made less of an improvement.
We can always lower the sum of squared distance (ssd) by increasing k. When k is the same as the number of observations, then each cluster has exactly one observation in it and the ssd is zero. But that isn't very helpful. As economists, we would say that there are diminishing returns to k.
From this figure, I would choose either k=2 or k=3. After that, we do not get much improvement in the ssd.
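Since the elbow is ultimately a judgment call, it can be worth cross-checking with a second heuristic. One common option (not used elsewhere in this notebook) is the silhouette score, which rewards tight, well-separated clusters; higher is better, and it is only defined for k of at least 2. A hedged sketch on the same petal data:
from sklearn.metrics import silhouette_score

# Higher silhouette scores indicate tighter, better-separated clusters
for n in range(2, 8):
    labels = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X).labels_
    print(n, round(silhouette_score(X, labels), 3))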
Practice¶
Which Minnesota counties are similar to which Wisconsin counties? Perhaps your company has spent a lot of money researching Wisconsin as a marketplace. Now it wants to expand into Minnesota. Can we find counties in Minnesota that are similar to counties in Wisconsin? Let's use our clustering model to see.
- Load the file `counties.csv`. The variables are:
  - `income`: income in thousands of USD, for 2018
  - `pop`: population (number of persons), for 2018
  - `ALAND`: area of the county in square meters
counties = pd.read_csv('./Data/counties.csv')
counties.sample(5)
| | GEOID | Description | income | pop | ALAND |
|---|---|---|---|---|---|
| 23 | 27047 | Freeborn, MN | 1394588.0 | 30444.0 | 1831823843 |
| 151 | 55127 | Walworth, WI | 5188239.0 | 103718.0 | 1438568339 |
| 143 | 55111 | Sauk, WI | 3107378.0 | 64249.0 | 2153686396 |
| 14 | 27029 | Clearwater, MN | 387647.0 | 8810.0 | 2586701770 |
| 10 | 27021 | Cass, MN | 1400484.0 | 29519.0 | 5235592037 |
- Create an income per capita variable: income/pop
counties['income_cap'] = counties['income']/counties['pop']
- Create a density variable: pop/(ALAND/1000000). Note: dividing `ALAND` by 1,000,000 converts this measure to people per square kilometer.
counties['density'] = counties['pop']/(counties['ALAND']/1000000)
- Make a scatter plot of income per capita and density. Do you see a relationship? Would a linear model be helpful here?
fig, ax = plt.subplots(figsize=(15,7))
ax.scatter(counties['density'], counties['income_cap'], color='blue')
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
sns.despine(ax=ax)
plt.show()
- Compute the elbow graph using `density` and `income_cap` as the variables. Try 1 through 10 possible clusters. Remember, you need to convert your DataFrame data to numpy arrays.
# Convert to numpy
X = counties[['density', 'income_cap']].to_numpy()
# Try the model with 1, 2, ..., 10 clusters. Keep track of the ssd ('inertia')
y=[]
for n in range(1, 11):
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,11), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
- Based on your elbow graph, how many clusters seem appropriate?
Answer: It looks like 3 clusters is reasonable.
Let's explore our results some more.
- Run the kmeans model with k=3
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
- Retrieve the `labels_` from the results.
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
- Merge the labels onto your original DataFrame, using the index as the merge key.
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
detail.head(2)
| | GEOID | Description | income | pop | ALAND | income_cap | density | cluster |
|---|---|---|---|---|---|---|---|---|
| 0 | 27001 | Aitkin, MN | 632584.0 | 15902.0 | 4718433597 | 39.780153 | 3.370186 | 2 |
| 1 | 27003 | Anoka, MN | 18135923.0 | 353813.0 | 1092919450 | 51.258498 | 323.732001 | 0 |
- Plot `income_cap` and `density` again using seaborn's scatterplot. Set the hue to the `labels_` you retrieved. What patterns do you see? You might add labels to the points so you can tell which counties are which.
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', data=detail, palette=['#377eb8', '#ff7f00', '#4daf4a'])
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
sns.despine(ax=ax)
ax.legend(frameon=False)
plt.show()
- Redo 7 through 10 with k=4. How do the results change?
kmeans = sklearn.cluster.KMeans(n_clusters=4, random_state=0).fit(X)
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', data=detail, palette=['#377eb8', '#ff7f00', '#4daf4a','#f781bf'])
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
Answer: Hennepin, MN becomes its own cluster.
Comments¶
As I mentioned above, K-Means has a couple of disadvantages.
- You have to select how many groups/classes there are. The elbow graph helps, but there's no statistical test to lean on -- or at least no agreed best practice across researchers. This isn't always trivial, and ideally a clustering algorithm would figure this out for us, since the whole point is to gain insight from the data.
- K-means starts with a random choice of cluster centers, so it may yield different clustering results on different runs of the algorithm. The results may not be repeatable and can lack consistency; other clustering methods are more consistent (see the sketch after this list).
- K-means fundamentally draws circles around the data. What if circles are not the best way to cluster the data? Look at the scatterplots in the practice exercise. The data near zero density look like a column (ie, a lot of heterogeneity in income per capita), while there appears to be a positive relationship between density and income per capita for densities between roughly 25 and 200. It seems like these two groups should be different clusters, but K-Means has difficulty doing that: it looks for centers of circles and will therefore choose a center in the middle of both groups in order to capture both.
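To see point 2 in action, refit the county data with a single random start per run (n_init=1) and different seeds, then compare the resulting inertia. The seeds below are arbitrary; with sklearn's default of multiple restarts this instability is much less of an issue.
# With only one random start per fit, different seeds can land on different local optima
for seed in [0, 1, 2, 3]:
    km = sklearn.cluster.KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print('seed', seed, 'inertia:', round(km.inertia_, 1))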
3. Other Clustering Methods ¶
3.1 Mean-Shift Clustering ¶
Mean shift clustering is a sliding-window-based algorithm that attempts to find dense areas of data points. It is a centroid-based algorithm meaning that the goal is to locate the center points of each group/class, which works by updating candidates for center points to be the mean of the points within the sliding-window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of center points and their corresponding groups.
In contrast to K-means clustering, there is no need to select the number of clusters as mean-shift automatically discovers this -- that's a nice advantage. The fact that the cluster centers converge towards the points of maximum density is also quite desirable as it is quite intuitive to understand and fits well in a naturally data-driven sense.
from sklearn.cluster import MeanShift, estimate_bandwidth
# The following bandwidth can be automatically detected using
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
clusters = pd.DataFrame(ms.labels_, columns=['cluster'])
col = ['#377eb8', '#ff7f00', '#4daf4a','#f781bf','k']
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', palette=col, s=50, data=detail)
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
The algorithm selected five clusters. Note that the increase in clusters was because it designated Hennepin, Ramsey, and Milwaukee as different. We know that Milwaukee is different -- Alice Cooper taught us that.
Was this a good clustering choice for our data? I would argue "no."
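Mean-shift does have one knob, the bandwidth, and the county results are fairly sensitive to it. A quick, hedged way to see this is to vary the quantile passed to estimate_bandwidth (the values below are arbitrary) and count the clusters it finds:
# Larger quantiles -> larger bandwidth -> fewer clusters
for q in [0.1, 0.2, 0.3, 0.5]:
    bw = estimate_bandwidth(X, quantile=q, n_samples=500)
    ms_q = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
    print('quantile', q, '-> clusters:', len(ms_q.cluster_centers_))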
3.2 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)¶
DBSCAN is a form of nearest-neighbor clustering algorithm that you commonly hear being applied to data. The rough idea is that you work through the data and keep adding points to a cluster until you reach points that are sufficiently different. It is a density-based clustering algorithm similar to mean-shift, but with a couple of notable advantages. First, it does not require a pre-set number of clusters at all. The central component of DBSCAN is the concept of core samples -- samples that are in areas of high density. A "cluster" is therefore a set of core samples that are close to each other (measured by some distance measure). Any core sample is part of a cluster, by definition. Any sample that is not a core sample, and is at least `eps` in distance from any core sample, is considered an outlier. Second, it identifies outliers as noise (which may be useful in our case), whereas k-means and mean-shift simply throw them into a cluster even if the data point is very different. Third, it can find arbitrarily sized and arbitrarily shaped clusters quite well.
- On the downside, it can struggle when there is a lot of variation in cluster density.
There are two parameters to the algorithm: `min_samples` and `eps`. These formally define what we mean when we say "dense."
The parameter `min_samples` primarily controls how tolerant the algorithm is towards noise (on noisy and large data sets it may be desirable to increase this parameter), though results are not terribly sensitive to it, so it's not very important IMO. The parameter `eps`, however, is crucial to choose appropriately for the data set. It controls the local neighborhood of the points. When chosen too small, most data will not be clustered at all (and labeled as `-1` for "noise"). When chosen too large, close clusters get merged into one, and eventually the entire data set is returned as a single cluster. Some heuristics for choosing this parameter have been discussed in the literature, for example based on an elbow in the nearest-neighbor distances plot. Let's give it a rip in our county data.
Solving for the optimal `eps`: Another Elbow¶
We can calculate the distance from each point to its closest neighbor using the `NearestNeighbors` package. The point itself is included in `n_neighbors`. The `kneighbors` method returns two arrays, one containing the distance to the closest `n_neighbors` points and the other containing the index of each of those points. As with our previous elbow graph, we're looking for an inflection point where changing `eps` really has no effect on the distances.
import numpy as np
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)
# Sort the results
distances = np.sort(distances, axis=0)
distances = distances[:,1]
fig, ax = plt.subplots(figsize=(15,6))
plt.plot(distances, color='blue', marker='.')
ax.set_xlabel('points sorted by distance to nearest neighbor')
ax.set_ylabel('distance (candidate eps)')
sns.despine(ax=ax)
Looks like a lot of curvature when `eps` equals 100. Let's try that.
from sklearn.cluster import DBSCAN
# Compute DBSCAN with eps = 100
db = DBSCAN(eps=100, min_samples=5).fit(X)
clusters = pd.DataFrame(db.labels_, columns=['cluster'])
col = ['#377eb8', '#ff7f00', '#4daf4a']
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', s=50, data=detail)
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
Hmmm... not great. In fact, playing around with `eps` demonstrates that we're only shifting which observations are noise. The clusters we observe with our eyeballs are just too heterogeneous for this method to work well (see the "downside" discussion above).
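Here is roughly what that "playing around" looks like: loop over a few eps values (arbitrary choices on my part) and count how many clusters and how many noise points (label -1) DBSCAN returns.
# Tally clusters and noise points for a few candidate eps values
for eps in [50, 100, 200, 400]:
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = (labels == -1).sum()
    print('eps', eps, '-> clusters:', n_clusters, 'noise points:', n_noise)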
3.3 Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM).¶
One of the major drawbacks of K-Means is its naive use of the mean value for the cluster center and its insistence on identifying clusters with circles. The work-around is to relax the circle assumption and use Gaussian Mixture Models (GMMs), which give us more flexibility. With GMMs we assume that the data points are Gaussian distributed; this is a less restrictive assumption than saying they are circular. That way, we have two parameters to describe the shape of the clusters: the mean and the standard deviation. In two dimensions, this means that the clusters can take any kind of elliptical shape (since we have a standard deviation in both the x and y directions). Each Gaussian distribution is assigned to a single cluster. Hence, we could use this clustering approach to identify the different clusters in our practice data. To find the parameters of the Gaussian for each cluster (eg, the mean and standard deviation), we use an optimization algorithm called Expectation-Maximization (EM).
This seems like just the thing we'd like to use with our county data. Let's give it a try using four clusters.
from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=4, random_state=0).fit(X)
clusters = pd.DataFrame(gm.predict(X), columns=['cluster'])
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', palette=['#377eb8', '#ff7f00', '#4daf4a','#f781bf'], s=50, data=detail)
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
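A nice side benefit of GMMs is that, because they are probabilistic models, we can lean on an information criterion instead of eyeballing an elbow when choosing the number of components. A hedged sketch using sklearn's built-in BIC (lower is better); the range tried below is arbitrary:
# Compare BIC across candidate numbers of mixture components
for n in range(1, 7):
    gm_n = GaussianMixture(n_components=n, random_state=0).fit(X)
    print(n, 'BIC:', round(gm_n.bic(X), 1))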
3.4 Agglomerative Hierarchical Clustering¶
Hierarchical clustering algorithms fall into two categories: top-down or bottom-up. Bottom-up algorithms treat each data point as a single cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all data points. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering (HAC). This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. Cute, right? I had no idea what a dendrogram was before I started working in ML. Here is the wiki page.
Note that this technique can be particularly useful for "feature reduction," which is popular in finance (where this type of analysis is known as "Principal Component Analysis"). Fundamentally, it's a useful dimensionality-reduction tool, which in ML lingo falls under "unsupervised learning."
Let's try it with our data...
Picking the Number of Clusters: The Dendrogram¶
First, we have to decide how many clusters to designate. As before, we can evaluate the data directly to look for gaps. Instead of an elbow, though, we can use a dendrogram to visualize the history of groupings and figure out the optimal number of clusters. We identify the optimal number of clusters as follows:
- Determine the largest vertical distance that doesn't intersect any of the other clusters.
- Draw a horizontal line between the extremities.
- The optimal number of clusters is equal to the number of vertical lines going through the horizontal line.
import scipy.cluster.hierarchy as sch # Used to create a dendrogram
fig, ax = plt.subplots(figsize=(15,6))
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'),ax=ax)
# Different methods to measure linkages / distance
# Ward - minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
# Maximum or "complete" - minimizes the maximum distance between observations of pairs of clusters.
# Average - minimizes the average of the distances between all observations of pairs of clusters.
# Single - minimizes the distance between the closest observations of pairs of clusters.
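The "largest vertical gap" rule can also be read off the linkage matrix directly: its third column holds the distance at which each merge happens, and a big jump between successive merges suggests a natural place to cut the tree. A rough sketch, assuming the same Ward linkage as above:
Z = sch.linkage(X, method='ward')
merge_dist = Z[:, 2]                # distance at which each successive merge happens
gaps = np.diff(merge_dist)          # jump between consecutive merge distances
# Cutting the tree just before the biggest jump leaves this many clusters
print('suggested clusters:', len(merge_dist) - np.argmax(gaps))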
Based on the above recipe, I should choose two clusters but I'm going to instead choose three because I know the outliers will occupy one cluster and I want to focus my (potential) analysis on the remaining counties. Equivalently, I could just drop the outlier counties.
from sklearn.cluster import AgglomerativeClustering
AHC = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward').fit(X)
clusters = pd.DataFrame(AHC.labels_, columns=['cluster'])
col = ['#377eb8', '#ff7f00', '#4daf4a']
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', s=50, palette=col, data=detail)
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
That seems to work pretty well too. These groups seem pretty natural.
Summary of Different Approaches¶
The figure below is from the scikit clustering documentation and demonstrates how different clustering approaches (some I have not discussed) do for different variations in the (pretend) data. The numbers in the bottom-right present computation time which may be an issue as your data become larger and clustering algorithm becomes more complex.
4. When Might This be Useful? ¶
We'll see that clustering will be very useful when we have too many features to work with in a tractable way (our preferred ML model takes forever to fit or requires too much computing power). Clustering can also be useful for constructing data clusters that make prediction (estimation, in the following example) better.
Suppose we had the following data on vehicle sales:
autos = pd.read_stata('./Data/cars.dta')
# Let's create some new variables
segment_dummies = pd.get_dummies(autos['SEGMENT'])
autos = pd.concat([autos, segment_dummies], axis=1)
autos.rename(columns={1:'SMALL', 2:'COMPACT', 3:'SEDAN', 4:'LUXURY', 5:'MINIVAN'},inplace=True)
# Imported Cars
autos['NONEURO'] = 0
autos.loc[(autos['ORIG']>1),'NONEURO'] = 1
# Rename "Fuel" "Diesel" since this is an indicator where 1 => diesel engine
autos.rename(columns={'FUEL':'DIESEL'},inplace=True)
# Apply logs so we can estimate elasticity
autos['LOGQ'] = np.log(autos['QUANTITY'])
autos['LOGP'] = np.log(autos['PRICE'])
autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1269 entries, 0 to 1268
Data columns (total 26 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   CODE       1269 non-null   int16
 1   ORIG       1269 non-null   int8
 2   FIRM_ID    1269 non-null   float32
 3   FIRM       1269 non-null   object
 4   BRAND      1269 non-null   object
 5   MODEL      1269 non-null   object
 6   YEAR       1269 non-null   int16
 7   PRICE      1269 non-null   float32
 8   QUANTITY   1269 non-null   float32
 9   HP         1269 non-null   float32
 10  LENGTH     1269 non-null   float32
 11  WIDTH      1269 non-null   float32
 12  SIZE       1269 non-null   float32
 13  WEIGHT     1269 non-null   float32
 14  DIESEL     1269 non-null   int8
 15  MPG        1269 non-null   float32
 16  FUELPRICE  1269 non-null   float32
 17  SEGMENT    1269 non-null   int8
 18  SMALL      1269 non-null   uint8
 19  COMPACT    1269 non-null   uint8
 20  SEDAN      1269 non-null   uint8
 21  LUXURY     1269 non-null   uint8
 22  MINIVAN    1269 non-null   uint8
 23  NONEURO    1269 non-null   int64
 24  LOGQ       1269 non-null   float32
 25  LOGP       1269 non-null   float32
dtypes: float32(12), int16(2), int64(1), int8(3), object(3), uint8(5)
memory usage: 123.9+ KB
And we wanted to estimate demand and are particularly interested in estimating price elasticity:
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
# OLS
auto_OLS1 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO + SMALL + SEDAN + LUXURY + MINIVAN', data=autos).fit(cov_type = 'HC3')
print(auto_OLS1.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.925
Model:                            OLS   Adj. R-squared:                  0.925
Method:                 Least Squares   F-statistic:                     1198.
Date:                Wed, 09 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:44:31   Log-Likelihood:                -1735.6
No. Observations:                1269   AIC:                             3491.
Df Residuals:                    1259   BIC:                             3543.
Df Model:                           9
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.2383      0.491     55.441      0.000      26.275      28.201
LOGP          -6.3574      0.135    -47.193      0.000      -6.621      -6.093
MPG           -0.0127      0.006     -2.036      0.042      -0.025      -0.000
HP            -4.7571      5.031     -0.946      0.344     -14.617       5.103
DIESEL        -0.4234      0.092     -4.585      0.000      -0.604      -0.242
NONEURO        0.3850      0.059      6.502      0.000       0.269       0.501
SMALL         -0.1625      0.100     -1.631      0.103      -0.358       0.033
SEDAN         -0.5817      0.077     -7.592      0.000      -0.732      -0.432
LUXURY        -1.7088      0.122    -14.051      0.000      -1.947      -1.470
MINIVAN       -1.2488      0.117    -10.683      0.000      -1.478      -1.020
==============================================================================
Omnibus:                      120.332   Durbin-Watson:                   1.625
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              155.702
Skew:                          -0.793   Prob(JB):                     1.55e-34
Kurtosis:                       3.655   Cond. No.                     7.13e+03
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 7.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The above (biased) OLS regression indicates that the price elasticity is about -6.36 (the coefficient on LOGP). Knowing the vehicle segment turns out to be important to understanding the price elasticity (okay, it's not a huge difference, but a difference nonetheless):
# OLS
auto_OLS2 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO', data=autos).fit(cov_type = 'HC3')
print(auto_OLS2.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.909
Model:                            OLS   Adj. R-squared:                  0.908
Method:                 Least Squares   F-statistic:                     1133.
Date:                Wed, 09 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:44:31   Log-Likelihood:                -1861.5
No. Observations:                1269   AIC:                             3735.
Df Residuals:                    1263   BIC:                             3766.
Df Model:                           5
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.0949      0.474     57.159      0.000      26.166      28.024
LOGP          -6.8719      0.127    -53.912      0.000      -7.122      -6.622
MPG            0.0261      0.005      5.649      0.000       0.017       0.035
HP           -16.4147      4.334     -3.787      0.000     -24.909      -7.920
DIESEL        -0.8058      0.090     -8.921      0.000      -0.983      -0.629
NONEURO        0.4518      0.066      6.871      0.000       0.323       0.581
==============================================================================
Omnibus:                      127.456   Durbin-Watson:                   1.781
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              167.280
Skew:                          -0.822   Prob(JB):                     4.74e-37
Kurtosis:                       3.677   Cond. No.                     5.98e+03
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 5.98e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Suppose we didn't know the segment information. There seems to be information embedded in `SEGMENT` that is correlated with log-price. Let's see if we can replicate vehicle segments using the other characteristics and K-means clustering.
X = autos[['HP','SIZE', 'WEIGHT', 'MPG']].to_numpy()
# Try the model with 1, 2, ..., 10 clusters. Keep track of the ssd ('inertia')
y=[]
for n in range(1, 11):
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,11), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
I read the elbow graph as saying we should stop at three clusters.
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
autos = pd.merge(left=autos, right=clusters, left_index=True, right_index=True, how='left')
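One caveat worth flagging: HP, SIZE, WEIGHT, and MPG live on very different scales, so the Euclidean distances K-Means uses are dominated by whichever feature has the biggest numbers. Standardizing first is a common fix. A hedged sketch (StandardScaler is not something we used above, and I'm not re-running the regressions with these alternative labels):
from sklearn.preprocessing import StandardScaler

# Rescale each characteristic to mean 0 and variance 1 before clustering
X_std = StandardScaler().fit_transform(X)
kmeans_std = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X_std)

# How often do the unscaled and scaled cluster labels agree? (label ids are arbitrary)
pd.crosstab(kmeans.labels_, kmeans_std.labels_)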
So now run OLS again with the clusters we built based on vehicle characteristics.
# OLS
auto_OLS3 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO + C(cluster)', data=autos).fit(cov_type = 'HC3')
print(auto_OLS3.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.916
Model:                            OLS   Adj. R-squared:                  0.916
Method:                 Least Squares   F-statistic:                     1115.
Date:                Wed, 09 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:44:33   Log-Likelihood:                -1805.1
No. Observations:                1269   AIC:                             3626.
Df Residuals:                    1261   BIC:                             3667.
Df Model:                           7
Covariance Type:                  HC3
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          27.8570      0.452     61.628      0.000      26.971      28.743
C(cluster)[T.1]    -1.1963      0.121     -9.853      0.000      -1.434      -0.958
C(cluster)[T.2]    -0.3482      0.080     -4.325      0.000      -0.506      -0.190
LOGP               -6.5444      0.135    -48.485      0.000      -6.809      -6.280
MPG                -0.0072      0.006     -1.284      0.199      -0.018       0.004
HP                -17.5704      4.402     -3.992      0.000     -26.197      -8.944
DIESEL             -0.3768      0.095     -3.957      0.000      -0.563      -0.190
NONEURO             0.3493      0.064      5.491      0.000       0.225       0.474
==============================================================================
Omnibus:                      138.358   Durbin-Watson:                   1.836
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              185.560
Skew:                          -0.866   Prob(JB):                     5.08e-41
Kurtosis:                       3.712   Cond. No.                     6.13e+03
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 6.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
results = [auto_OLS1.params['LOGP'],auto_OLS2.params['LOGP'],auto_OLS3.params['LOGP']]
params = pd.DataFrame({'Estimated Price Elasticity':results},
index = ['Truth', 'No Clusters', 'Clusters'],
)
params
| | Estimated Price Elasticity |
|---|---|
| Truth | -6.357438 |
| No Clusters | -6.871897 |
| Clusters | -6.544450 |
Imputing the clusters based on observable characteristics enabled us to improve our estimated price elasticity!