Lecture 8: Clustering
In a previous notebook we looked at the classifier problem. We knew which species a flower could be, and we built a model to predict the species of an observation based on its characteristics. This is what's known as a supervised machine learning problem.
In this notebook we look at the clustering problem: we do not know how many species there are, and we want to group observations into clusters based on how similar the observations are. This is what is known as an unsupervised machine learning problem. In unsupervised learning, our job is to collapse high-dimensional data into something more manageable.
I've found clustering to be useful for segmenting consumer markets (both in characteristics and geography) to simplify demand estimation. Others find clustering useful to identify bots on Twitter, find blocks of voters that might be on the fence, match people on dating websites, or group incoming freshmen into dorms based on preferences.
We'll start our discussion using something called "K-Means" to identify clusters in the data since it's pretty simple and intuitive. At the end, we'll enter the Wild, Wild West and expand our analysis to different clustering ideas. The agenda for today's lecture is as follows:
Class Announcements
PS3 is on eLC and is due Sunday, November 13, 2022. Usual office hours this week and, of course, by appointment (ie, send me an email and we'll find a time to talk).
1. Why Clusters? An Example ¶
Clustering is useful because it enables us to reduce the dimensionality of large data sets; we can make Big Data potentially small. We'll start with our iris data set because we know that the data are indeed clustered around species. What if someone gave us similar data that didn't include the species variable? Could we still identify these groupings?
We can cluster on many characteristics, but if we use only two, we can visualize them in a scatter plot. Let's use the petal measurements.
We are going to pretend that we do not know the species of each observation. We will use the species information later to see how well the cluster method worked. Note that our approach here is similar to when we introduced econometrics (OLS) and started by generating the true Data Generating Process (DGP). This approach enabled us to know "the truth" so we could evaluate how well our approach worked absent specification error. Same idea here.
Let's load the data and take a look to remind ourselves what we have.
import pandas as pd
import sklearn.cluster
import matplotlib.pyplot as plt
import seaborn as sns
iris = pd.read_csv('./Data/iris.csv')
iris.head(4)
| | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
Notice that I made all the dots the same color. I am pretending that I do not know which species each dot belongs to. In fact, I am pretending that I do not even know how many species there are.
I see two distinct blobs. Define $k$ as the number of clusters we observe in the data. From the picture, it seems reasonable to guess there are two clusters (ie, $k=2$) in this data but of course we know there are three. Let's fit the model based on our (bad) guess.
Like we saw in the classifier problem, we need numpy arrays to work with sklearn. Below, I convert the two characteristics to numpy arrays.
X = iris[['petal width', 'petal length']].to_numpy()
K-Means Clustering¶
K-means clustering generates clusters by looking for the centers of circles of data. The algorithm varies the location of the cluster centers until it has minimized the average distance from the center of each cluster to its members. The recipe is therefore:
- Select a number of classes/groups to use and randomly initialize their respective center points. To figure out the number of classes to use, it's good to take a quick look at the data and try to identify any distinct groupings (ie, what we did above).
- Each data point is classified by computing the distance between that point and each group center. Each point is then assigned to the group whose center is closest to it.
- We then recompute the group center by taking the mean of all the vectors in the group.
- Repeat these steps for a set number of iterations or until the group centers don’t change much between iterations. You can also opt to randomly initialize the group centers a few times, and then select the run that looks like it provided the best results.
K-Means has the advantage that it's fast (and intuitive), since we're only really computing the distances between points and group centers. Of course, there is no free lunch, and we'll see that K-Means has disadvantages. BTW, there is also a K-Medians algorithm, which is less sensitive to outliers but also slower.
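To make the recipe concrete, here is a minimal from-scratch sketch of the algorithm. It is for illustration only -- we'll use sklearn's implementation below -- and the function name kmeans_naive, the fixed iteration count, and the lack of empty-cluster handling are my own simplifications, not part of any library.
import numpy as np

def kmeans_naive(X, k, n_iter=20, seed=0):
    """Bare-bones K-means: pick random centers, assign points, re-average, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k observations as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each center as the mean of its members
        # (an empty cluster would produce a NaN center; not handled in this sketch)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Example usage on the petal data defined above:
# labels, centers = kmeans_naive(X, k=2)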
# The random_state option ensures we all get the same answer.
kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0).fit(X)
The object returned from the model contains (among other things) the coordinates of the centers. They are numpy arrays. I will convert them to a dataframe.
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['y', 'x'])
Let's graph the results: the raw data are blue dots as before, and I'll put a red star on each of the centers found above.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
This is about what I expected from 'eyeballing' it.
We know that there are three types of irises in the data set. Let's try 3 clusters and graph the results.
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['y', 'x'])
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
Do these three centers make sense? Let's grab the predicted values (the `labels_`) for each data point and plot them. Then I will merge them onto the original DataFrame.
pred = pd.DataFrame(kmeans.labels_, columns=['predicted'])
iris = pd.merge(left=pred, right=iris, left_index=True, right_index=True)
Let's take a look to make sure things merged correctly:
iris.sample(10)
| | predicted | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|---|
| 52 | 2 | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
| 124 | 1 | 6.7 | 3.3 | 5.7 | 2.1 | Iris-virginica |
| 47 | 0 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
| 121 | 1 | 5.6 | 2.8 | 4.9 | 2.0 | Iris-virginica |
| 20 | 0 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa |
| 91 | 2 | 6.1 | 3.0 | 4.6 | 1.4 | Iris-versicolor |
| 120 | 1 | 6.9 | 3.2 | 5.7 | 2.3 | Iris-virginica |
| 61 | 2 | 5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor |
| 46 | 0 | 5.1 | 3.8 | 1.6 | 0.2 | Iris-setosa |
| 134 | 1 | 6.1 | 2.6 | 5.6 | 1.4 | Iris-virginica |
So what happened here? The clustering algorithm identified three clusters of data and labeled them 0, 1, and 2. We know the data are grouped by species, so let's take a look at how the categories line up with species.
iris.groupby('species')['predicted'].mean()
species
Iris-setosa        0.00
Iris-versicolor    1.96
Iris-virginica     1.08
Name: predicted, dtype: float64
#iris['predicted'] = iris['predicted'].replace({0:'setosa', 1:'versicolor', 2:'virginica'})
fig, ax = plt.subplots(1,2,figsize=(15,6))
sns.scatterplot(ax=ax[0], x='petal length', y='petal width', hue='predicted', data=iris,palette=['#377eb8', '#4daf4a', '#ff7f00'])
sns.scatterplot(ax=ax[1], x='petal length', y='petal width', hue='species', data=iris)
ax[0].scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax[0].set_title('Predicted Clusters',size=20)
ax[1].set_title('Actual Clusters',size=20)
ax[0].legend(frameon=False)
ax[1].legend(frameon=False)
sns.despine(ax=ax[0])
sns.despine(ax=ax[1])
plt.show()
We observe in the groupby data and the pictures that the algorithm does a good job. The setosa cluster is perfect — which is not surprising. The split between the other two types is decent but not perfect since these data overlap more.
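If you want an exact tally rather than an eyeball comparison, a cross-tabulation of the actual species against the predicted labels works nicely. A quick sketch using the merged DataFrame from above:
# Rows are the true species, columns are the K-Means labels (0, 1, 2)
pd.crosstab(iris['species'], iris['predicted'])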
2. Choosing the Number of Clusters: The Elbow Graph ¶
When I looked at the data and pretended that I did not know how many species there were, I thought there were two clusters. When I added a third, it did a pretty good job matching the data.
But the whole point of these models is that we do not know how many clusters there are. So how do we pick the number of clusters?
The short answer: try a bunch and see what happens.
We can loop over the number of clusters and keep track of the sum of squared distances between the cluster centers and the members of the clusters. If there are gaps in the data (ie, clusters), adding an additional cluster will reduce these distances dramatically. This sum-of-squared-distances measure is stored in the `inertia_` attribute. Let's see how this works:
y=[] # Start with an empty list to collect the inertia values
for n in range(1,8): # loop from 1 to 7 clusters
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
y
[550.6434666666667, 86.40394533571005, 31.387758974358977, 19.499400899685114, 13.933308757908756, 11.107174889156013, 9.225808730158729]
These numbers change a lot. Let's look at them in a picture:
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,8), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
We see that moving from 1 to 2 clusters made a big improvement. ML people call this the "elbow" of the graph because the figure looks like an arm, I guess. Moving from 2 to 3 clusters made less of an improvement.
We can always lower the sum of squared distance (ssd) by increasing k. When k is the same as the number of observations, then each cluster has exactly one observation in it and the ssd is zero. But that isn't very helpful. As economists, we would say that there are diminishing returns to k.
From this figure, I would choose either k=2 or k=3. After that, we do not get much improvement in the ssd.
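Since the elbow is ultimately a judgment call, it can be worth cross-checking with a second heuristic. One common option (not used elsewhere in this notebook) is the silhouette score, which rewards tight, well-separated clusters; higher is better, and it is only defined for k of at least 2. A hedged sketch on the same petal data:
from sklearn.metrics import silhouette_score

# Higher silhouette scores indicate tighter, better-separated clusters
for n in range(2, 8):
    labels = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X).labels_
    print(n, round(silhouette_score(X, labels), 3))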
Practice¶
Which Minnesota counties are similar to which Wisconsin counties? Perhaps your company has spent a lot of money researching Wisconsin as a marketplace. Now it wants to expand into Minnesota. Can we find counties in Minnesota that are similar to counties in Wisconsin? Let's use our clustering model to see.
- Load the file `counties.csv`. The variables are:
  - `income`: income in thousands of USD, for 2018
  - `pop`: population (number of persons), for 2018
  - `ALAND`: area of the county in square meters
counties = pd.read_csv('./Data/counties.csv')
counties.sample(5)
| | GEOID | Description | income | pop | ALAND |
|---|---|---|---|---|---|
| 23 | 27047 | Freeborn, MN | 1394588.0 | 30444.0 | 1831823843 |
| 151 | 55127 | Walworth, WI | 5188239.0 | 103718.0 | 1438568339 |
| 143 | 55111 | Sauk, WI | 3107378.0 | 64249.0 | 2153686396 |
| 14 | 27029 | Clearwater, MN | 387647.0 | 8810.0 | 2586701770 |
| 10 | 27021 | Cass, MN | 1400484.0 | 29519.0 | 5235592037 |
- Create an income per capita variable: income/pop
counties['income_cap'] = counties['income']/counties['pop']
- Create a density variable: pop/(ALAND/1000000). Note: dividing `ALAND` by 1,000,000 converts this measure to people per square kilometer.
counties['density'] = counties['pop']/(counties['ALAND']/1000000)
- Make a scatter plot of income per capita and density. Do you see a relationship? Would a linear model be helpful here?
fig, ax = plt.subplots(figsize=(15,7))
ax.scatter(counties['density'], counties['income_cap'], color='blue')
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
sns.despine(ax=ax)
plt.show()
- Compute the elbow graph using `density` and `income_cap` as the variables. Try 1 through 10 possible clusters. Remember, you need to convert your DataFrame data to numpy arrays.
# Convert to numpy
X = counties[['density', 'income_cap']].to_numpy()
# Try the model with 1, 2, ..., 10 clusters. Keep track of the ssd ('inertia')
y=[]
for n in range(1, 11):
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,11), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
- Based on your elbow graph, how many clusters seem appropriate?
Answer: It looks like 3 clusters is reasonable.
Let's explore our results some more.
- Run the kmeans model with k=3
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
- Retrieve the `labels_` from the results.
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
- Merge the labels onto your original DataFrame, using the index as the merge key.
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
detail.head(2)
| | GEOID | Description | income | pop | ALAND | income_cap | density | cluster |
|---|---|---|---|---|---|---|---|---|
| 0 | 27001 | Aitkin, MN | 632584.0 | 15902.0 | 4718433597 | 39.780153 | 3.370186 | 2 |
| 1 | 27003 | Anoka, MN | 18135923.0 | 353813.0 | 1092919450 | 51.258498 | 323.732001 | 0 |
- Plot `income_cap` and `density` again using seaborn's scatterplot. Set the hue to the `labels_` you retrieved. What patterns do you see? You might add labels to the points so you can tell which counties are which.
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', data=detail, palette=['#377eb8', '#ff7f00', '#4daf4a'])
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
sns.despine(ax=ax)
ax.legend(frameon=False)
plt.show()
- Redo 7 through 10 with k=4. How do the results change?
kmeans = sklearn.cluster.KMeans(n_clusters=4, random_state=0).fit(X)
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', data=detail, palette=['#377eb8', '#ff7f00', '#4daf4a','#f781bf'])
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
Answer: Hennepin, MN becomes its own cluster.
Comments¶
As I mentioned above, K-Means has a couple of disadvantages.
- You have to select how many groups/classes there are. The elbow graph helps, but there's no statistical test to lean on -- or at least no agreed best practice across researchers. This isn't always trivial, and ideally a clustering algorithm would figure this out for us, since the whole point is to gain insight from the data.
- K-means starts with a random choice of cluster centers, so it may yield different clustering results on different runs of the algorithm. The results may not be repeatable and can lack consistency; other clustering methods are more consistent (see the sketch after this list).
- K-means fundamentally draws circles around the data. What if circles are not the best way to cluster the data? Look at the scatterplots in the practice exercise. The data near zero density look like a column (ie, a lot of heterogeneity in income per capita), while there appears to be a positive relationship between density and income per capita for densities between roughly 25 and 200. It seems like these two groups should be different clusters, but K-Means has difficulty doing that: it looks for centers of circles and will therefore choose a center in the middle of both groups in order to capture both.
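To see point 2 in action, refit the county data with a single random start per run (n_init=1) and different seeds, then compare the resulting inertia. The seeds below are arbitrary; with sklearn's default of multiple restarts this instability is much less of an issue.
# With only one random start per fit, different seeds can land on different local optima
for seed in [0, 1, 2, 3]:
    km = sklearn.cluster.KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print('seed', seed, 'inertia:', round(km.inertia_, 1))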
3. Other Clustering Methods ¶
3.1 Mean-Shift Clustering ¶
Mean shift clustering is a sliding-window-based algorithm that attempts to find dense areas of data points. It is a centroid-based algorithm meaning that the goal is to locate the center points of each group/class, which works by updating candidates for center points to be the mean of the points within the sliding-window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of center points and their corresponding groups.
In contrast to K-means clustering, there is no need to select the number of clusters as mean-shift automatically discovers this -- that's a nice advantage. The fact that the cluster centers converge towards the points of maximum density is also quite desirable as it is quite intuitive to understand and fits well in a naturally data-driven sense.
from sklearn.cluster import MeanShift, estimate_bandwidth
# The following bandwidth can be automatically detected using
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
clusters = pd.DataFrame(ms.labels_, columns=['cluster'])
col = ['#377eb8', '#ff7f00', '#4daf4a','#f781bf','k']
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', palette=col, s=50, data=detail)
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
The algorithm selected five clusters. Note that the increase in clusters was because it designated Hennepin, Ramsey, and Milwaukee as different. We know that Milwaukee is different -- Alice Cooper taught us that.
Was this a good clustering choice for our data? I would argue "no."
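Mean-shift does have one knob, the bandwidth, and the county results are fairly sensitive to it. A quick, hedged way to see this is to vary the quantile passed to estimate_bandwidth (the values below are arbitrary) and count the clusters it finds:
# Larger quantiles -> larger bandwidth -> fewer clusters
for q in [0.1, 0.2, 0.3, 0.5]:
    bw = estimate_bandwidth(X, quantile=q, n_samples=500)
    ms_q = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
    print('quantile', q, '-> clusters:', len(ms_q.cluster_centers_))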
3.2 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)¶
DBSCAN is a form of nearest-neighbor clustering algorithm that you commonly hear being applied to data. The rough idea is that you work through the data and keep adding points to a cluster until you reach points that are sufficiently different. It is a density-based clustering algorithm similar to mean-shift, but with a couple of notable advantages. First, it does not require a pre-set number of clusters at all. The central component of DBSCAN is the concept of core samples -- samples that are in areas of high density. A "cluster" is therefore a set of core samples that are close to each other (measured by some distance measure). Any core sample is part of a cluster, by definition. Any sample that is not a core sample, and is at least `eps` in distance from any core sample, is considered an outlier. Second, it identifies outliers as noise (which may be useful in our case), whereas k-means and mean-shift simply throw them into a cluster even if the data point is very different. Third, it can find arbitrarily sized and arbitrarily shaped clusters quite well.
- On the downside, it can struggle when there is a lot of variation in cluster density.
There are two parameters to the algorithm: `min_samples` and `eps`. These formally define what we mean when we say "dense."
The parameter `min_samples` primarily controls how tolerant the algorithm is towards noise (on noisy and large data sets it may be desirable to increase this parameter), though results are not terribly sensitive to it, so it's not very important IMO. The parameter `eps`, however, is crucial to choose appropriately for the data set. It controls the local neighborhood of the points. When chosen too small, most data will not be clustered at all (and labeled as `-1` for "noise"). When chosen too large, close clusters get merged into one, and eventually the entire data set is returned as a single cluster. Some heuristics for choosing this parameter have been discussed in the literature, for example based on an elbow in the nearest-neighbor distances plot. Let's give it a rip in our county data.
Solving for the optimal `eps`: Another Elbow¶
We can calculate the distance from each point to its closest neighbor using the `NearestNeighbors` package. The point itself is included in `n_neighbors`. The `kneighbors` method returns two arrays, one containing the distance to the closest `n_neighbors` points and the other containing the index of each of those points. As with our previous elbow graph, we're looking for an inflection point where changing `eps` really has no effect on the distances.
import numpy as np
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)
# Sort the results
distances = np.sort(distances, axis=0)
distances = distances[:,1]
fig, ax = plt.subplots(figsize=(15,6))
plt.plot(distances, color='blue', marker='.')
ax.set_xlabel('points sorted by distance to nearest neighbor')
ax.set_ylabel('distance (candidate eps)')
sns.despine(ax=ax)
Looks like a lot of curvature when `eps` equals 100. Let's try that.
from sklearn.cluster import DBSCAN
# Compute DBSCAN with eps = 100
db = DBSCAN(eps=100, min_samples=5).fit(X)
clusters = pd.DataFrame(db.labels_, columns=['cluster'])
col = ['#377eb8', '#ff7f00', '#4daf4a']
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', s=50, data=detail)
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
Hmmm... not great. In fact, playing around with `eps` demonstrates that we're only shifting which observations are noise. The clusters we observe with our eyeballs are just too heterogeneous for this method to work well (see the "downside" discussion above).
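Here is roughly what that "playing around" looks like: loop over a few eps values (arbitrary choices on my part) and count how many clusters and how many noise points (label -1) DBSCAN returns.
# Tally clusters and noise points for a few candidate eps values
for eps in [50, 100, 200, 400]:
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = (labels == -1).sum()
    print('eps', eps, '-> clusters:', n_clusters, 'noise points:', n_noise)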
3.3 Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM).¶
One of the major drawbacks of K-Means is its naive use of the mean value for the cluster center and its insistence on identifying clusters with circles. The work-around is to relax the circle assumption and use Gaussian Mixture Models (GMMs), which give us more flexibility. With GMMs we assume that the data points are Gaussian distributed; this is a less restrictive assumption than saying they are circular. That way, we have two parameters to describe the shape of the clusters: the mean and the standard deviation. In two dimensions, this means that the clusters can take any kind of elliptical shape (since we have a standard deviation in both the x and y directions). Each Gaussian distribution is assigned to a single cluster. Hence, we could use this clustering approach to identify the different clusters in our practice data. To find the parameters of the Gaussian for each cluster (eg, the mean and standard deviation), we use an optimization algorithm called Expectation-Maximization (EM).
This seems like just the thing we'd like to use with our county data. Let's give it a try using four clusters.
from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=4, random_state=0).fit(X)
clusters = pd.DataFrame(gm.predict(X), columns=['cluster'])
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', palette=['#377eb8', '#ff7f00', '#4daf4a','#f781bf'], s=50, data=detail)
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
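A nice side benefit of GMMs is that, because they are probabilistic models, we can lean on an information criterion instead of eyeballing an elbow when choosing the number of components. A hedged sketch using sklearn's built-in BIC (lower is better); the range tried below is arbitrary:
# Compare BIC across candidate numbers of mixture components
for n in range(1, 7):
    gm_n = GaussianMixture(n_components=n, random_state=0).fit(X)
    print(n, 'BIC:', round(gm_n.bic(X), 1))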
3.4 Agglomerative Hierarchical Clustering¶
Hierarchical clustering algorithms fall into two categories: top-down or bottom-up. Bottom-up algorithms treat each data point as a single cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all data points. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering (HAC). This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. Cute, right? I had no idea what a dendrogram was before I started working in ML. Here is the wiki page.
Note that this technique can be particularly useful for "feature reduction," which is popular in finance (where this type of analysis is known as "Principal Component Analysis"). Fundamentally, it's a useful dimensionality-reduction tool, which in ML lingo falls under "unsupervised learning."
Let's try it with our data...
Picking the Number of Clusters: The Dendrogram¶
First, we have to decide how many clusters to designate. As before, we can evaluate the data directly to look for gaps. Instead of an elbow, though, we can use a dendrogram to visualize the history of groupings and figure out the optimal number of clusters. We identify the optimal number of clusters as follows:
- Determine the largest vertical distance that doesn't intersect any of the other clusters.
- Draw a horizontal line between the extremities.
- The optimal number of clusters is equal to the number of vertical lines going through the horizontal line.
import scipy.cluster.hierarchy as sch # Used to create a dendrogram
fig, ax = plt.subplots(figsize=(15,6))
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'),ax=ax)
# Different methods to measure linkages / distance
# Ward - minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
# Maximum or "complete" - minimizes the maximum distance between observations of pairs of clusters.
# Average - minimizes the average of the distances between all observations of pairs of clusters.
# Single - minimizes the distance between the closest observations of pairs of clusters.
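The "largest vertical gap" rule can also be read off the linkage matrix directly: its third column holds the distance at which each merge happens, and a big jump between successive merges suggests a natural place to cut the tree. A rough sketch, assuming the same Ward linkage as above:
Z = sch.linkage(X, method='ward')
merge_dist = Z[:, 2]                # distance at which each successive merge happens
gaps = np.diff(merge_dist)          # jump between consecutive merge distances
# Cutting the tree just before the biggest jump leaves this many clusters
print('suggested clusters:', len(merge_dist) - np.argmax(gaps))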
Based on the above recipe, I should choose two clusters but I'm going to instead choose three because I know the outliers will occupy one cluster and I want to focus my (potential) analysis on the remaining counties. Equivalently, I could just drop the outlier counties.
from sklearn.cluster import AgglomerativeClustering
AHC = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward').fit(X)
clusters = pd.DataFrame(AHC.labels_, columns=['cluster'])
col = ['#377eb8', '#ff7f00', '#4daf4a']
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', s=50, palette=col, data=detail)
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
That seems to work pretty well too. These groups seem pretty natural.
Summary of Different Approaches¶
The figure below is from the scikit clustering documentation and demonstrates how different clustering approaches (some I have not discussed) do for different variations in the (pretend) data. The numbers in the bottom-right present computation time which may be an issue as your data become larger and clustering algorithm becomes more complex.
4. When Might This be Useful? ¶
We'll see that clustering will be very useful when we have too many features to work with in a tractable way (our preferred ML model takes forever to fit or requires too much computing power). Clustering can also be useful for constructing data clusters that make prediction (estimation, in the following example) better.
Suppose we had the following data on vehicle sales:
autos = pd.read_stata('./Data/cars.dta')
# Let's create some new variables
segment_dummies = pd.get_dummies(autos['SEGMENT'])
autos = pd.concat([autos, segment_dummies], axis=1)
autos.rename(columns={1:'SMALL', 2:'COMPACT', 3:'SEDAN', 4:'LUXURY', 5:'MINIVAN'},inplace=True)
# Imported Cars
autos['NONEURO'] = 0
autos.loc[(autos['ORIG']>1),'NONEURO'] = 1
# Rename "Fuel" "Diesel" since this is an indicator where 1 => diesel engine
autos.rename(columns={'FUEL':'DIESEL'},inplace=True)
# Apply logs so we can estimate elasticity
autos['LOGQ'] = np.log(autos['QUANTITY'])
autos['LOGP'] = np.log(autos['PRICE'])
autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1269 entries, 0 to 1268
Data columns (total 26 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   CODE       1269 non-null   int16
 1   ORIG       1269 non-null   int8
 2   FIRM_ID    1269 non-null   float32
 3   FIRM       1269 non-null   object
 4   BRAND      1269 non-null   object
 5   MODEL      1269 non-null   object
 6   YEAR       1269 non-null   int16
 7   PRICE      1269 non-null   float32
 8   QUANTITY   1269 non-null   float32
 9   HP         1269 non-null   float32
 10  LENGTH     1269 non-null   float32
 11  WIDTH      1269 non-null   float32
 12  SIZE       1269 non-null   float32
 13  WEIGHT     1269 non-null   float32
 14  DIESEL     1269 non-null   int8
 15  MPG        1269 non-null   float32
 16  FUELPRICE  1269 non-null   float32
 17  SEGMENT    1269 non-null   int8
 18  SMALL      1269 non-null   uint8
 19  COMPACT    1269 non-null   uint8
 20  SEDAN      1269 non-null   uint8
 21  LUXURY     1269 non-null   uint8
 22  MINIVAN    1269 non-null   uint8
 23  NONEURO    1269 non-null   int64
 24  LOGQ       1269 non-null   float32
 25  LOGP       1269 non-null   float32
dtypes: float32(12), int16(2), int64(1), int8(3), object(3), uint8(5)
memory usage: 123.9+ KB
And we wanted to estimate demand and are particularly interested in estimating price elasticity:
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
# OLS
auto_OLS1 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO + SMALL + SEDAN + LUXURY + MINIVAN', data=autos).fit(cov_type = 'HC3')
print(auto_OLS1.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.925
Model:                            OLS   Adj. R-squared:                  0.925
Method:                 Least Squares   F-statistic:                     1198.
Date:                Wed, 09 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:44:31   Log-Likelihood:                -1735.6
No. Observations:                1269   AIC:                             3491.
Df Residuals:                    1259   BIC:                             3543.
Df Model:                           9
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.2383      0.491     55.441      0.000      26.275      28.201
LOGP          -6.3574      0.135    -47.193      0.000      -6.621      -6.093
MPG           -0.0127      0.006     -2.036      0.042      -0.025      -0.000
HP            -4.7571      5.031     -0.946      0.344     -14.617       5.103
DIESEL        -0.4234      0.092     -4.585      0.000      -0.604      -0.242
NONEURO        0.3850      0.059      6.502      0.000       0.269       0.501
SMALL         -0.1625      0.100     -1.631      0.103      -0.358       0.033
SEDAN         -0.5817      0.077     -7.592      0.000      -0.732      -0.432
LUXURY        -1.7088      0.122    -14.051      0.000      -1.947      -1.470
MINIVAN       -1.2488      0.117    -10.683      0.000      -1.478      -1.020
==============================================================================
Omnibus:                      120.332   Durbin-Watson:                   1.625
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              155.702
Skew:                          -0.793   Prob(JB):                     1.55e-34
Kurtosis:                       3.655   Cond. No.                     7.13e+03
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 7.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The above (biased) OLS regression indicates that the price elasticity is about -6.36 (the coefficient on LOGP). Knowing the vehicle segment turns out to be important to understanding the price elasticity (okay, it's not a huge difference, but a difference nonetheless):
# OLS
auto_OLS2 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO', data=autos).fit(cov_type = 'HC3')
print(auto_OLS2.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.909
Model:                            OLS   Adj. R-squared:                  0.908
Method:                 Least Squares   F-statistic:                     1133.
Date:                Wed, 09 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:44:31   Log-Likelihood:                -1861.5
No. Observations:                1269   AIC:                             3735.
Df Residuals:                    1263   BIC:                             3766.
Df Model:                           5
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.0949      0.474     57.159      0.000      26.166      28.024
LOGP          -6.8719      0.127    -53.912      0.000      -7.122      -6.622
MPG            0.0261      0.005      5.649      0.000       0.017       0.035
HP           -16.4147      4.334     -3.787      0.000     -24.909      -7.920
DIESEL        -0.8058      0.090     -8.921      0.000      -0.983      -0.629
NONEURO        0.4518      0.066      6.871      0.000       0.323       0.581
==============================================================================
Omnibus:                      127.456   Durbin-Watson:                   1.781
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              167.280
Skew:                          -0.822   Prob(JB):                     4.74e-37
Kurtosis:                       3.677   Cond. No.                     5.98e+03
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 5.98e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Suppose we didn't know the segment information. There seems to be information embedded in `SEGMENT` that is correlated with log-price. Let's see if we can replicate vehicle segments using the other characteristics and K-means clustering.
X = autos[['HP','SIZE', 'WEIGHT', 'MPG']].to_numpy()
# Try the model with 1, 2, ..., 10 clusters. Keep track of the ssd ('inertia')
y=[]
for n in range(1, 11):
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,11), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
I read the elbow graph as saying we should stop at three clusters.
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
autos = pd.merge(left=autos, right=clusters, left_index=True, right_index=True, how='left')
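One caveat worth flagging: HP, SIZE, WEIGHT, and MPG live on very different scales, so the Euclidean distances K-Means uses are dominated by whichever feature has the biggest numbers. Standardizing first is a common fix. A hedged sketch (StandardScaler is not something we used above, and I'm not re-running the regressions with these alternative labels):
from sklearn.preprocessing import StandardScaler

# Rescale each characteristic to mean 0 and variance 1 before clustering
X_std = StandardScaler().fit_transform(X)
kmeans_std = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X_std)

# How often do the unscaled and scaled cluster labels agree? (label ids are arbitrary)
pd.crosstab(kmeans.labels_, kmeans_std.labels_)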
So now run OLS again with the clusters we built based on vehicle characteristics.
# OLS
auto_OLS3 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO + C(cluster)', data=autos).fit(cov_type = 'HC3')
print(auto_OLS3.summary())
OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.916
Model:                            OLS   Adj. R-squared:                  0.916
Method:                 Least Squares   F-statistic:                     1115.
Date:                Wed, 09 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:44:33   Log-Likelihood:                -1805.1
No. Observations:                1269   AIC:                             3626.
Df Residuals:                    1261   BIC:                             3667.
Df Model:                           7
Covariance Type:                  HC3
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          27.8570      0.452     61.628      0.000      26.971      28.743
C(cluster)[T.1]    -1.1963      0.121     -9.853      0.000      -1.434      -0.958
C(cluster)[T.2]    -0.3482      0.080     -4.325      0.000      -0.506      -0.190
LOGP               -6.5444      0.135    -48.485      0.000      -6.809      -6.280
MPG                -0.0072      0.006     -1.284      0.199      -0.018       0.004
HP                -17.5704      4.402     -3.992      0.000     -26.197      -8.944
DIESEL             -0.3768      0.095     -3.957      0.000      -0.563      -0.190
NONEURO             0.3493      0.064      5.491      0.000       0.225       0.474
==============================================================================
Omnibus:                      138.358   Durbin-Watson:                   1.836
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              185.560
Skew:                          -0.866   Prob(JB):                     5.08e-41
Kurtosis:                       3.712   Cond. No.                     6.13e+03
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 6.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
results = [auto_OLS1.params['LOGP'],auto_OLS2.params['LOGP'],auto_OLS3.params['LOGP']]
params = pd.DataFrame({'Estimated Price Elasticity':results},
index = ['Truth', 'No Clusters', 'Clusters'],
)
params
| | Estimated Price Elasticity |
|---|---|
| Truth | -6.357438 |
| No Clusters | -6.871897 |
| Clusters | -6.544450 |
Imputing the clusters based on observable characteristics enabled us to improve our estimated price elasticity!