Lecture 25: Unsupervised Learning [MY SOLUTIONS]
In a previous notebook we looked at the classifier problem. We knew which species a flower could be, and we built a model to predict the species of an observation based on its characteristics. That problem fits into a larger class of ML problems known as "supervised" learning, where we use labeled data to train models that make predictions.
In this notebook we look at a different category of ML problem known as "unsupervised" learning, where the data come without labels and we want to uncover structure in them, often by collapsing many features down to something more manageable. Today, we'll talk about two different approaches:
- We can consolidate the data features by clustering observations into groups based on how similar the observations are. I've found clustering to be useful for segmenting consumer markets (both in characteristics and geography) in order to third-degree price discriminate. Others find clustering useful to identify bots on Twitter, find blocks of voters that might be on the fence, match people on dating websites, or group incoming freshmen into dorms based on preferences. We'll focus our discussion on something called "K-Means" to identify clusters in the data since it's pretty simple and intuitive. K-Means is the most popular clustering algorithm, but there are others.
- We can reduce the dimensionality of the features (fancy math talk for reducing the number of features) using principal component analysis (PCA), which is perhaps the most popular unsupervised algorithm. PCA is often used as an input to build supervised models, but it can also be useful as a tool for:
- visualization,
- noise filtering,
- feature extraction and engineering.
Because of its versatility and interpretability, PCA has proven effective in a wide variety of contexts and disciplines. You will often see it used to simplify covariate data in finance. If you've had linear algebra, what we'll be doing in this lecture with PCA is building a set of basis vectors that describes the data. If not, no big deal. We'll make pictures to visualize the idea.
The agenda for today's lecture is as follows:
Class Announcements
- PS6 is on eLC and is due end-of-day Sunday, December 4, 2022.
1. Why Clusters? An Example
Clustering is useful because it enables us to reduce the dimensionality of large data sets; we can make Big Data potentially small. We'll start with our iris data set because we know that the data are indeed clustered around species. What if someone gave us similar data which didn't include the species variable? Could we still identify these groupings?
We can cluster on many characteristics, but if we use only two, we can visualize them in a scatter plot. Let's use the petal measurements.
We are going to pretend that we do not know the species of each observation. We will use the species information later to see how well the cluster method worked. Note that our approach here is similar to when we introduced econometrics (OLS) and started by generating the true Data Generating Process (DGP). This approach enabled us to know "the truth" so we could evaluate how well our approach worked absent specification error. Same idea here.
Let's load the data and take a look to remind ourselves what we have.
import pandas as pd
import sklearn.cluster
import matplotlib.pyplot as plt
import seaborn as sns
iris = pd.read_csv('./Data/iris.csv')
iris.head(4)
|   | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
Notice that I made all the dots the same color. I am pretending that I do not know which species each dot belongs to. In fact, I am pretending that I do not even know how many species there are.
I see two distinct blobs. Define $k$ as the number of clusters we observe in the data. From the picture, it seems reasonable to guess there are two clusters (ie, $k=2$) in this data but of course we know there are three. Let's fit the model based on our (bad) guess.
Like we saw in the classifier problem, we need numpy arrays to work with sklearn. Below, I convert the two characteristics to numpy arrays.
X = iris[['petal width', 'petal length']].to_numpy()
K-Means Clustering
K-means clustering generates clusters by looking for the centers of circular blobs of data. The algorithm varies the locations of the cluster centers until it has minimized the average distance from each cluster's center to its members. The recipe (sketched in code after the list) is therefore:
1. Select the number of classes/groups to use and randomly initialize their respective center points. To figure out how many to use, it's good to take a quick look at the data and try to identify any distinct groupings (ie, what we did above).
2. Classify each data point by computing the distance between that point and each group center, then assign the point to the group whose center is closest to it.
3. Recompute each group center by taking the mean of all the vectors in the group.
4. Repeat these steps for a set number of iterations or until the group centers don't change much between iterations. You can also opt to randomly initialize the group centers a few times, and then select the run that looks like it provided the best results.
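To make the recipe concrete, here is a minimal NumPy sketch of the loop (an illustration only, not the scikit-learn implementation we use below; the helper name kmeans_sketch is mine):

import numpy as np

def kmeans_sketch(X, k, n_iter=100):
    # Step 1: initialize the centers as k randomly chosen observations.
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its members.
        # (A production implementation would also handle empty clusters.)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

Running kmeans_sketch(X, 2) on the petal measurements should put the centers near the two blobs we eyeball below.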
K-Means has the advantage that it's fast (and intuitive) since we're only really computing the distances between points and group centers. Of course, there is no free lunch, and we'll see that K-Means has disadvantages. BTW, there is also a K-Medians algorithm which is less sensitive to outliers but also slower.
# The random_state option ensures we all get the same answer.
kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0).fit(X)
The object returned from the model contains (among other things) the coordinates of the centers. They are numpy arrays. I will convert them to a dataframe. Remember that X was built as [petal width, petal length], so the first column of the centers is our y variable (petal width) and the second is our x variable (petal length).
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['y', 'x'])
And let's graph the results: the raw data are blue dots as before, and I'll put a red star on each of the centers found above.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
This is about what I expected from 'eyeballing' it.
We know that there are three types of irises in the data set. Let's try 3 clusters and graph the results.
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['y', 'x'])
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(iris['petal length'], iris['petal width'], color='blue')
ax.scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax.set_xlabel('petal length (cm)')
ax.set_ylabel('petal width (cm)')
sns.despine()
plt.show()
Do these three centers make sense? Let's grab the predicted values (the `labels_`) for each data point, merge them onto the original DataFrame, and then plot them.
pred = pd.DataFrame(kmeans.labels_, columns=['predicted'])
iris = pd.merge(left=pred, right=iris, left_index=True, right_index=True)
Let's take a look to make sure things merged correctly:
iris.sample(10)
|   | predicted | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|---|
| 69 | 2 | 5.6 | 2.5 | 3.9 | 1.1 | Iris-versicolor |
| 0 | 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 50 | 2 | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
| 46 | 0 | 5.1 | 3.8 | 1.6 | 0.2 | Iris-setosa |
| 127 | 1 | 6.1 | 3.0 | 4.9 | 1.8 | Iris-virginica |
| 22 | 0 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
| 145 | 1 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 87 | 2 | 6.3 | 2.3 | 4.4 | 1.3 | Iris-versicolor |
| 148 | 1 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 63 | 2 | 6.1 | 2.9 | 4.7 | 1.4 | Iris-versicolor |
So what happened here? The clustering algorithm identified three clusters of data and labeled them 0, 1, and 2. We know the data are grouped by species, so let's take a look at how the categories line up with species.
iris.groupby('species')['predicted'].mean()
species
Iris-setosa        0.00
Iris-versicolor    1.96
Iris-virginica     1.08
Name: predicted, dtype: float64
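A cross-tabulation of species against the predicted labels makes the match-up even more explicit; a quick check along these lines:

# Count how many observations of each species landed in each predicted cluster.
print(pd.crosstab(iris['species'], iris['predicted']))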
#iris['predicted'] = iris['predicted'].replace({0:'setosa', 1:'versicolor', 2:'virginica'})
fig, ax = plt.subplots(1,2,figsize=(15,6))
sns.scatterplot(ax=ax[0], x='petal length', y='petal width', hue='predicted', data=iris,palette=['#377eb8', '#4daf4a', '#ff7f00'])
sns.scatterplot(ax=ax[1], x='petal length', y='petal width', hue='species', data=iris)
ax[0].scatter(centers['x'], centers['y'], color='red', s=300, marker='*')
ax[0].set_title('Predicted Clusters',size=20)
ax[1].set_title('Actual Clusters',size=20)
ax[0].legend(frameon=False)
ax[1].legend(frameon=False)
sns.despine(ax=ax[0])
sns.despine(ax=ax[1])
plt.show()
We observe in the groupby data and the pictures that the algorithm does a good job. The setosa cluster is perfect — which is not surprising. The split between the other two types is decent but not perfect since these data overlap more.
2. Choosing the Number of Clusters: The Elbow Graph
When I looked at the data and pretended that I did not know how many species there were, I thought there were two clusters. When I added a third, it did a pretty good job matching the data.
But the whole point of these models is that we do not know how many clusters there are. So how do we pick the number of clusters?
The short answer: Try a bunch and see what happens.
We can loop over the number of clusters and keep track of the sum of squared distances between the cluster centers and the members of the clusters. If there are gaps in the data (ie, clusters), adding an additional cluster will reduce these distances dramatically. The sum of squared distance measure is stored in the `inertia_` attribute of the fitted model. Let's see how this works:
y=[] # Start with an empty list to hold the inertia values
for n in range(1,8): # loop from 1 to 7 clusters
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
y
[550.6434666666667, 86.40394533571005, 31.387758974358977, 19.499400899685114, 13.933308757908756, 11.107174889156013, 9.225808730158729]
It looks like these numbers change a lot. Let's take a look at these measures in a picture:
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,8), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
We see that moving from 1 to 2 clusters made a big improvement. ML people call this the "elbow" of the graph because the figure looks like an arm, I guess. Moving from 2 to 3 clusters made less of an improvement.
We can always lower the sum of squared distance (ssd) by increasing k. When k is the same as the number of observations, then each cluster has exactly one observation in it and the ssd is zero. But that isn't very helpful. As economists, we would say that there are diminishing returns to k.
From this figure, I would choose either k=2 or k=3. After that, we do not get much improvement in the ssd.
Practice
Which Minnesota counties are similar to which Wisconsin counties? Perhaps your company has spent a lot of money researching Wisconsin as a marketplace. Now it wants to expand into Minnesota. Can we find counties in Minnesota that are similar to counties in Wisconsin? Let's use our clustering model to see.
- Load the file `counties.csv`. The variables are:
  - `income`: income in thousands of USD, for 2018
  - `pop`: population (number of persons), for 2018
  - `ALAND`: area of the county in square meters
counties = pd.read_csv('./Data/counties.csv')
counties.sample(5)
|   | GEOID | Description | income | pop | ALAND |
|---|---|---|---|---|---|
| 51 | 27103 | Nicollet, MN | 1697480.0 | 34220.0 | 1161978443 |
| 110 | 55047 | Green Lake, WI | 873107.0 | 18918.0 | 905314818 |
| 77 | 27155 | Traverse, MN | 202631.0 | 3308.0 | 1486320975 |
| 54 | 27109 | Olmsted, MN | 9150699.0 | 156277.0 | 1692581423 |
| 133 | 55091 | Pepin, WI | 349590.0 | 7289.0 | 600894481 |
- Create an income per capita variable: income/pop
counties['income_cap'] = counties['income']/counties['pop']
- Create a density variable: pop/(ALAND/1000000). Note: Dividing `ALAND` by 1,000,000 converts this measure to people per square kilometer.
counties['density'] = counties['pop']/(counties['ALAND']/1000000)
- Make a scatter plot of income per capita and density. Do you see a relationship? Would a linear model be helpful here?
fig, ax = plt.subplots(figsize=(15,7))
ax.scatter(counties['density'], counties['income_cap'], color='blue')
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
sns.despine(ax=ax)
plt.show()
- Compute the elbow graph using `density` and `income_cap` as the variables. Try 1 through 10 possible clusters. Remember, you need to convert your DataFrame data to numpy arrays.
# Convert to numpy
X = counties[['density', 'income_cap']].to_numpy()
# Try the model with 1, 2, ..., 10 clusters. Keep track of the ssd ('inertia')
y=[]
for n in range(1, 11):
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,11), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
- Based on your elbow graph, how many clusters seem appropriate?
Answer: It looks like 3 clusters is reasonable.
Let's explore our results some more.
- Run the kmeans model with k=3
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
- Retrieve the `labels_` from the results.
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
- Merge the labels onto your original DataFrame, using the index as the merge key.
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
detail.head(2)
|   | GEOID | Description | income | pop | ALAND | income_cap | density | cluster |
|---|---|---|---|---|---|---|---|---|
| 0 | 27001 | Aitkin, MN | 632584.0 | 15902.0 | 4718433597 | 39.780153 | 3.370186 | 2 |
| 1 | 27003 | Anoka, MN | 18135923.0 | 353813.0 | 1092919450 | 51.258498 | 323.732001 | 0 |
- Plot `income_cap` and `density` again using seaborn's scatterplot. Set the hue to the `labels_` you retrieved. What patterns do you see? You might add labels to the points so you can tell which counties are which.
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', data=detail, palette=['#377eb8', '#ff7f00', '#4daf4a'])
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
sns.despine(ax=ax)
ax.legend(frameon=False)
plt.show()
- Redo the previous four steps (fit the model, retrieve the labels, merge, and plot) with k=4. How do the results change?
kmeans = sklearn.cluster.KMeans(n_clusters=4, random_state=0).fit(X)
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
detail = pd.merge(left=counties, right=clusters, left_index=True, right_index=True)
fig, ax = plt.subplots(figsize=(15,6))
sns.scatterplot(ax=ax, x='density', y='income_cap', hue='cluster', data=detail, palette=['#377eb8', '#ff7f00', '#4daf4a','#f781bf'])
for x,y,t in zip(detail['density'], detail['income_cap'], detail['Description']):
if x > 150:
ax.text(x+10, y+1, t)
sns.despine(ax=ax)
ax.legend(frameon=False)
ax.set_xlabel('density')
ax.set_ylabel('income per capita')
plt.show()
Answer: Hennepin, MN becomes its own cluster.
Comments
As I mentioned above, K-Means has a couple of disadvantages.
You have to select how many groups/classes there are. The Elbow Graph approach helps, but there's no statistical test to lean on -- or at least no best practice shared across researchers. This isn't always trivial, and ideally a clustering algorithm would figure out the number of clusters for us, since the whole point is to gain some insight from the data.
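One diagnostic that is sometimes paired with the elbow graph is the average silhouette score, which rewards tight, well-separated clusters (higher is better). A sketch (this isn't something we ran above), assuming `sklearn.metrics` is available and X is a feature matrix:

import sklearn.cluster
import sklearn.metrics

# Average silhouette score for k = 2, ..., 7; higher values suggest better-separated clusters.
for k in range(2, 8):
    labels = sklearn.cluster.KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(f'k={k}: silhouette = {sklearn.metrics.silhouette_score(X, labels):.3f}')

It still isn't a formal test, but it gives us a second number to compare across choices of k.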
K-means starts with a random choice of cluster centers, so it may yield different clustering results on different runs of the algorithm. The results therefore may not be repeatable. Other clustering methods are more consistent.
K-means fundamentally draws circles around the data. What if circles are not the best way to cluster the data? Look at the scatterplots in the practice exercise. The data near zero density look like a column (ie, a lot of heterogeneity in income per capita), while there appears to be a positive relationship between density and income per capita for densities between roughly 25 and 200. It seems like these two groups should be different clusters, but K-Means will have difficulty doing that since it looks for centers of circles and will therefore always choose a center in the middle of both groups in order to capture both.
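Related to the circle problem: K-Means works off raw Euclidean distances, so a feature measured on a large scale (like density in the practice exercise) can dominate one measured on a small scale (like income per capita). A common remedy, which we did not apply above, is to standardize the features before clustering; a sketch using `sklearn.preprocessing.StandardScaler`:

import sklearn.cluster
import sklearn.preprocessing

# Rescale each feature to mean zero and unit variance before clustering.
X_scaled = sklearn.preprocessing.StandardScaler().fit_transform(
    counties[['density', 'income_cap']].to_numpy())

kmeans_scaled = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X_scaled)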
3. When Clusters Are Useful
We'll see that clustering is very useful when we have too many features to work with in a tractable way (our preferred ML model takes forever to fit or requires too many computing resources). Clustering can also be used to construct group indicators that make prediction (estimation, in the following example) better.
Suppose we had the following data on vehicle sales:
import numpy as np
autos = pd.read_stata('./Data/cars.dta')
# Let's create some new variables
segment_dummies = pd.get_dummies(autos['SEGMENT'])
autos = pd.concat([autos, segment_dummies], axis=1)
autos.rename(columns={1:'SMALL', 2:'COMPACT', 3:'SEDAN', 4:'LUXURY', 5:'MINIVAN'},inplace=True)
# Imported Cars
autos['NONEURO'] = 0
autos.loc[(autos['ORIG']>1),'NONEURO'] = 1
# Rename "Fuel" "Diesel" since this is an indicator where 1 => diesel engine
autos.rename(columns={'FUEL':'DIESEL'},inplace=True)
# Apply logs so we can estimate elasticity
autos['LOGQ'] = np.log(autos['QUANTITY'])
autos['LOGP'] = np.log(autos['PRICE'])
autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1269 entries, 0 to 1268
Data columns (total 26 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   CODE       1269 non-null   int16
 1   ORIG       1269 non-null   int8
 2   FIRM_ID    1269 non-null   float32
 3   FIRM       1269 non-null   object
 4   BRAND      1269 non-null   object
 5   MODEL      1269 non-null   object
 6   YEAR       1269 non-null   int16
 7   PRICE      1269 non-null   float32
 8   QUANTITY   1269 non-null   float32
 9   HP         1269 non-null   float32
 10  LENGTH     1269 non-null   float32
 11  WIDTH      1269 non-null   float32
 12  SIZE       1269 non-null   float32
 13  WEIGHT     1269 non-null   float32
 14  DIESEL     1269 non-null   int8
 15  MPG        1269 non-null   float32
 16  FUELPRICE  1269 non-null   float32
 17  SEGMENT    1269 non-null   int8
 18  SMALL      1269 non-null   uint8
 19  COMPACT    1269 non-null   uint8
 20  SEDAN      1269 non-null   uint8
 21  LUXURY     1269 non-null   uint8
 22  MINIVAN    1269 non-null   uint8
 23  NONEURO    1269 non-null   int64
 24  LOGQ       1269 non-null   float32
 25  LOGP       1269 non-null   float32
dtypes: float32(12), int16(2), int64(1), int8(3), object(3), uint8(5)
memory usage: 123.9+ KB
And we wanted to estimate demand and are particularly interested in estimating price elasticity:
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
# OLS
auto_OLS1 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO + SMALL + SEDAN + LUXURY + MINIVAN', data=autos).fit(cov_type = 'HC3')
print(auto_OLS1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.925
Model:                            OLS   Adj. R-squared:                  0.925
Method:                 Least Squares   F-statistic:                     1198.
Date:                Mon, 28 Nov 2022   Prob (F-statistic):               0.00
Time:                        14:51:37   Log-Likelihood:                -1735.6
No. Observations:                1269   AIC:                             3491.
Df Residuals:                    1259   BIC:                             3543.
Df Model:                           9
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.2383      0.491     55.441      0.000      26.275      28.201
LOGP          -6.3574      0.135    -47.193      0.000      -6.621      -6.093
MPG           -0.0127      0.006     -2.036      0.042      -0.025      -0.000
HP            -4.7571      5.031     -0.946      0.344     -14.617       5.103
DIESEL        -0.4234      0.092     -4.585      0.000      -0.604      -0.242
NONEURO        0.3850      0.059      6.502      0.000       0.269       0.501
SMALL         -0.1625      0.100     -1.631      0.103      -0.358       0.033
SEDAN         -0.5817      0.077     -7.592      0.000      -0.732      -0.432
LUXURY        -1.7088      0.122    -14.051      0.000      -1.947      -1.470
MINIVAN       -1.2488      0.117    -10.683      0.000      -1.478      -1.020
==============================================================================
Omnibus:                      120.332   Durbin-Watson:                   1.625
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              155.702
Skew:                          -0.793   Prob(JB):                     1.55e-34
Kurtosis:                       3.655   Cond. No.                     7.13e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 7.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The above (biased) OLS regression indicates that the price elasticity is about -6.36. Knowing the vehicle segment turns out to be important to understanding the price elasticity (okay, it's not a huge difference, but a difference nonetheless). Here is the same regression without the segment information:
# OLS
auto_OLS2 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO', data=autos).fit(cov_type = 'HC3')
print(auto_OLS2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.909
Model:                            OLS   Adj. R-squared:                  0.908
Method:                 Least Squares   F-statistic:                     1133.
Date:                Mon, 28 Nov 2022   Prob (F-statistic):               0.00
Time:                        14:51:37   Log-Likelihood:                -1861.5
No. Observations:                1269   AIC:                             3735.
Df Residuals:                    1263   BIC:                             3766.
Df Model:                           5
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     27.0949      0.474     57.159      0.000      26.166      28.024
LOGP          -6.8719      0.127    -53.912      0.000      -7.122      -6.622
MPG            0.0261      0.005      5.649      0.000       0.017       0.035
HP           -16.4147      4.334     -3.787      0.000     -24.909      -7.920
DIESEL        -0.8058      0.090     -8.921      0.000      -0.983      -0.629
NONEURO        0.4518      0.066      6.871      0.000       0.323       0.581
==============================================================================
Omnibus:                      127.456   Durbin-Watson:                   1.781
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              167.280
Skew:                          -0.822   Prob(JB):                     4.74e-37
Kurtosis:                       3.677   Cond. No.                     5.98e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 5.98e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Suppose we didn't know the segment information. There seems to be information embedded in `SEGMENT` that is correlated with log-price. Let's see if we can replicate the vehicle segments using the other characteristics and K-Means clustering.
X = autos[['HP','SIZE', 'WEIGHT', 'MPG']].to_numpy()
# Try the model with 1, 2, ..., 10 clusters. Keep track of the ssd ('inertia')
y=[]
for n in range(1, 11):
kmeans = sklearn.cluster.KMeans(n_clusters=n, random_state=0).fit(X)
y.append(kmeans.inertia_)
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(range(1,11), y, color='blue', marker='.')
ax.set_xlabel('number of clusters (k)')
ax.set_ylabel('sum of squared distances to nearest cluster')
sns.despine(ax=ax)
I read the elbow graph as saying we should stop at three clusters.
kmeans = sklearn.cluster.KMeans(n_clusters=3, random_state=0).fit(X)
clusters = pd.DataFrame(kmeans.labels_, columns=['cluster'])
autos = pd.merge(left=autos, right=clusters, left_index=True, right_index=True, how='left')
So now run OLS again with the clusters we built based on vehicle characteristics.
# OLS
auto_OLS3 = smf.ols('LOGQ ~ LOGP + MPG + HP + DIESEL + NONEURO + C(cluster)', data=autos).fit(cov_type = 'HC3')
print(auto_OLS3.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   LOGQ   R-squared:                       0.916
Model:                            OLS   Adj. R-squared:                  0.916
Method:                 Least Squares   F-statistic:                     1115.
Date:                Mon, 28 Nov 2022   Prob (F-statistic):               0.00
Time:                        14:51:38   Log-Likelihood:                -1805.1
No. Observations:                1269   AIC:                             3626.
Df Residuals:                    1261   BIC:                             3667.
Df Model:                           7
Covariance Type:                  HC3
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          27.8570      0.452     61.628      0.000      26.971      28.743
C(cluster)[T.1]    -1.1963      0.121     -9.853      0.000      -1.434      -0.958
C(cluster)[T.2]    -0.3482      0.080     -4.325      0.000      -0.506      -0.190
LOGP               -6.5444      0.135    -48.485      0.000      -6.809      -6.280
MPG                -0.0072      0.006     -1.284      0.199      -0.018       0.004
HP                -17.5704      4.402     -3.992      0.000     -26.197      -8.944
DIESEL             -0.3768      0.095     -3.957      0.000      -0.563      -0.190
NONEURO             0.3493      0.064      5.491      0.000       0.225       0.474
==============================================================================
Omnibus:                      138.358   Durbin-Watson:                   1.836
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              185.560
Skew:                          -0.866   Prob(JB):                     5.08e-41
Kurtosis:                       3.712   Cond. No.                     6.13e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 6.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
results = [auto_OLS1.params['LOGP'],auto_OLS2.params['LOGP'],auto_OLS3.params['LOGP']]
params = pd.DataFrame({'Estimated Price':results},
index = ['Truth', 'No Clusters', 'Clusters'],
)
params
|   | Estimated Price |
|---|---|
| Truth | -6.357438 |
| No Clusters | -6.871897 |
| Clusters | -6.544450 |
Imputing the clusters based on observable characteristics enabled us to improve our estimated price elasticity!
4. The Idea Behind PCA
Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data. What happens is easiest to visualize by looking at a two-dimensional dataset.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X[:, 0], X[:, 1])
ax.axis('equal')
ax.set_title('A Hypothetical Data Set', fontsize=20)
sns.despine(ax=ax)
plt.show()
It's clear that there is a nearly linear relationship between the x and y variables. Rather than attempting to predict the y values from the x values, our unsupervised learning problem attempts to learn about the relationship between the x and y values.
In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.
Using Scikit-Learn's PCA
estimator, we can compute this as follows:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
PCA(n_components=2)
The fit learns some quantities from the data, most importantly the "components" and "explained variance":
print(pca.components_)
[[-0.94446029 -0.32862557] [-0.32862557 0.94446029]]
print(pca.explained_variance_)
[0.7625315 0.0184779]
To see what these numbers mean, let's visualize them as vectors over the input data, using the "components" to define the direction of the vector, and the "explained variance" to define the squared-length of the vector:
def draw_vector(v0, v1, ax=None):
ax = ax or plt.gca()
arrowprops=dict(arrowstyle='->',
linewidth=2,
shrinkA=0, shrinkB=0,color='black')
ax.annotate('', v1, v0, arrowprops=arrowprops)
# plot data
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
v = vector * 3 * np.sqrt(length)
draw_vector(pca.mean_, pca.mean_ + v)
ax.set_title('Interpreting the Components',fontsize=20)
sns.despine(ax=ax)
plt.show()
These vectors represent the principal axes of the data, and the length of the vector is an indication of how "important" that axis is in describing the distribution of the data—more precisely, it is a measure of the variance of the data when projected onto that axis. The projection of each data point onto the principal axes are the "principal components" of the data.
If we plot these principal components beside the original data, we get the following plots:
The left graph shows the principal axes of the data (ie, the black arrows / vectors). The right graph shows the data transformed as if they are in reference to the principal axes. It's as if we rotated the left graph until the long vector was horizontal and scaled the y-axis by the remaining vector. Mathematically, each point in the left graph has been rewritten in coordinates according to the vectors presented in the left graph. The result is the right graph.
This transformation from data axes to principal axes is an affine transformation, which basically means it is composed of a translation, rotation, and uniform scaling.
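The side-by-side figure described above can be reproduced with something along these lines (a sketch, since the original plotting code isn't shown here; `pca` is the two-component fit from above, and `transform` rotates the data into principal-axis coordinates):

# Rotate the data into the principal-axis coordinate system and plot both views.
X_pca = pca.transform(X)

fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].scatter(X[:, 0], X[:, 1], alpha=0.3)
ax[0].set_title('Original data (data axes)')
ax[0].axis('equal')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3)
ax[1].set_title('Transformed data (principal axes)')
ax[1].axis('equal')
sns.despine()
plt.show()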
While this algorithm to find principal components may seem like just a mathematical curiosity, it turns out to have very far-reaching applications in the world of machine learning and data exploration.
5. PCA as Dimensionality Reduction
Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.
Here is an example of using PCA as a dimensionality reduction transform:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape)
print("transformed shape:", X_pca.shape)
original shape: (200, 2) transformed shape: (200, 1)
The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:
fig, ax = plt.subplots(figsize=(10, 5))
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
sns.despine(ax=ax)
plt.show()
The light points are the original data, while the dark points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance. The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much "information" is discarded in this reduction of dimensionality.
This reduced-dimension dataset is in some senses "good enough" to encode the most important relationships between the points: despite reducing the dimension of the data by 50%, the overall relationship between the data points is mostly preserved.
PCA for visualization: Hand-written digits
The usefulness of the dimensionality reduction may not be entirely apparent in only two dimensions, but becomes much more clear when looking at high-dimensional data. To see this, let's take a quick look at the application of PCA to digits data.
We start by loading the data from scikit-learn:
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
(1797, 64)
As usual, let's visualize the data before delving deeper:
# set up the figure
fig = plt.figure(figsize=(8, 8)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the digits: each image is 8x8 pixels
for i in range(64):
ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
# label the image with the target value
ax.text(0, 7, str(digits.target[i]))
The data consists of 8×8 pixel images, meaning that they are 64-dimensional. To gain some intuition into the relationships between these points, we can use PCA to project the images from 64 dimensions down to a more manageable number of dimensions, say two:
pca = PCA(2) # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)
(1797, 64) (1797, 2)
We can now plot the first two principal components of each point to learn about the data:
fig, ax = plt.subplots(figsize=(10, 5))
plt.scatter(projected[:, 0], projected[:, 1],
c=digits.target, edgecolor='none', alpha=0.5)
ax.set_xlabel('Component 1',fontsize=18)
ax.set_ylabel('Component 2',fontsize=18)
ax.set_title('Plotting the First Two Components',fontsize=20)
plt.colorbar()
sns.despine(ax=ax)
plt.show()
What do these components mean? The full data is a 64-dimensional point cloud, and the points above are the projection of each data point along the directions with the largest variance.
Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits in two dimensions, and have done this in an unsupervised manner—that is, without reference to the labels.
What do the components mean?
We can go a bit further here, and begin to ask what the reduced dimensions mean. This meaning can be understood in terms of combinations of basis vectors. For example, each image in the training set is defined by a collection of 64 pixel values, which we will call the vector $x$:
$$ x = [x_1, x_2, x_3 \cdots x_{64}] $$

One way we can think about this is in terms of a pixel basis. That is, to construct the image, we multiply each element of the vector by the pixel it describes, and then add the results together to build the image:

$$ {\rm image}(x) = x_1 \cdot{\rm (pixel~1)} + x_2 \cdot{\rm (pixel~2)} + x_3 \cdot{\rm (pixel~3)} \cdots x_{64} \cdot{\rm (pixel~64)} $$

One way we might imagine reducing the dimension of this data is to zero out all but a few of these basis vectors. For example, if we use only the first eight pixels, we get an eight-dimensional projection of the data, but it is not very reflective of the whole image: we've thrown out nearly 90% of the pixels!
The upper row of panels shows the individual pixels, and the lower row shows the cumulative contribution of these pixels to the construction of the image. Using only eight of the pixel-basis components, we can only construct a small portion of the 64-pixel image. Were we to continue this sequence and use all 64 pixels, we would recover the original image.
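If you want to see just how little of the image survives, here is a quick sketch (my own illustration) that keeps the first eight pixel values of one digit and zeros out the remaining 56:

# Compare one full digit to the version built from only its first 8 pixels.
digit = digits.data[0]
truncated = np.zeros_like(digit)
truncated[:8] = digit[:8]

fig, ax = plt.subplots(1, 2, figsize=(6, 3), subplot_kw={'xticks': [], 'yticks': []})
ax[0].imshow(digit.reshape(8, 8), cmap='binary')
ax[0].set_title('All 64 pixels')
ax[1].imshow(truncated.reshape(8, 8), cmap='binary')
ax[1].set_title('First 8 pixels only')
plt.show()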
What basis to choose?
The pixel-wise representation is not the only choice of basis. We can use other basis functions, which each contain some pre-defined contribution from each pixel, and write something like
$$ {\rm image}(x) = {\rm mean} + x_1 \cdot{\rm (basis~1)} + x_2 \cdot{\rm (basis~2)} + x_3 \cdot{\rm (basis~3)} \cdots $$

PCA can be thought of as a process of choosing "optimal" basis functions. By optimal I mean the basis is chosen so that a small number of these functions captures as much of the variation in the original data as possible. The "principal components" are then simply the coefficients that multiply each of the elements in this series.
This figure shows a similar depiction of reconstructing this digit using the mean plus the first eight PCA basis functions:
Unlike the pixel basis, the PCA basis allows us to recover the salient features of the input image with just a mean plus eight components! The weight on each pixel in each component is the analog of the orientation of the vector in our two-dimensional example. This is the sense in which PCA provides a low-dimensional representation of the data: it discovers a set of basis functions that are more efficient than the native pixel basis of the input data.
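A reconstruction along those lines can be sketched as follows (fit eight components, project one digit, and invert the transform; `inverse_transform` adds the mean image back for us):

# Rebuild a digit from the mean image plus its first 8 principal components.
pca8 = PCA(n_components=8).fit(digits.data)
approx = pca8.inverse_transform(pca8.transform(digits.data[:1]))

fig, ax = plt.subplots(1, 2, figsize=(6, 3), subplot_kw={'xticks': [], 'yticks': []})
ax[0].imshow(digits.data[0].reshape(8, 8), cmap='binary')
ax[0].set_title('Original')
ax[1].imshow(approx[0].reshape(8, 8), cmap='binary')
ax[1].set_title('Mean + 8 components')
plt.show()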
6. Choosing the Number of Components
A vital part of using PCA in practice is the ability to estimate how many components are needed to describe the data. This can be determined by looking at the cumulative explained variance ratio as a function of the number of components:
pca = PCA().fit(digits.data)
fig, ax = plt.subplots(figsize=(10, 5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components',fontsize=18)
plt.ylabel('Cumulative Explained Variance',fontsize=18)
ax.set_title('Choosing the Number of Components',fontsize=20)
from matplotlib.ticker import PercentFormatter
# Format y-axis to include percent.
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=1))
sns.despine(ax=ax)
plt.show()
This curve quantifies how much of the total, 64-dimensional variance is contained within the first $N$ components. For example, we see that with the digits the first 10 components contain approximately 75% of the variance, while you need around 50 components to describe close to 100% of the variance.
Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations.
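Rather than reading the cutoff off the graph, you can also hand `PCA` a variance fraction and let it pick the number of components for you (the same trick we use in the noise-filtering section below); for example:

# Keep however many components are needed to retain 90% of the variance.
pca90 = PCA(0.90).fit(digits.data)
print(pca90.n_components_)   # roughly 20 components for the digits data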
7. PCA to Filter Noise
PCA can also be used as a filtering approach for noisy data. Why? PCA selects the basis functions ("principal components") that explain the most variation in the data. If there is noise in the data (by "noise" I mean variation that does not contribute to explaining the data in a systematic way), that noise will be concentrated in the least important components. Hence, you can reconstruct the data using just the largest subset of principal components and, in doing so, keep the signal and throw out most of the noise.
We'll demonstrate the idea with the digits data. First, plot several of the input "noise-free" data:
def plot_digits(data):
fig, axes = plt.subplots(4, 10, figsize=(10, 4),
subplot_kw={'xticks':[], 'yticks':[]},
gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
ax.imshow(data[i].reshape(8, 8),
cmap='binary', interpolation='nearest',
clim=(0, 16))
plot_digits(digits.data)
Now let's add some random noise to create a noisy dataset, and re-plot it:
np.random.seed(42)
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)
It's clear by eye that the images are noisy, and contain spurious pixels. Let's train a PCA on the noisy data, requesting that the projection preserve 50% of the variance:
pca = PCA(0.50).fit(noisy)
pca.n_components_
12
Here 50% of the variance amounts to 12 principal components. Now we compute these components, and then use the inverse of the transform to reconstruct the filtered digits:
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)
This signal preserving/noise filtering property makes PCA a very useful feature selection routine—for example, rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs.
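As a sketch of that workflow (the choice of logistic regression as the downstream classifier is mine; any classifier would do), you could chain the projection and the classifier in a pipeline:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Split the noisy digits, project to the components retaining 50% of the
# variance, then fit the classifier on the lower-dimensional representation.
X_train, X_test, y_train, y_test = train_test_split(noisy, digits.target, random_state=0)

model = make_pipeline(PCA(0.50), LogisticRegression(max_iter=5000))
model.fit(X_train, y_train)
print('test accuracy:', model.score(X_test, y_test))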
8. Discussion
Using PCA in the Data Scientist's Workflow
We've used PCA to reduce the dimensionality of data in order to build a supervised model more efficiently. Starting with PCA on a high-dimensional data set is also useful for visualizing variation in the data and for understanding the intrinsic dimensionality embedded in the data (by plotting the explained variance ratio).
Weaknesses
PCA's main weakness is that it tends to be highly affected by outliers in the data.
Practice
You'll practice using PCA to create a facial recognition program in your problem set.