Lecture 10: Principal Component Analysis (PCA) [FINISHED]
In this lecture, we'll explore what is perhaps the most broadly used unsupervised algorithm: principal component analysis (PCA). PCA is fundamentally a dimensionality reduction algorithm (ie, an "unsupervised" learning algorithm) whose output is often used as an input to supervised models. It can also be useful as a tool for:
- visualization,
- noise filtering,
- feature extraction and engineering.
Because of its versatility and interpretability, PCA has been shown to be effective in a wide variety of contexts and disciplines. You will often see it used to simplify covariate data in finance. If you've had linear algebra, what we'll be doing in this lecture with PCA is building a set of basis vectors that span the data.
The agenda for today is as follows:
1. The Idea
2. PCA as Dimensionality Reduction
3. Choosing the number of components
4. PCA to Filter Noise
5. Discussion
An added feature of this notebook is that we'll see that models, particularly machine learning models, can be applied to a variety of data types. Put differently, we'll see that "data" is a very broad term -- almost anything can become "data," which means we can train a model on almost anything. Today we will use images as data and build some very simple optical character recognition (OCR) and facial recognition models. In upcoming lectures we will consider text as data.
Class Announcements
- PS4 is on eLC and is due end-of-day Tuesday, November 22, 2022. It's more time-consuming than PS3 but not as bad as PS2 (ie, no data work required). Wednesday office hours this week plus extra zoom office hours Monday (11/21) from 10:00 am to 12:00 pm via the usual zoom link.
- Makeup class this Friday from 9:35 to 10:50 in Correll 321. This is not obligatory and, as usual, I will post the lecture notebook plus the solution notebook on the class website.
1. The Idea
Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data. Its behavior is easiest to visualize by looking at a two-dimensional dataset.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X[:, 0], X[:, 1])
ax.axis('equal')
ax.set_title('A Hypothetical Data Set', fontsize=20)
sns.despine(ax=ax)
plt.show()
It's clear that there is a nearly linear relationship between the x and y variables. Rather than attempting to predict the y values from the x values, our unsupervised learning problem attempts to learn about the relationship between the x and y values.
In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.
Using Scikit-Learn's PCA estimator, we can compute this as follows:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
PCA(n_components=2)
The fit learns some quantities from the data, most importantly the "components" and "explained variance":
print(pca.components_)
[[-0.94446029 -0.32862557]
 [-0.32862557  0.94446029]]
print(pca.explained_variance_)
[0.7625315 0.0184779]
To see what these numbers mean, let's visualize them as vectors over the input data, using the "components" to define the direction of the vector, and the "explained variance" to define the squared-length of the vector:
def draw_vector(v0, v1, ax=None):
ax = ax or plt.gca()
arrowprops=dict(arrowstyle='->',
linewidth=2,
shrinkA=0, shrinkB=0,color='black')
ax.annotate('', v1, v0, arrowprops=arrowprops)
# plot data
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
v = vector * 3 * np.sqrt(length)
draw_vector(pca.mean_, pca.mean_ + v)
ax.set_title('Interpreting the Components',fontsize=20)
sns.despine(ax=ax)
plt.show()
These vectors represent the principal axes of the data, and the length of the vector is an indication of how "important" that axis is in describing the distribution of the data—more precisely, it is a measure of the variance of the data when projected onto that axis. The projection of each data point onto the principal axes are the "principal components" of the data.
If we plot these principal components beside the original data, we get the following plots:
The left graph shows the principal axes of the data (ie, the black arrows / vectors). The right graph shows the same data expressed relative to those principal axes. It's as if we rotated the left graph until the long vector was horizontal and scaled the y-axis by the remaining vector. Mathematically, each point in the left graph has been rewritten in coordinates defined by the vectors shown in the left graph. The result is the right graph.
This transformation from data axes to principal axes is an affine transformation, which basically means it is composed of a translation, rotation, and uniform scaling.
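To make this change of basis concrete, here is a quick sketch (using the two-component pca object fit above; the variable names X_rotated and X_manual are my own) confirming that pca.transform simply centers the data and projects it onto the component vectors:
# Express the data in principal-axes coordinates in two equivalent ways:
X_rotated = pca.transform(X)                      # scikit-learn's transform
X_manual = (X - pca.mean_) @ pca.components_.T    # center, then project onto each axis
print(np.allclose(X_rotated, X_manual))           # should print True
# In the new coordinates, the variances along each axis match the explained_variance_ attribute:
print(np.var(X_rotated, axis=0, ddof=1))
print(pca.explained_variance_)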
While this algorithm to find principal components may seem like just a mathematical curiosity, it turns out to have very far-reaching applications in the world of machine learning and data exploration.
2. PCA as Dimensionality Reduction
Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.
Here is an example of using PCA as a dimensionality reduction transform:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape)
print("transformed shape:", X_pca.shape)
original shape:  (200, 2)
transformed shape: (200, 1)
The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:
fig, ax = plt.subplots(figsize=(10, 5))
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
sns.despine(ax=ax)
plt.show()
The light points are the original data, while the dark points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance. The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much "information" is discarded in this reduction of dimensionality.
This reduced-dimension dataset is in some sense "good enough" to encode the most important relationships between the points: despite reducing the dimensionality of the data by 50%, the overall relationship between the data points is mostly preserved.
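To put a number on how much information the one-component projection keeps, we can check the explained variance ratio of the fit above; a quick sketch (the exact figure depends on the random data we generated):
retained = pca.explained_variance_ratio_[0]   # share of total variance kept by the single component
print(f"Variance retained:  {retained:.1%}")
print(f"Variance discarded: {1 - retained:.1%}")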
PCA for visualization: Hand-written digits
The usefulness of the dimensionality reduction may not be entirely apparent in only two dimensions, but becomes much more clear when looking at high-dimensional data. To see this, let's take a quick look at the application of PCA to digits data.
We start by loading the data from scikit-learn:
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
(1797, 64)
As usual, let's visualize the data before delving deeper:
# set up the figure
fig = plt.figure(figsize=(8, 8)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the digits: each image is 8x8 pixels
for i in range(64):
ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
# label the image with the target value
ax.text(0, 7, str(digits.target[i]))
The data consists of 8×8 pixel images, meaning that they are 64-dimensional. To gain some intuition into the relationships between these points, we can use PCA to project the images from 64 dimensions down to a more manageable number of dimensions, say two:
pca = PCA(2) # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)
(1797, 64)
(1797, 2)
We can now plot the first two principal components of each point to learn about the data:
fig, ax = plt.subplots(figsize=(10, 5))
plt.scatter(projected[:, 0], projected[:, 1],
c=digits.target, edgecolor='none', alpha=0.5)
ax.set_xlabel('Component 1',fontsize=18)
ax.set_ylabel('Component 2',fontsize=18)
ax.set_title('Plotting the First Two Components',fontsize=20)
plt.colorbar()
sns.despine(ax=ax)
plt.show()
What do these components mean? The full dataset is a 64-dimensional point cloud, and the points above are the projection of each data point along the two directions with the largest variance.
Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits in two dimensions, and have done this in an unsupervised manner—that is, without reference to the labels.
What do the components mean?
We can go a bit further here, and begin to ask what the reduced dimensions mean. This meaning can be understood in terms of combinations of basis vectors. For example, each image in the training set is defined by a collection of 64 pixel values, which we will call the vector $x$:
$$ x = [x_1, x_2, x_3 \cdots x_{64}] $$
One way we can think about this is in terms of a pixel basis. That is, to construct the image, we multiply each element of the vector by the pixel it describes, and then add the results together to build the image:
$$ {\rm image}(x) = x_1 \cdot{\rm (pixel~1)} + x_2 \cdot{\rm (pixel~2)} + x_3 \cdot{\rm (pixel~3)} \cdots x_{64} \cdot{\rm (pixel~64)} $$
One way we might imagine reducing the dimension of this data is to zero out all but a few of these basis vectors. For example, if we use only the first eight pixels, we get an eight-dimensional projection of the data, but it is not very reflective of the whole image: we've thrown out nearly 90% of the pixels!
The upper row of panels shows the individual pixels, and the lower row shows the cumulative contribution of these pixels to the construction of the image. Using only eight of the pixel-basis components, we can only construct a small portion of the 64-pixel image. Were we to continue this sequence and use all 64 pixels, we would recover the original image.
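The figure itself isn't reproduced here, but a minimal sketch of the idea is easy to write down: keep only the first eight pixel values of one digit, zero out the remaining 56, and compare the result to the full image (the variable names below are my own):
x = digits.data[0]
x_truncated = np.zeros_like(x)
x_truncated[:8] = x[:8]    # keep the first eight pixel-basis components only
fig, axes = plt.subplots(1, 2, figsize=(6, 3), subplot_kw={'xticks': [], 'yticks': []})
axes[0].imshow(x_truncated.reshape(8, 8), cmap='binary')
axes[0].set_title('First 8 pixels only')
axes[1].imshow(x.reshape(8, 8), cmap='binary')
axes[1].set_title('All 64 pixels')
plt.show()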
What basis to choose?
The pixel-wise representation is not the only choice of basis. We can use other basis functions, which each contain some pre-defined contribution from each pixel, and write something like
$$ {\rm image}(x) = {\rm mean} + x_1 \cdot{\rm (basis~1)} + x_2 \cdot{\rm (basis~2)} + x_3 \cdot{\rm (basis~3)} \cdots $$
PCA can be thought of as a process of choosing "optimal" basis functions. By optimal I mean that, for any given number of components, the PCA basis captures as much of the variance in the original data as possible, so relatively few basis functions are needed to approximate the data well. The "principal components" then are simply the coefficients that multiply each of the elements in this series.
This figure shows a similar depiction of reconstructing this digit using the mean plus the first eight PCA basis functions:
Unlike the pixel basis, the PCA basis allows us to recover the salient features of the input image with just a mean plus eight components! The amount of each pixel in each component is the analog of the orientation of the vector in our two-dimensional example. This is the sense in which PCA provides a low-dimensional representation of the data: it discovers a set of basis functions that are more efficient than the native pixel basis of the input data.
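As a rough stand-in for that figure, here is a sketch that fits an eight-component PCA on the digits and rebuilds one image from the mean plus those eight basis functions (pca8, coeffs, and approx are my own names):
pca8 = PCA(n_components=8).fit(digits.data)
coeffs = pca8.transform(digits.data[:1])            # the eight principal components of the first digit
approx = pca8.mean_ + coeffs @ pca8.components_     # mean + sum of coefficient times basis function
# (equivalently: approx = pca8.inverse_transform(coeffs))
fig, axes = plt.subplots(1, 2, figsize=(6, 3), subplot_kw={'xticks': [], 'yticks': []})
axes[0].imshow(approx.reshape(8, 8), cmap='binary')
axes[0].set_title('Mean + 8 PCA components')
axes[1].imshow(digits.data[0].reshape(8, 8), cmap='binary')
axes[1].set_title('Original')
plt.show()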
3. Choosing the number of components
A vital part of using PCA in practice is the ability to estimate how many components are needed to describe the data. This can be determined by looking at the cumulative explained variance ratio as a function of the number of components:
pca = PCA().fit(digits.data)
fig, ax = plt.subplots(figsize=(10, 5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components',fontsize=18)
plt.ylabel('Cumulative Explained Variance',fontsize=18)
ax.set_title('Choosing the Number of Components',fontsize=20)
# Format y-axis ticks as percentages (PercentFormatter avoids the FixedFormatter warning).
from matplotlib.ticker import PercentFormatter
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=1))
sns.despine(ax=ax)
plt.show()
This curve quantifies how much of the total, 64-dimensional variance is contained within the first $N$ components. For example, we see that with the digits the first 10 components contain approximately 75% of the variance, while you need around 50 components to describe close to 100% of the variance.
Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations.
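Rather than reading these thresholds off the plot, we can compute them directly. A quick sketch (the exact counts depend on the data but should agree with the figure):
cumulative = np.cumsum(pca.explained_variance_ratio_)
for target in (0.75, 0.90, 0.99):
    n_needed = np.argmax(cumulative >= target) + 1   # first index where the threshold is crossed
    print(f"{target:.0%} of the variance: {n_needed} components")
# Alternatively, pass the target variance directly and let scikit-learn pick the count:
print(PCA(0.90).fit(digits.data).n_components_)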
4. PCA to Filter Noise
PCA can also be used as a filtering approach for noisy data. Why? PCA selects the basis functions ("principal components") that explain the most variance in the data. If there is noise in the data (by "noise" I mean variation that does not contribute to explaining the data in a systematic way), that noise ends up concentrated in the low-variance components. Hence, if you reconstruct the data using just the largest subset of principal components, you keep the signal and throw out most of the noise.
We'll demonstrate the idea with the digits data. First, let's plot several of the "noise-free" input digits:
def plot_digits(data):
fig, axes = plt.subplots(4, 10, figsize=(10, 4),
subplot_kw={'xticks':[], 'yticks':[]},
gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
ax.imshow(data[i].reshape(8, 8),
cmap='binary', interpolation='nearest',
clim=(0, 16))
plot_digits(digits.data)
Now let's add some random noise to create a noisy dataset, and re-plot it:
np.random.seed(42)
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)
It's clear by eye that the images are noisy, and contain spurious pixels. Let's train a PCA on the noisy data, requesting that the projection preserve 50% of the variance:
pca = PCA(0.50).fit(noisy)
pca.n_components_
12
Here 50% of the variance amounts to 12 principal components. Now we compute these components, and then use the inverse of the transform to reconstruct the filtered digits:
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)
This signal preserving/noise filtering property makes PCA a very useful feature selection routine—for example, rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs.
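One way to check that the reconstruction really does keep the signal and drop the noise is to compare both versions against the original, noise-free digits. A quick sketch (the exact numbers will vary):
mse_noisy = np.mean((noisy - digits.data) ** 2)        # error introduced by the added noise
mse_filtered = np.mean((filtered - digits.data) ** 2)  # error remaining after the PCA filter
print(f"MSE of noisy data:    {mse_noisy:.2f}")
print(f"MSE of filtered data: {mse_filtered:.2f}")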
Practice
Here we will use PCA to support a support vector machine (SVM) classification problem -- specifically, we'll build a facial recognition program using the Labeled Faces in the Wild dataset available through scikit-learn. This dataset consists of several thousand collated photos of various public figures. We'll build a model to match names to the faces.
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush' 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
Let's plot a few of these faces to see what we're working with:
fig, ax = plt.subplots(3, 5, figsize=(10,15))
for i, axi in enumerate(ax.flat):
axi.imshow(faces.images[i], cmap='bone')
axi.set(xticks=[], yticks=[],
xlabel=faces.target_names[faces.target[i]])
plt.subplots_adjust(hspace=-0.65)
plt.show()
There are 1,348 images and each image is 62×47 pixels, or nearly 3,000 pixels in total. Our goal will be to develop a facial recognition model which attaches a name to a facial image. We could proceed by simply using each pixel value as a feature, but 3,000 features is a lot and we only have 1,348 images, so we really wouldn't be successful going this route. Note that in a regression setting our X matrix would have more columns than rows and so couldn't be full (column) rank.
Below we'll instead use a preprocessor to extract more meaningful features -- we'll use PCA to extract the most important fundamental components to feed into a support vector machine classifier (SVM).
Let's get to work:
- Split the data into a training and testing set. Set the random_state to 10. Note that the data should be split before we fit PCA so that no information from the test set leaks into the principal components.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(faces.data, faces.target,
test_size=0.2,
random_state=10)
- Take a look at the principal axes that span this dataset. Because this is a large dataset, we'll apply a randomized method to approximate the first $N$ principal components. This enables a faster implementation of PCA and is very useful for high-dimensional data like this (nb, the dimensionality here is roughly 3,000). We will take a look at the first 150 components. Do this by applying the code below. Note that we develop the principal components with the training data and then transform the testing data using these basis functions.
pca = PCA(n_components=150,svd_solver='randomized',random_state=10)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
- Plot the cumulative explained variance ratio as a function of the number of components.
fig, ax = plt.subplots(figsize=(10, 5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
ax.set_xlabel('Number of Components',fontsize=18)
ax.set_ylabel('')
ax.set_title('Cumulative Explained Variance',fontsize=20)
ax.set_ylim([0,1])
ax.set_xlim([0,160])
# Format y-axis ticks as percentages (PercentFormatter avoids the FixedFormatter warning).
from matplotlib.ticker import PercentFormatter
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=1))
sns.despine(ax=ax)
plt.show()
- Interpret the figure you created.
Answer: We only need a small number of components to capture a lot of information about the images.
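We can attach a number to that statement by summing the explained variance ratios of the 150 components we kept; a one-line check (the exact figure depends on the train/test split):
print(f"Variance captured by 150 components: {pca.explained_variance_ratio_.sum():.1%}")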
- Fit an SVM model to classify the images.
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train_pca, y_train); # Train the model using the training sets
- Report how well the model did and discuss why PCA worked.
from sklearn import metrics
y_pred = svc.predict(X_test_pca) # Predict the response for test dataset
# Model Accuracy: how often is the classifier correct?
print(f'The SVM facial recognition model is correct {100*metrics.accuracy_score(y_test, y_pred):.2f}% of the time.')
The SVM facial recognition model is correct 75.19% of the time.
- Generate the Confusion Matrix and discuss where the model does well and where it does poorly.
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=faces.target_names,
yticklabels=faces.target_names)
ax.set_xlabel('\nTRUE LABEL',fontsize=18)
ax.set_ylabel('PREDICTED LABEL',fontsize=18)
ax.set_title('Confusion Matrix\n',fontsize=20)
plt.show()
- Improve the model.
- Try different kernels (see the documentation for possibilities).
- Tune the 'hyperparameters' via a grid search plus cross-validation. SVM has two parameters you can tune to the data: adjust C (which controls the margin hardness) and gamma (which controls the size of the radial basis function kernel).
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
pca = PCA(svd_solver='randomized', random_state=10)  # A fresh PCA instance for the pipeline
svc = SVC()                                           # A fresh SVM instance for the pipeline
model = make_pipeline(pca, svc)                       # Make a "pipeline": run PCA first, then the SVC
# Define grid of values to do CV.
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005],
              'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'pca__n_components': [100, 150, 175, 200],
              'pca__whiten': [True, False]}  # Use booleans here -- the strings 'True'/'False' both evaluate as truthy.
                                             # We are training on images, so much of the raw input is redundant
                                             # since adjacent pixel values are highly correlated.
                                             # "Whitening" rescales the principal components to unit variance,
                                             # which removes this redundancy before the data reach the SVC.
grid = GridSearchCV(model, param_grid) # Conduct cross-validation via grid search
# Fit the model via cross-validation on the training data
grid.fit(X_train, y_train)
print('The optimal parameters are:')
print(grid.best_params_)
The optimal parameters are:
{'pca__n_components': 100, 'pca__whiten': True, 'svc__C': 5, 'svc__gamma': 0.005, 'svc__kernel': 'rbf'}
model = grid.best_estimator_ # Recover the best model pipeline
X_test_pca = model['pca'].transform(X_test)
y_pred2 = model['svc'].predict(X_test_pca) # Predict the response for test dataset.
# Model Accuracy: how often is the classifier correct?
print(f'Initial Model: {100*metrics.accuracy_score(y_test, y_pred):.2f}% correct facial recognition.')
print(f'Tuned Model: {100*metrics.accuracy_score(y_test, y_pred2):.2f}% correct facial recognition.')
Initial Model: 75.19% correct facial recognition.
Tuned Model: 81.85% correct facial recognition.
Let's take a look at a few of the test images along with their predicted values:
fig, ax = plt.subplots(4, 6,figsize=(15, 10))
for i, axi in enumerate(ax.flat):
axi.imshow(X_test[i].reshape(62, 47), cmap='bone')
axi.set(xticks=[], yticks=[])
axi.set_ylabel(faces.target_names[y_pred2[i]].split()[-1],
color='black' if y_pred2[i] == y_test[i] else 'red')
fig.subplots_adjust(top=0.90)
fig.suptitle('Predicted Names\n(Incorrect Labels in Red)', size=20)
plt.show()
We can get a better sense of our estimator's performance using the classification report, which lists recovery statistics label by label:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred2,
target_names=faces.target_names))
                   precision    recall  f1-score   support

     Ariel Sharon       0.81      0.76      0.79        17
     Colin Powell       0.75      0.93      0.83        41
  Donald Rumsfeld       0.94      0.62      0.75        24
    George W Bush       0.84      0.90      0.87       117
Gerhard Schroeder       0.73      0.73      0.73        22
      Hugo Chavez       0.73      0.85      0.79        13
Junichiro Koizumi       1.00      0.40      0.57         5
       Tony Blair       0.91      0.68      0.78        31

         accuracy                           0.82       270
        macro avg       0.84      0.73      0.76       270
     weighted avg       0.83      0.82      0.81       270
We might also display the confusion matrix between these classes:
mat = confusion_matrix(y_test, y_pred2)
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=faces.target_names,
yticklabels=faces.target_names)
ax.set_xlabel('\nTRUE LABEL',fontsize=18)
ax.set_ylabel('PREDICTED LABEL',fontsize=18)
ax.set_title('Confusion Matrix\n',fontsize=20)
plt.show()
5. Discussion
Using PCA in the Data Scientist Workflow
We've used PCA to reduce the dimensionality of data in order to build a supervised model more efficiently. Starting with PCA on a high-dimensional dataset is also useful for visualizing the variation in the data and for understanding its intrinsic dimensionality (by plotting the explained variance ratio).
Weaknesses
PCA's main weakness is that it tends to be highly affected by outliers in the data.
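To see this sensitivity concretely, here is a toy sketch (my own example, reusing the two-dimensional X from Section 1) that adds a single extreme point and refits PCA; the leading principal axis can swing noticeably toward the outlier:
pca_clean = PCA(n_components=2).fit(X)
X_outlier = np.vstack([X, [10, -10]])     # one extreme, off-trend point
pca_dirty = PCA(n_components=2).fit(X_outlier)
print("First principal axis, clean data:  ", np.round(pca_clean.components_[0], 3))
print("First principal axis, with outlier:", np.round(pca_dirty.components_[0], 3))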