Lecture 24: Classification
Classification is a task that requires the use of machine learning algorithms to learn how to assign a class label to examples from the problem domain. We may want to classify:
- email as "spam" or "not spam".
- a tumor as "benign" or "malignant".
- a flower as a particular species.
- which car a person is likely to buy.
The first two examples are binary -- the outcome is either true or not. The last two examples are multi-class -- there are several different discrete values the outcome could take.
There are many different types of classification tasks that you may encounter in machine learning, and specialized modeling approaches may be used for each. In this lecture we'll touch on some of the more popular ones.
Process
The process for fitting a model is:
(i) Initialize the model by creating a model object.
(ii) Fit the model on training data.
(iii) Test model accuracy by comparing test data predictions with test data results.
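The three steps above can be sketched as follows. This is only a minimal sketch: the data are synthetic (generated with make_classification as a stand-in, not the lecture's data) and LogisticRegression is used purely as an example model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the lecture data)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=10)

# (i) Initialize the model by creating a model object
model = LogisticRegression()

# (ii) Fit the model on training data
model.fit(X_train, y_train)

# (iii) Compare test-set predictions with the known test-set outcomes
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Test accuracy: {accuracy:.2f}')
```

Every model used in this lecture follows the same create, fit, predict pattern, so only the model object in step (i) changes.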
Note: Each model has what are called "hyperparameters" which we could "tune" using cross-validation as in the last lecture. For expediency, we won't do that in this lecture but you should in a real-world application.
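For reference, one way such tuning might look is a cross-validated grid search. The sketch below assumes synthetic data and a made-up grid of values for C (the inverse regularization strength of LogisticRegression); it is not necessarily the approach used in the last lecture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=10)

# Hypothetical grid of candidate values for the hyperparameter C
param_grid = {'C': [0.01, 0.1, 1, 10]}

# 5-fold cross-validation on the training data picks the best C
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(search.best_params_)
print(f'Test accuracy of tuned model: {search.score(X_test, y_test):.2f}')
```

Note that the test data are only touched once, after the hyperparameter has been chosen on the training folds.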
Class Announcements
PS5 is on eLC. New deadline: Tuesday, November 22nd.
1.1 Load and Explore Data on Breast Cancer
sklearn comes with a bunch of pre-loaded datasets. You can see the available datasets here. Some of these we've already used in their csv form. Let's load the data on breast cancer by using the following function to transform the data into a dataframe. This isn't required to fit our models but let's stick with pandas since it's so powerful.
import pandas as pd
import numpy as np
from sklearn import datasets
# Load data as a dataframe
def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df
cancer = sklearn_to_df(datasets.load_breast_cancer())
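# Caution: scikit-learn codes this target as 0 = malignant and 1 = benign,
# so the 'malignant' column created here equals 1 for benign tumors.
cancer.rename(columns={'target':'malignant'},inplace=True)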
cancer.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | malignant |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
5 rows × 31 columns
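Before interpreting any results it is worth checking how the outcome is coded. A minimal check, using the target_names attribute that ships with the dataset (the variable names raw and counts are just for illustration):

```python
import numpy as np
from sklearn import datasets

raw = datasets.load_breast_cancer()

# Class names listed in the order of the integer codes 0, 1
print(raw.target_names)

# Number of observations with code 0 and with code 1
counts = np.bincount(raw.target)
print(dict(zip(raw.target_names, counts)))
```

In scikit-learn's copy of the data, code 0 is malignant and code 1 is benign, so a value of 1 in the outcome column marks a benign tumor.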
The data descriptions are here. Let's take a look at the data:
cancer.columns
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension', 'malignant'], dtype='object')
cancer.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
mean radius | 569.0 | 14.127292 | 3.524049 | 6.981000 | 11.700000 | 13.370000 | 15.780000 | 28.11000 |
mean texture | 569.0 | 19.289649 | 4.301036 | 9.710000 | 16.170000 | 18.840000 | 21.800000 | 39.28000 |
mean perimeter | 569.0 | 91.969033 | 24.298981 | 43.790000 | 75.170000 | 86.240000 | 104.100000 | 188.50000 |
mean area | 569.0 | 654.889104 | 351.914129 | 143.500000 | 420.300000 | 551.100000 | 782.700000 | 2501.00000 |
mean smoothness | 569.0 | 0.096360 | 0.014064 | 0.052630 | 0.086370 | 0.095870 | 0.105300 | 0.16340 |
mean compactness | 569.0 | 0.104341 | 0.052813 | 0.019380 | 0.064920 | 0.092630 | 0.130400 | 0.34540 |
mean concavity | 569.0 | 0.088799 | 0.079720 | 0.000000 | 0.029560 | 0.061540 | 0.130700 | 0.42680 |
mean concave points | 569.0 | 0.048919 | 0.038803 | 0.000000 | 0.020310 | 0.033500 | 0.074000 | 0.20120 |
mean symmetry | 569.0 | 0.181162 | 0.027414 | 0.106000 | 0.161900 | 0.179200 | 0.195700 | 0.30400 |
mean fractal dimension | 569.0 | 0.062798 | 0.007060 | 0.049960 | 0.057700 | 0.061540 | 0.066120 | 0.09744 |
radius error | 569.0 | 0.405172 | 0.277313 | 0.111500 | 0.232400 | 0.324200 | 0.478900 | 2.87300 |
texture error | 569.0 | 1.216853 | 0.551648 | 0.360200 | 0.833900 | 1.108000 | 1.474000 | 4.88500 |
perimeter error | 569.0 | 2.866059 | 2.021855 | 0.757000 | 1.606000 | 2.287000 | 3.357000 | 21.98000 |
area error | 569.0 | 40.337079 | 45.491006 | 6.802000 | 17.850000 | 24.530000 | 45.190000 | 542.20000 |
smoothness error | 569.0 | 0.007041 | 0.003003 | 0.001713 | 0.005169 | 0.006380 | 0.008146 | 0.03113 |
compactness error | 569.0 | 0.025478 | 0.017908 | 0.002252 | 0.013080 | 0.020450 | 0.032450 | 0.13540 |
concavity error | 569.0 | 0.031894 | 0.030186 | 0.000000 | 0.015090 | 0.025890 | 0.042050 | 0.39600 |
concave points error | 569.0 | 0.011796 | 0.006170 | 0.000000 | 0.007638 | 0.010930 | 0.014710 | 0.05279 |
symmetry error | 569.0 | 0.020542 | 0.008266 | 0.007882 | 0.015160 | 0.018730 | 0.023480 | 0.07895 |
fractal dimension error | 569.0 | 0.003795 | 0.002646 | 0.000895 | 0.002248 | 0.003187 | 0.004558 | 0.02984 |
worst radius | 569.0 | 16.269190 | 4.833242 | 7.930000 | 13.010000 | 14.970000 | 18.790000 | 36.04000 |
worst texture | 569.0 | 25.677223 | 6.146258 | 12.020000 | 21.080000 | 25.410000 | 29.720000 | 49.54000 |
worst perimeter | 569.0 | 107.261213 | 33.602542 | 50.410000 | 84.110000 | 97.660000 | 125.400000 | 251.20000 |
worst area | 569.0 | 880.583128 | 569.356993 | 185.200000 | 515.300000 | 686.500000 | 1084.000000 | 4254.00000 |
worst smoothness | 569.0 | 0.132369 | 0.022832 | 0.071170 | 0.116600 | 0.131300 | 0.146000 | 0.22260 |
worst compactness | 569.0 | 0.254265 | 0.157336 | 0.027290 | 0.147200 | 0.211900 | 0.339100 | 1.05800 |
worst concavity | 569.0 | 0.272188 | 0.208624 | 0.000000 | 0.114500 | 0.226700 | 0.382900 | 1.25200 |
worst concave points | 569.0 | 0.114606 | 0.065732 | 0.000000 | 0.064930 | 0.099930 | 0.161400 | 0.29100 |
worst symmetry | 569.0 | 0.290076 | 0.061867 | 0.156500 | 0.250400 | 0.282200 | 0.317900 | 0.66380 |
worst fractal dimension | 569.0 | 0.083946 | 0.018061 | 0.055040 | 0.071460 | 0.080040 | 0.092080 | 0.20750 |
malignant | 569.0 | 0.627417 | 0.483918 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.00000 |
import matplotlib.pyplot as plt
import seaborn as sns
# g = sns.pairplot(cancer, hue='malignant') # Run this on your own. It takes a while...
Split the Data into Training and Testing Data Sets
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
cancer.drop('malignant',axis=1),
cancer['malignant'],
test_size=0.20,
random_state=10)
1.2 Logistic Regression
Logistic regression passes the linear model through a non-linear function that constrains the output to lie between zero and one. In the logistic case, the function looks like
$$\text{Prob}(Y=1|X) = \frac{\exp \left({\beta_0+\beta_1 X}\right)}{1+\exp \left({\beta_0+\beta_1 X}\right)}.$$
Passing the inputs through this function returns a probability score between 0 and 1, which the classifier then turns into a class label. We're currently considering the binary case where Y=1 or Y=0. We decide on a threshold value: if the predicted probability is at or above the threshold we classify the observation into Class One (Y=1), and if it falls below the threshold we classify it into Class Two (Y=0). Here, we predict the outcome variable is true whenever $\text{Prob}(Y=1|X) \ge 0.5$. Graphically, this amounts to:
X = np.arange(-8, 8, 0.1);
# Determine Y
Y = 1/(1+np.exp(-X))
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.axhline(y=0.5, color='red',linewidth=1,ls='--')
ax.annotate('Class One: Y=1',xy=(-6,.6),va='center',ha='left',size=18)
ax.annotate('Class Two: Y=0',xy=(-6,.4),va='center',ha='left',size=18)
ax.axhspan(.5, 1, alpha=0.2, color='blue')
ax.plot(X,Y, color = 'black')
ax.set_ylim(0,1)
ax.set_yticks(np.arange(0, 1.01, step=0.1))
ax.set_xlim(-8,8)
ax.set_xlabel('Independent Variable (X)',size=14)
ax.set_ylabel('Dependent Variable (Y)',size=14)
ax.set_title('Logistic Function and Decision Rule',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
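To make the decision rule concrete, here is a small worked example. The coefficients beta0 and beta1 are hypothetical values picked only for illustration, and logistic_prob is a helper defined here, not a library function.

```python
import numpy as np

def logistic_prob(x, beta0, beta1):
    """Prob(Y=1|X=x) under the one-feature logistic model."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1 + np.exp(z))

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -1.0, 2.0

x = np.array([-2.0, 0.0, 0.5, 3.0])
prob = logistic_prob(x, beta0, beta1)

# Decision rule: predict Y=1 whenever Prob(Y=1|X) >= 0.5
pred = (prob >= 0.5).astype(int)

print(np.round(prob, 3))
print(pred)
```

With these coefficients the linear index is zero at x=0.5, so the probability there is exactly 0.5 and the observation sits right on the decision boundary.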
When we fit the model to the training data, the algorithm chooses $\beta$ such that:
$$ \hat\beta = \arg\max_\beta \sum_i \Big[y_i\log\big(f(x_i,\beta)\big) + (1-y_i)\log\big(1-f(x_i,\beta)\big)\Big] $$
where
$$ f(x_i,\beta)=\frac{\exp \left({\beta_0+\beta_1 x_i}\right)}{1+\exp \left({\beta_0+\beta_1 x_i}\right)} $$
What's happening here? In the first equation, the optimal $\beta$ generates predictions from $f(X,\beta)$ that are as consistent as possible with the observed outcomes (y).
Some things to note:
- The objective function in the first equation is called a log-likelihood since it measures how "likely" the observed outcomes (y) are under the model.
- The computer actually solves for the minimum, so replace the maximization with a minimization and put a negative sign in front of the $\sum$ to get what the computer is actually doing.
- In the math above there is only one feature in X but of course that's just a simplification. There can be many features!
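The points above can be illustrated by minimizing the negative log-likelihood directly. This is a sketch under stated assumptions: the data are simulated from a logistic model with made-up coefficients (true_beta), and scipy.optimize.minimize stands in for whatever solver a library uses internally.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated data from a logistic model with known (hypothetical) coefficients
true_beta = np.array([-0.5, 1.5])
x = rng.normal(size=2000)
p = 1 / (1 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = rng.binomial(1, p)

def neg_log_likelihood(beta):
    z = beta[0] + beta[1] * x
    # log f = z - log(1+e^z) and log(1-f) = -log(1+e^z), so
    # -[y*log f + (1-y)*log(1-f)] = log(1+e^z) - y*z
    return np.sum(np.logaddexp(0, z) - y * z)

# Minimizing the negative log-likelihood is the same as maximizing the log-likelihood
result = minimize(neg_log_likelihood, x0=np.zeros(2))
beta_hat = result.x
print(np.round(beta_hat, 2))
```

The estimates should land close to the coefficients used to simulate the data, with the gap shrinking as the sample grows.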
Using Logistic Regression to Diagnose Cancer
Let's see how well the logistic model diagnoses breast cancer.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Fit model based on training data
logit = LogisticRegression(random_state=0,solver='liblinear')
logit.fit(X_train, y_train)
# making predictions on the testing set
y_pred = logit.predict(X_test)
# Model Accuracy: how often is the classifier correct?
print(f'The Logistic model correctly predicts breast cancer {100*metrics.accuracy_score(y_test, y_pred):.2f}% of the time.')
The Logistic model correctly predicts breast cancer 93.86% of the time.
Wow, that worked really well! Before trying other models, let's look more closely at where this one succeeds and where it fails.
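Behind the scenes, predict applies the 0.5 threshold to the fitted probabilities. The sketch below refits the same model on the raw arrays (load_breast_cancer with return_X_y=True, an equivalent shortcut to the dataframe above) and reproduces the predictions by hand; prob_one and manual_pred are names introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=10)

logit = LogisticRegression(random_state=0, solver='liblinear')
logit.fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated Prob(Y=1|X)
prob_one = logit.predict_proba(X_test)[:, 1]

# Applying the 0.5 threshold by hand should reproduce predict()
manual_pred = (prob_one >= 0.5).astype(int)
print(np.array_equal(manual_pred, logit.predict(X_test)))
```

Keeping the probabilities around is useful because a threshold other than 0.5 can be applied to them when one kind of error is costlier than the other.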
1.3 Testing Results
Confusion Matrix
We can evaluate how well the model is doing by decomposing the successes and failures using a tool called a "confusion matrix." A confusion matrix is a table which we use to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
Let's look at the confusion matrix for our logit model:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(7,7))
sns.set(font_scale=1.4) # for label size
sns.heatmap(cm, ax=ax,annot=True, annot_kws={"size": 16}) # font size
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()
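Diagonal entries count the cases where the model was successful. As for errors (these counts follow from the support and recall figures in the classification report for this test set):
- 2 times the model predicted 1 when in fact the true value was 0 ("False Positive" or "Type I Error")
- 5 times the model predicted 0 when in fact the true value was 1 ("False Negative" or "Type II Error")
Keep in mind that scikit-learn codes this dataset as 0 = malignant and 1 = benign, so a false positive here is a malignant tumor that the model predicted to be benign.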
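To read the four cells without a plot, the matrix can be unpacked directly. A small sketch with made-up labels (y_true and y_pred here are toy lists, not the cancer data):

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
```

The same unpacking works on the cancer confusion matrix because it is also a 2x2 array with actual values in rows and predictions in columns.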
We can convert the confusion matrix into useful statistics:
- Precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative (i.e., avoid Type I errors).
- Recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
- F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
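These formulas are easy to check by hand. The sketch below uses the counts for class 1 on this test set (tp=70, fp=2, fn=5, backed out from the support and recall figures in the classification report); f_beta is a helper written here, not a library call.

```python
# Counts for class 1, backed out from the classification report for this test set
tp, fp, fn = 70, 2, 5

precision = tp / (tp + fp)
recall = tp / (tp + fn)

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives the F1 score."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

f1 = f_beta(precision, recall)
print(f'precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}')
```

Rounded to two decimals these come out to 0.97, 0.93 and 0.95, matching the row for class 1 in the report. Setting beta above 1 weights recall more heavily, and below 1 weights precision more heavily.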
We can access all of these statistics in the classification_report.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
| | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.88 | 0.95 | 0.91 | 39 |
1 | 0.97 | 0.93 | 0.95 | 75 |
accuracy | | | 0.94 | 114 |
macro avg | 0.93 | 0.94 | 0.93 | 114 |
weighted avg | 0.94 | 0.94 | 0.94 | 114 |
The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner). Area under the ROC Curve (AUC) is a way of translating the figure into a statistic where bigger equals better: A model whose predictions are 100% wrong has an AUC of 0 while a model whose predictions are 100% correct has an AUC of 1.
As you will see, this classifier does better than random guesses.
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Use predicted probabilities (not hard 0/1 predictions) so the AUC matches the plotted curve
y_prob = logit.predict_proba(X_test)[:,1]
logit_roc_auc = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
fig, ax = plt.subplots(figsize=(9,9))
ax.plot(fpr, tpr, label='Logistic Regression (AUC = %0.2f)' % logit_roc_auc,)
ax.plot([0, 1], [0, 1],'r--')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
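The AUC endpoints described above can be verified on a tiny example. The labels and scores below are made up for illustration; one of the two positives is ranked below a negative, so three of the four positive-negative pairs are ordered correctly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # made-up predicted probabilities

# Points on the ROC curve, one per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

auc_mixed = roc_auc_score(y_true, scores)         # 3 of 4 pairs ranked correctly
auc_perfect = roc_auc_score(y_true, y_true)       # every positive ranked above every negative
auc_reversed = roc_auc_score(y_true, 1 - y_true)  # every positive ranked below every negative

print(auc_mixed, auc_perfect, auc_reversed)
```

The three values are 0.75, 1.0 and 0.0. AUC depends only on how the scores rank the observations, which is why it should be computed from probabilities rather than from 0/1 predictions.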
2. Multi-Class Classification Models
Data may be discrete and also non-binary. In this section we discuss models to deal with multiclass discrete outcomes. Note also that any of these can be used in the binary case.
In a previous lecture we used the logit model to classify flower species based on the dimensions of their petals and sepals. Let's start there and then introduce additional tools.
iris = sklearn_to_df(datasets.load_iris())
iris.rename(columns={'target':'species'},inplace=True)
iris.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
sepal length (cm) | 150.0 | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 |
sepal width (cm) | 150.0 | 3.057333 | 0.435866 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 |
petal length (cm) | 150.0 | 3.758000 | 1.765298 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
petal width (cm) | 150.0 | 1.199333 | 0.762238 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 |
species | 150.0 | 1.000000 | 0.819232 | 0.0 | 0.0 | 1.00 | 2.0 | 2.0 |
g = sns.pairplot(iris, hue='species')
2.2 Multiclass Logistic Regression
The Logistic model can be used in multiclass contexts as well. Let's fit the model again to the flower data.
Practice
- Split the data into training and testing data sets. Set test_size=0.2 and random_state=10.
X_train, X_test, y_train, y_test = train_test_split(
iris.drop('species',axis=1),
iris['species'],
test_size=0.20,
random_state=10)
- Fit the logistic model to the training data.
# Fit model based on training data
logit = LogisticRegression(random_state=0,solver='liblinear')
logit.fit(X_train, y_train)
LogisticRegression(random_state=0, solver='liblinear')
- Use the testing data to identify how well the model performs.
# making predictions on the testing set
y_pred = logit.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
print(f'The Logistic model correctly predicts flower species {100*metrics.accuracy_score(y_test, y_pred):.2f}% of the time.')
The Logistic model correctly predicts flower species 90.00% of the time.
- Describe the observations which are misclassified. Are there any trends?
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(9,9))
sns.set(font_scale=1.4) # for label size
sns.heatmap(cm, ax=ax,annot=True, annot_kws={"size": 16}) # font size
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Logit Model Confusion Matrix\n Iris Classification', fontsize=18)
plt.show()
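One way to approach the last practice question is to pull out the test rows whose predicted species differs from the actual one. This is a sketch, not the only answer: it refits the model with the default solver and max_iter=1000 (an assumption; the code above uses solver='liblinear'), so the exact rows flagged may differ from the heatmap.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)
iris['species'] = raw.target

X_train, X_test, y_train, y_test = train_test_split(
    iris.drop('species', axis=1), iris['species'],
    test_size=0.20, random_state=10)

# Default solver here (an assumption; the lecture code uses solver='liblinear')
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)
y_pred = logit.predict(X_test)

# Keep only the test rows whose predicted species differs from the actual one
errors = X_test.assign(actual=y_test, predicted=y_pred)
errors = errors[errors['actual'] != errors['predicted']]

cm = confusion_matrix(y_test, y_pred)
print(errors)
print(cm)
```

Comparing the petal and sepal measurements of the rows in errors with the pairplot above is a quick way to see whether the mistakes sit where two species overlap.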