Lecture 24: Classification [FINISHED]

Classification is a task that requires the use of machine learning algorithms to learn how to assign a class label to examples from the problem domain. We may want to classify:

email as "spam" or "not spam".
a tumor as "benign" or "malignant".
a flower as a particular species.
which car a person is likely to buy.

The first two examples are binary -- the outcome is either true or not. The last two examples are multi-class -- the outcome can take several different discrete values.

There are many different types of classification tasks that you may encounter in machine learning and specialized approaches to modeling that may be used for each. In this lecture we'll touch on some of the more popular:

Binary Classification Models
1.1 Load and Explore Data
1.2 Logistic
1.3 Testing Results

Multi-Class Classification Models
2.1 Load and Explore Data
2.2 Logistic

Process
The process for fitting a model is:

(i) Initialize the model by creating a model object.
(ii) Fit the model on training data.
(iii) Test model accuracy by comparing test data predictions with test data results.
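As a rough sketch of these three steps, the snippet below uses a synthetic dataset from `make_classification` and a logistic regression as the model. Both choices are assumptions made only for illustration; the lecture's real data comes later.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data, used here only to illustrate the workflow
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

model = LogisticRegression()                     # (i) initialize the model object
model.fit(Xs_train, ys_train)                    # (ii) fit on the training data
ys_pred = model.predict(Xs_test)                 # (iii) predict on the test data...
accuracy = accuracy_score(ys_test, ys_pred)      # ...and compare with the true labels
print(f'Test accuracy: {accuracy:.2f}')
```

Every model we use in this lecture follows the same initialize, fit, predict pattern, so swapping in a different classifier usually only changes step (i).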

Note: Each model has what are called "hyperparameters" which we could "tune" using cross-validation as in the last lecture. For expediency, we won't do that in this lecture but you should in a real-world application.
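For reference, here is a minimal sketch of what such tuning might look like, assuming we tune only the regularization strength `C` of a logistic regression on the breast cancer data used later in this lecture. The grid of candidate values and the names ending in `_cv` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X_cv, y_cv = load_breast_cancer(return_X_y=True)
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
    X_cv, y_cv, test_size=0.2, random_state=10)

# Candidate values for C (the inverse of the regularization strength)
param_grid = {'C': [0.01, 0.1, 1, 10]}

# 5-fold cross-validation on the training data only
search = GridSearchCV(LogisticRegression(solver='liblinear'),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_cv_train, y_cv_train)

print('Best C:', search.best_params_['C'])
print(f'Cross-validated accuracy: {search.best_score_:.3f}')
```

The test data is left untouched during the search, so it can still be used for an honest final accuracy check.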

Class Announcements

PS5 is on eLC. New deadline: Tuesday, November 22nd.

1. Binary Classification Models

When the outcome variable is either true (1) or not (0), we can predict the outcome with a binary classification model. Let's load some data and get started.

1.1 Load and Explore Data on Breast Cancer

sklearn comes with a bunch of pre-loaded datasets. You can see the available datasets here. Some of these we've already used in their csv form. Let's load the data on breast cancer by using the following function to transform the data into a dataframe. This isn't required to fit our models but let's stick with pandas since it's so powerful.

In [80]:

import pandas as pd
import numpy as np
from sklearn import datasets

# Load data as a dataframe
def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

cancer = sklearn_to_df(datasets.load_breast_cancer())
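# Caution: scikit-learn codes this target as 0 = malignant and 1 = benign,
# so the renamed column below equals 1 for benign tumors despite its name.
cancer.rename(columns={'target':'malignant'},inplace=True)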
cancer.head()

Out[80]:

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 31 columns

The data descriptions are here. Let's take a look at the data:

In [81]:

cancer.columns

Out[81]:

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'malignant'],
      dtype='object')

In [82]:

cancer.describe().T

Out[82]:

	count	mean	std	min	25%	50%	75%	max
mean radius	569.0	14.127292	3.524049	6.981000	11.700000	13.370000	15.780000	28.11000
mean texture	569.0	19.289649	4.301036	9.710000	16.170000	18.840000	21.800000	39.28000
mean perimeter	569.0	91.969033	24.298981	43.790000	75.170000	86.240000	104.100000	188.50000
mean area	569.0	654.889104	351.914129	143.500000	420.300000	551.100000	782.700000	2501.00000
mean smoothness	569.0	0.096360	0.014064	0.052630	0.086370	0.095870	0.105300	0.16340
mean compactness	569.0	0.104341	0.052813	0.019380	0.064920	0.092630	0.130400	0.34540
mean concavity	569.0	0.088799	0.079720	0.000000	0.029560	0.061540	0.130700	0.42680
mean concave points	569.0	0.048919	0.038803	0.000000	0.020310	0.033500	0.074000	0.20120
mean symmetry	569.0	0.181162	0.027414	0.106000	0.161900	0.179200	0.195700	0.30400
mean fractal dimension	569.0	0.062798	0.007060	0.049960	0.057700	0.061540	0.066120	0.09744
radius error	569.0	0.405172	0.277313	0.111500	0.232400	0.324200	0.478900	2.87300
texture error	569.0	1.216853	0.551648	0.360200	0.833900	1.108000	1.474000	4.88500
perimeter error	569.0	2.866059	2.021855	0.757000	1.606000	2.287000	3.357000	21.98000
area error	569.0	40.337079	45.491006	6.802000	17.850000	24.530000	45.190000	542.20000
smoothness error	569.0	0.007041	0.003003	0.001713	0.005169	0.006380	0.008146	0.03113
compactness error	569.0	0.025478	0.017908	0.002252	0.013080	0.020450	0.032450	0.13540
concavity error	569.0	0.031894	0.030186	0.000000	0.015090	0.025890	0.042050	0.39600
concave points error	569.0	0.011796	0.006170	0.000000	0.007638	0.010930	0.014710	0.05279
symmetry error	569.0	0.020542	0.008266	0.007882	0.015160	0.018730	0.023480	0.07895
fractal dimension error	569.0	0.003795	0.002646	0.000895	0.002248	0.003187	0.004558	0.02984
worst radius	569.0	16.269190	4.833242	7.930000	13.010000	14.970000	18.790000	36.04000
worst texture	569.0	25.677223	6.146258	12.020000	21.080000	25.410000	29.720000	49.54000
worst perimeter	569.0	107.261213	33.602542	50.410000	84.110000	97.660000	125.400000	251.20000
worst area	569.0	880.583128	569.356993	185.200000	515.300000	686.500000	1084.000000	4254.00000
worst smoothness	569.0	0.132369	0.022832	0.071170	0.116600	0.131300	0.146000	0.22260
worst compactness	569.0	0.254265	0.157336	0.027290	0.147200	0.211900	0.339100	1.05800
worst concavity	569.0	0.272188	0.208624	0.000000	0.114500	0.226700	0.382900	1.25200
worst concave points	569.0	0.114606	0.065732	0.000000	0.064930	0.099930	0.161400	0.29100
worst symmetry	569.0	0.290076	0.061867	0.156500	0.250400	0.282200	0.317900	0.66380
worst fractal dimension	569.0	0.083946	0.018061	0.055040	0.071460	0.080040	0.092080	0.20750
malignant	569.0	0.627417	0.483918	0.000000	0.000000	1.000000	1.000000	1.00000
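Because the target column was renamed by hand, it is worth checking how scikit-learn itself codes the labels. The sketch below reloads the raw dataset (the name `raw` is just for illustration) and prints its `target_names` along with the number of observations per code. In the scikit-learn releases we are aware of, code 0 is `malignant` and code 1 is `benign`, which would mean the mean of about 0.63 in the last row above is the benign share.

```python
import numpy as np
from sklearn import datasets

raw = datasets.load_breast_cancer()

# target_names[i] is the label that the integer code i stands for
print(raw.target_names)

# Number of observations for each integer code
codes, counts = np.unique(raw.target, return_counts=True)
for code, count in zip(codes, counts):
    print(f'{code} = {raw.target_names[code]}: {count} observations')
```

If the labels come out this way, "positive" (1) in everything that follows means benign, which matters when interpreting false positives and false negatives.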

In [83]:

import matplotlib.pyplot as plt
import seaborn as sns

# g = sns.pairplot(cancer, hue='malignant')  # Run this on your own. It takes a while...

Split the Data into Training and Testing Data Sets

In [84]:

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
                                                    cancer.drop('malignant',axis=1), 
                                                    cancer['malignant'], 
                                                    test_size=0.20,
                                                    random_state=10)
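A quick sanity check on the split can be useful before fitting anything. The sketch below repeats the same split on a fresh copy of the data (the names ending in `_chk` are hypothetical, chosen so they don't overwrite the lecture's variables) and reports the size of each piece and the share of observations coded 1.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_chk, y_chk = datasets.load_breast_cancer(return_X_y=True)

X_chk_train, X_chk_test, y_chk_train, y_chk_test = train_test_split(
    X_chk, y_chk, test_size=0.20, random_state=10)

print(f'Training rows: {len(X_chk_train)}, testing rows: {len(X_chk_test)}')
print(f'Share coded 1 in training data: {y_chk_train.mean():.3f}')
print(f'Share coded 1 in testing data:  {y_chk_test.mean():.3f}')
```

If the two shares were very different, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both pieces.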

1.2 Logistic Regression

The logistic regression passes the linear model through a non-linear function that constrains the output to lie between zero and one. In the logistic case, the function looks like

$$\text{Prob}(Y=1|X) = \frac{\exp \left({\beta_0+\beta_1 X}\right)}{1+\exp \left({\beta_0+\beta_1 X}\right)}.$$

The classifier passes the inputs through the prediction function above, which returns a probability score between 0 and 1, and then assigns a class based on that probability. We're currently considering the binary case where Y=1 or Y=0. We choose a threshold: when the probability is at or above the threshold we classify the observation into Class One (Y=1), and when it falls below the threshold we classify it into Class Two (Y=0). Here, we predict the outcome variable is true whenever $\text{Prob}(Y=1|X) \ge 0.5$. Graphically, this amounts to:

In [85]:

X = np.arange(-8, 8, 0.1)

# Determine Y
Y = 1/(1+np.exp(-X))

# Create Figure
fig, ax = plt.subplots(figsize=(15,8))

ax.axhline(y=0.5, color='red',linewidth=1,ls='--')

ax.annotate('Class One: Y=1',xy=(-6,.6),va='center',ha='left',size=18)
ax.annotate('Class Two: Y=0',xy=(-6,.4),va='center',ha='left',size=18)
ax.axhspan(.5, 1, alpha=0.2, color='blue')

ax.plot(X,Y, color = 'black')

ax.set_ylim(0,1)
ax.set_yticks(np.arange(0, 1.01, step=0.1))
ax.set_xlim(-8,8)
ax.set_xlabel('Independent Variable (X)',size=14)
ax.set_ylabel('Dependent Variable (Y)',size=14)
ax.set_title('Logistic Function and Decision Rule',size=20)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()
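The decision rule in the figure can also be checked numerically. The sketch below evaluates the logistic function at a few points and applies the 0.5 threshold; the coefficients `beta_0` and `beta_1` are hypothetical values chosen only for illustration. Note that $1/(1+e^{-z})$ is the same function as $e^{z}/(1+e^{z})$ in the formula above.

```python
import numpy as np

def logistic(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = 0.0, 1.0

x_pts = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
prob = logistic(beta_0 + beta_1 * x_pts)   # Prob(Y=1|X)
y_hat = (prob >= 0.5).astype(int)          # decision rule with a 0.5 threshold

for xi, pi, yi in zip(x_pts, prob, y_hat):
    print(f'x = {xi:5.1f}   Prob(Y=1|x) = {pi:.3f}   predicted class = {yi}')
```

With these coefficients the predicted class switches from 0 to 1 exactly where $\beta_0+\beta_1 X$ crosses zero, since that is where the probability equals 0.5.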

When we fit the model to the training data, the algorithm chooses $\beta$ such that:

$$ \hat\beta = \arg\max_\beta \sum_i \Big[y_i\log\big(f(x_i,\beta)\big) + (1-y_i)\log\big(1-f(x_i,\beta)\big)\Big] $$

where

$$ f(x_i,\beta)=\frac{\exp \left({\beta_0+\beta_1 x_i}\right)}{1+\exp \left({\beta_0+\beta_1 x_i}\right)} $$

What's happening here? In the first equation, the optimal $\beta$ makes the predictions from $f(X,\beta)$ as consistent as possible with the observed outcomes (y).

Some things to note:

The objective function in the first equation is called a log-likelihood since it measures how likely the observed outcomes (y) are under the model's predicted probabilities.
The computer actually solves a minimization problem: put a negative sign in front of the $\sum$ and minimize instead of maximize to get what the computer is actually doing.
In the math above there is only one feature in X but of course that's just a simplification. There can be many features!
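To make the notes above concrete, here is a minimal sketch that writes out the negative log-likelihood for a model with two features and minimizes it with plain gradient descent. The simulated data, the "true" coefficients, the step size, and the number of iterations are all assumptions made for illustration. This is not how scikit-learn solves the problem internally, and scikit-learn also adds a regularization penalty by default, so its estimates will generally differ.

```python
import numpy as np

def f(X, beta):
    """Prob(Y=1|X) for every row of X; the first column of X is all ones (for beta_0)."""
    return 1 / (1 + np.exp(-X @ beta))

def neg_log_likelihood(beta, X, y):
    """Negative of the log-likelihood in the first equation above."""
    p = f(X, beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simulated sample with two features (illustrative data only)
rng = np.random.default_rng(0)
X_sim = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([-0.5, 1.0, -2.0])
y_sim = (rng.uniform(size=200) < f(X_sim, beta_true)).astype(float)

# Gradient descent on the average negative log-likelihood
beta_hat = np.zeros(3)
for _ in range(5000):
    gradient = X_sim.T @ (f(X_sim, beta_hat) - y_sim) / len(y_sim)
    beta_hat -= 0.5 * gradient

print('Estimated beta:', np.round(beta_hat, 2))
print(f'Negative log-likelihood at beta = 0:  {neg_log_likelihood(np.zeros(3), X_sim, y_sim):.2f}')
print(f'Negative log-likelihood at estimate: {neg_log_likelihood(beta_hat, X_sim, y_sim):.2f}')
```

The second number printed should be smaller than the first: the fitted coefficients make the observed outcomes more likely than a model that assigns probability 0.5 to everything.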

Using Logistic Regression to Diagnose Cancer

Let's see how well the logistic model diagnoses breast cancer.

In [86]:

from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Fit model based on training data
logit = LogisticRegression(random_state=0,solver='liblinear')
logit.fit(X_train, y_train)

# making predictions on the testing set
y_pred = logit.predict(X_test)
  
# Model Accuracy: how often is the classifier correct?
print(f'The Logistic model correctly predicts breast cancer {100*metrics.accuracy_score(y_test, y_pred):.2f}% of the time.')

The Logistic model correctly predicts breast cancer 93.86% of the time.

Wow, that worked really well! Let's take a closer look at where the model succeeds and where it fails.

1.3 Testing Results

Confusion Matrix

We can evaluate how well the model is doing by decomposing the successes and failures using a tool called a "confusion matrix." A confusion matrix is a table which we use to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

Let's look at the confusion matrix for our logit model:

In [87]:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(7,7))

sns.set(font_scale=1.4) # for label size
sns.heatmap(cm, ax=ax,annot=True, annot_kws={"size": 16}) # font size

plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

Diagonal entries are when the model was successful. As for errors:

2 times the model predicted class 1 when in fact the true class was 0 ("False Positive" or "Type I Error")

5 times the model predicted class 0 when in fact the true class was 1 ("False Negative" or "Type II Error")
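The four cells of a binary confusion matrix can also be pulled out as numbers rather than read off the heatmap. The sketch below uses a small set of made-up labels (not the lecture's data) so the counts are easy to verify by eye.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
actual    = [0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 1, 0, 1, 1, 0, 1, 1]

cm_demo = confusion_matrix(actual, predicted)
print(cm_demo)

# Rows are actual classes, columns are predicted classes,
# so flattening the 2x2 matrix gives tn, fp, fn, tp in that order
tn, fp, fn, tp = cm_demo.ravel()
print(f'True negatives: {tn}, false positives: {fp}, '
      f'false negatives: {fn}, true positives: {tp}')
```

The same `.ravel()` call should work on the `cm` object from the cell above, since that is also a 2x2 matrix.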

We can convert the confusion matrix into useful statistics:

Precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative (i.e., avoid Type I errors).

Recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples (i.e., avoid Type II errors).

F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0. The f1-score is the case $\beta=1$, which weights precision and recall equally.

We can access all of these statistics in the classification_report.

In [88]:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.95      0.91        39
           1       0.97      0.93      0.95        75

    accuracy                           0.94       114
   macro avg       0.93      0.94      0.93       114
weighted avg       0.94      0.94      0.94       114
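The class 1 row of the report can be reproduced by hand from the definitions above. The counts in the sketch below are back-computed from the reported support and recall (an assumption on our part, not values read from the notebook), and `f_beta` is a hypothetical helper written for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Counts for class 1, back-computed from the report (assumed, not read off the figure)
tp_1, fp_1, fn_1 = 70, 2, 5

precision_1 = tp_1 / (tp_1 + fp_1)
recall_1 = tp_1 / (tp_1 + fn_1)
f1_1 = f_beta(precision_1, recall_1, beta=1.0)

print(f'precision = {precision_1:.2f}, recall = {recall_1:.2f}, f1 = {f1_1:.2f}')
```

Setting `beta` above 1 puts more weight on recall, while setting it below 1 puts more weight on precision.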

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It plots the true positive rate against the false positive rate as the classification threshold varies. The dashed line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner). Area under the ROC Curve (AUC) is a way of translating the figure into a statistic where bigger equals better: A model whose predictions are 100% wrong has an AUC of 0, a purely random classifier has an AUC of 0.5, and a model whose predictions are 100% correct has an AUC of 1.

As you will see, this classifier does better than random guesses.

In [95]:

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Use predicted probabilities (not hard 0/1 predictions) so the AUC matches the plotted curve
y_prob = logit.predict_proba(X_test)[:,1]
logit_roc_auc = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

fig, ax = plt.subplots(figsize=(9,9))

ax.plot(fpr, tpr, label='Logistic Regression (AUC = %0.2f)' % logit_roc_auc,)
ax.plot([0, 1], [0, 1],'r--')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
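One way to build intuition for the AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example (counting ties as one half). The sketch below checks this on a tiny made-up example by comparing every positive with every negative and then calling `roc_auc_score` on the same inputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities, for illustration only
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Compare every positive example with every negative example
pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
auc_by_hand = (wins + 0.5 * ties) / (len(pos) * len(neg))

print('AUC by hand:      ', auc_by_hand)
print('AUC from sklearn: ', roc_auc_score(y_true_demo, scores_demo))
```

This ranking view is also why the AUC is usually computed from predicted probabilities: hard 0/1 predictions throw away most of the ordering information.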

2. Multi-Class Classification Models

Data may be discrete and also non-binary. In this section we discuss models to deal with multiclass discrete outcomes. Note also that any of these can be used in the binary case.

In a previous lecture we used the logit model to classify flower species based on the dimensions of its petal and sepal. Let's start there and then introduce additional tools.

2.1 Load and Explore Data on Flowers

In [66]:

iris = sklearn_to_df(datasets.load_iris())
iris.rename(columns={'target':'species'},inplace=True)

iris.describe().T

Out[66]:

	count	mean	std	min	25%	50%	75%	max
sepal length (cm)	150.0	5.843333	0.828066	4.3	5.1	5.80	6.4	7.9
sepal width (cm)	150.0	3.057333	0.435866	2.0	2.8	3.00	3.3	4.4
petal length (cm)	150.0	3.758000	1.765298	1.0	1.6	4.35	5.1	6.9
petal width (cm)	150.0	1.199333	0.762238	0.1	0.3	1.30	1.8	2.5
species	150.0	1.000000	0.819232	0.0	0.0	1.00	2.0	2.0

In [67]:

g = sns.pairplot(iris, hue='species')

2.2 Multiclass Logistic Regression

The logistic model can be used in multiclass contexts as well. Let's fit the model again to the flower data.

Practice

Split the data into training and testing data sets. Set the test_size=0.2 and the random_state=10.

In [68]:

X_train, X_test, y_train, y_test = train_test_split(
                                                    iris.drop('species',axis=1), 
                                                    iris['species'], 
                                                    test_size=0.20,
                                                    random_state=10)

Fit the logistic model to the training data.

In [69]:

# Fit model based on training data
logit = LogisticRegression(random_state=0,solver='liblinear')
logit.fit(X_train, y_train)

Out[69]:

LogisticRegression(random_state=0, solver='liblinear')

Use the testing data to identify how well the model performs.

In [70]:

# making predictions on the testing set
y_pred = logit.predict(X_test)
  
# comparing actual response values (y_test) with predicted response values (y_pred)
print(f'The Logistic model correctly predicts flower species {100*metrics.accuracy_score(y_test, y_pred):.2f}% of the time.')

The Logistic model correctly predicts flower species 90.00% of the time.

Describe the observations which are misclassified. Are there any trends?

In [71]:

cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(9,9))

sns.set(font_scale=1.4) # for label size
sns.heatmap(cm, ax=ax,annot=True, annot_kws={"size": 16}) # font size

plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Logit Model Confusion Matrix\n Iris Classification', fontsize=18)
plt.show()
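One way to approach the last question is to list the test observations where the prediction and the true species disagree. The sketch below is self-contained and makes one change worth flagging: it wraps the logistic model in `OneVsRestClassifier`, fitting one binary model per species, rather than passing three classes to the `liblinear` solver directly. That is our assumption about how to make the one-vs-rest approach explicit, so the results may not match the cells above exactly. The names `flowers`, `ovr`, and `results` are hypothetical.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris_raw = datasets.load_iris()
flowers = pd.DataFrame(iris_raw.data, columns=iris_raw.feature_names)
flowers['species'] = iris_raw.target

Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    flowers.drop('species', axis=1), flowers['species'],
    test_size=0.20, random_state=10)

# One binary logistic model per species (one-vs-rest); an assumption of this sketch
ovr = OneVsRestClassifier(LogisticRegression(random_state=0, solver='liblinear'))
ovr.fit(Xf_train, yf_train)
yf_pred = ovr.predict(Xf_test)

# Keep only the test rows where the predicted and true species disagree
results = Xf_test.copy()
results['actual'] = iris_raw.target_names[yf_test.to_numpy()]
results['predicted'] = iris_raw.target_names[yf_pred]
misclassified = results[results['actual'] != results['predicted']]
print(misclassified)
```

Looking at which pairs of species get confused, and at the petal and sepal measurements of those rows, is a good starting point for spotting trends.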