Lecture 5: Decision Trees [SOLUTIONS]¶
Decision Trees are powerful models because they mimic human decision-making, which makes them easy to understand. Plus, we'll see that the fitted model can be visualized as a flow chart, which makes it simple to see the strengths and weaknesses of the model.
The agenda for today's lecture is as follows:
1. Decision Trees (top)¶
Like the penalized linear (e.g., LASSO) and classification (e.g., logistic) models we tackled last week, a Decision Tree is a supervised machine learning algorithm, which is to say we're fitting a model to make a prediction. Decision Trees can be used for both classification and regression, which makes them a jack-of-all-trades.
A Tree?¶
The decision tree is like a tree with branches that connect at "nodes." Each node splits the data into branches, and the model keeps creating nodes until it reaches some threshold of "goodness of fit" where further splits don't help with prediction. A decision tree therefore consists of a root node, child nodes, and leaf nodes.
An Example¶
Let's understand the decision tree method by considering a real-life scenario.
Imagine that you play football every Sunday and you always invite your friend to come play with you. Sometimes your friend actually comes and sometimes he doesn't. Whether or not he comes depends on numerous things, like weather, temperature, wind, and fatigue. We start to take all of these features into consideration and begin tracking them alongside your friend's decision to come play or not.
Decision Trees contain the following elements:
Root: This is the node where the first split takes place.
Nodes: These are the points where the tree splits according to the value of some attribute/feature of the dataset.
Edges: These direct the outcome of a split to the next node. In the figure above, there are nodes for features like outlook, humidity, and windy, and there is an edge for each potential value of each of those attributes/features.
Leaves: These are the terminal nodes that predict the outcome of the decision tree (a short code sketch of such a tree follows this list).
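To make the pieces concrete, here is a minimal sketch of how the weather tree above might read as nested if/else rules. The split values and their ordering are purely illustrative, not a fitted model.

def will_friend_play(outlook, humidity, windy):
    # Root node: the first split is on outlook; each branch below is an edge.
    if outlook == 'sunny':
        # Child node: split on humidity.
        return 'no' if humidity == 'high' else 'yes'   # leaves
    elif outlook == 'rainy':
        # Child node: split on windy.
        return 'no' if windy else 'yes'                # leaves
    else:  # overcast
        return 'yes'                                   # leaf

print(will_friend_play('sunny', 'normal', False))      # 'yes'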
2. How do we build a decision tree? (top)¶
A. At a given node.¶
Select the best feature to split the records on using an Attribute Selection Measure (ASM).
- Information Gain: How much information a feature/attribute provides us. A decision tree always tries to maximize information gain, so the node/attribute with the highest information gain is split first. The information gain of X on Y is: $$ \text{IG}(Y,X) = E(Y) - E(Y|X) $$ where $E(Y)$ is known as entropy, which captures the measure of disorder or uncertainty that we want to reduce. Entropy is defined as $$ E(S) = -\sum_{i=1}^c p_i\log_2(p_i) $$ where $p_i$ is the probability of class $i$ in our data. In our example, the classes are simply success (friend shows up) and failure (friend does not show up), so if we had a total of 100 data points in our dataset with 30 belonging to the positive class and 70 belonging to the negative class, then "P+" would be 3/10, "P-" would be 7/10, and the entropy measure would be $$ -\frac{3}{10}\times \log_2\big(\frac{3}{10}\big)-\frac{7}{10}\times \log_2\big(\frac{7}{10}\big)\approx 0.88 $$ Higher levels of entropy mean more disorder; entropy close to zero is something we would like because there is little randomness. Consider the following picture:
The x-axis measures the proportion of data points belonging to the positive class in each bubble and the y-axis measures their respective entropies. The inverted "U" shape of the graph (why we don't call this a hill in math defies logic, btw) communicates that entropy is lowest at the extremes -- when the bubble contains either no positive instances or only positive instances. At those points, the data is predictable / ordered. Entropy is highest in the middle, when the bubble is evenly split between positive and negative instances; i.e., chaos reigns supreme at entropy equal to 1.
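As a quick check of the arithmetic above, here is a minimal sketch of the entropy calculation for the 30/70 example (the entropy helper below is just for illustration):

import numpy as np
import pandas as pd

def entropy(labels):
    # Empirical entropy in bits: -sum_i p_i * log2(p_i)
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

# 30 positive and 70 negative observations, as in the example above
print(entropy(['+'] * 30 + ['-'] * 70))   # approximately 0.88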
An Example¶
Suppose we wanted to see how informative credit rating is about the likelihood of default. The entropy score for default in our hypothetical data is 1, so there's a lot of chaos in the raw data. Conditioning on credit rating and solving for entropy yields an entropy score of $E(\text{default}|\text{rating})=0.625$. The information gain is then
$$ \begin{align*} \text{IG}(Y,X) &= E(Y) - E(Y|X)\\ \equiv \text{IG}(\text{default},\text{credit rating}) &= E(\text{default}) - E(\text{default}|\text{credit rating})\\ \Rightarrow \text{IG}(\text{default},\text{credit rating}) &= 1 - 0.625 = 0.375 \end{align*} $$ So including "credit rating" gave us information to better predict default. At each node, the decision tree algorithm looks at all of the potential features and chooses the one with the highest information gain; i.e., it chooses the most predictive feature.
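Here is a minimal sketch of the same calculation in code, using made-up default/rating data (the labels and the information_gain helper are illustrative, so the numbers won't match the 0.375 above; the entropy helper is restated so the snippet stands alone):

import numpy as np
import pandas as pd

def entropy(labels):
    # Empirical entropy in bits
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(y, x):
    # IG(Y, X) = E(Y) - E(Y|X), where E(Y|X) is the entropy of Y within each
    # value of X, weighted by how often that value occurs
    y, x = pd.Series(list(y)), pd.Series(list(x))
    weights = x.value_counts(normalize=True)
    cond_entropy = sum(w * entropy(y[x == value]) for value, w in weights.items())
    return entropy(y) - cond_entropy

# Toy data: default is a 50/50 split (entropy = 1), rating is partially informative
default = ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no']
rating  = ['low', 'low', 'low', 'high', 'low', 'high', 'high', 'high']
print(information_gain(default, rating))   # roughly 0.19 for this toy split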
B. Outer Loop¶
Build the tree by repeating this process recursively for each child until one of the following conditions is met (see the sketch after this list):
All tuples belong to the same attribute value (i.e., the node is pure).
There are no more attributes remaining.
There are no more instances remaining.
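Here is a minimal ID3-style sketch of that outer loop, reusing the information_gain helper from the sketch above (this illustrates the recursion and stopping rules, not sklearn's actual implementation):

import pandas as pd

def build_tree(rows, target, features):
    """rows: a DataFrame; target: the label column name; features: candidate columns."""
    labels = rows[target]
    # Stop: all remaining rows share the same label (a pure node) -> leaf
    if labels.nunique() == 1:
        return labels.iloc[0]
    # Stop: no attributes or no instances remaining -> predict the majority label
    if not features or rows.empty:
        return None if rows.empty else labels.mode().iloc[0]
    # Otherwise split on the feature with the highest information gain
    best = max(features, key=lambda f: information_gain(labels, rows[f]))
    remaining = [f for f in features if f != best]
    # One branch (edge) per observed value of the chosen feature
    branches = {value: build_tree(subset, target, remaining)
                for value, subset in rows.groupby(best)}
    return (best, branches)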
Another Way to Prioritize Features¶
Another simple way to evaluate how effective a feature is at predicting an outcome is the Gini Index:
$$ \text{Gini Index} = 1 - \sum_{i=1}^cp_i^2 $$ where $p_i$ is the probability of class $i$ within the split. A feature whose splits have a low Gini Index is preferred. Note that you've probably heard of a Gini Index before as a popular measure of inequality; the impurity measure used here shares the name but is a different formula.
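A minimal sketch of the Gini calculation, reusing the 30/70 example from the entropy discussion (the gini_index helper name is just for illustration):

import pandas as pd

def gini_index(labels):
    # Gini = 1 - sum_i p_i^2: 0 for a perfectly pure node, larger when classes mix
    p = pd.Series(labels).value_counts(normalize=True)
    return float(1 - (p ** 2).sum())

# 30 positive and 70 negative observations: 1 - (0.3**2 + 0.7**2) = 0.42
print(gini_index(['+'] * 30 + ['-'] * 70))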
We know a lot about flowers already so let's work on developing a decision tree model on our iris dataset.
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data as a dataframe
def sklearn_to_df(sklearn_dataset):
df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
df['target'] = pd.Series(sklearn_dataset.target)
return df
iris = sklearn_to_df(datasets.load_iris())
iris.rename(columns={'target':'species'},inplace=True)
iris.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sepal length (cm) | 150.0 | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 |
| sepal width (cm) | 150.0 | 3.057333 | 0.435866 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 |
| petal length (cm) | 150.0 | 3.758000 | 1.765298 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| petal width (cm) | 150.0 | 1.199333 | 0.762238 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 |
| species | 150.0 | 1.000000 | 0.819232 | 0.0 | 0.0 | 1.00 | 2.0 | 2.0 |
Always good to do some pictures to understand the variation. Of course, none of this is new to us so I threw in some optional arguments to change the picture just to spice things up. You're welcome.
sns.pairplot(iris,hue='species',palette='Set1')
Now off to fitting the model. As usual, let's split the data so we can evaluate how well we do:
from sklearn.model_selection import train_test_split
x = iris.drop('species', axis = 1)
y = iris['species']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 10)
And now we're ready to fit a decision tree model using entropy / "information gain" as our splitting criterion (versus the Gini index).
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy') # the alternative is 'gini', which scores splits with Gini impurity instead of entropy
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
Let's see how we did by looking at the classification report. Recall the key components of the classification report from last lecture:
- Precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative (i.e., avoid Type 1 errors).
- Recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
- The F-beta score can be interpreted as a weighted harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. The f1-score in the report is the special case with beta = 1 (we'll check these formulas by hand below).
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       0.94      1.00      0.97        17
           2       1.00      0.93      0.96        14

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
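The per-class numbers above come straight from the tp/fp/fn counts defined in the list. A minimal hand check, pulling the counts from sklearn's confusion matrix (rows are actuals, columns are predictions):

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
for k in range(cm.shape[0]):
    tp = cm[k, k]                  # correctly predicted class k
    fp = cm[:, k].sum() - tp       # predicted k but actually something else
    fn = cm[k, :].sum() - tp       # actually k but predicted something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f'class {k}: precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}')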
And of course we can visualize the Type 1 (false positives) and Type 2 (false negatives) errors directly via the confusion matrix.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(7,7))
sns.set(font_scale=1.4) # for label size
sns.heatmap(cm, ax=ax,annot=True, annot_kws={"size": 16}) # font size
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()
And now here is a cool new thing -- we can see the fitted decision tree directly!
from sklearn import tree
fig, ax = plt.subplots(figsize=(25,20))
tree.plot_tree(model,
feature_names=iris.drop('species', axis = 1).columns.values.tolist(),
class_names=[str(c) for c in model.classes_],  # one label per class; a bare string would be split into characters
filled=True)
plt.show()
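If the plot is hard to read at this size, sklearn can also render the same fitted tree as plain text (no refitting needed); a quick sketch:

from sklearn.tree import export_text

# Text rendering of the fitted splits; feature_names should match the training columns
print(export_text(model, feature_names=x.columns.tolist()))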
Practice: Kyphosis Correction in Children¶
The word kyphosis refers to a spinal curve that reflects an aberrantly rounded back, which might happen as a result of distress, contagious disease, developmental anomalies, inherited conditions, and sometimes iatrogenic disease. Kyphosis can occur in people of all ages and is commonly found in adolescents and children affected by malnutrition; cases are at times very difficult to predict or confirm because they come in different postures and sizes. Our job is to develop a decision tree model to predict whether surgery will be successful based on the characteristics of each case.
- Load kyphosis.csv. The kyphosis dataset contains 81 children who have had corrective spinal surgery and includes the following variables:
Kyphosis: Was the surgery successful? That is, was kyphosis "present" or "absent" after the operation.
Age: Age of the patient in months. Remember, these are children.
Number: Number of vertebrae involved in the operation.
Start: Number of the first or topmost vertebra that was operated on. I think this means that the surgery starts at the top and goes down, but that's only a guess.
df = pd.read_csv('./Data/kyphosis.csv')
print(df.shape)
df.head()
(81, 4)
| | Kyphosis | Age | Number | Start |
|---|---|---|---|---|
| 0 | absent | 71 | 3 | 5 |
| 1 | absent | 158 | 3 | 14 |
| 2 | present | 128 | 4 | 5 |
| 3 | absent | 2 | 5 | 1 |
| 4 | absent | 1 | 4 | 15 |
- Use train_test_split to separate the data into training and testing sets for the X and Y variables. Set the testing sample to 20% of the total and the random_state to ten. Remember, the outcome variable we want to predict is the "Kyphosis" variable.
x_train, x_test, y_train, y_test = train_test_split(df.drop('Kyphosis', axis = 1), df['Kyphosis'], test_size=0.2, random_state = 10)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
(64, 3) (17, 3)
(64,) (17,)
- Fit a Decision Tree model to the training data using the "entropy" criterion.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy') # the alternative is 'gini', which scores splits with Gini impurity instead of entropy
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
- Print out the fitted decision tree.
fig, ax = plt.subplots(figsize=(25,30))
tree.plot_tree(model,
feature_names=df.drop('Kyphosis', axis = 1).columns.values.tolist(),
class_names=[str(c) for c in model.classes_],  # class labels ('absent'/'present'), not the column name
filled=True)
plt.show()
- Use the testing data to evaluate the accuracy of the model.
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
y_pred = model.predict(x_test)
print('The accuracy of the Decision Tree is {0:4.2f}.'.format(metrics.accuracy_score(y_test, y_pred)))
The accuracy of the Decision Tree is 0.76.
- Discuss the prevalence of false positives and negatives.
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

      absent       0.82      0.82      0.82        11
     present       0.67      0.67      0.67         6

    accuracy                           0.76        17
   macro avg       0.74      0.74      0.74        17
weighted avg       0.76      0.76      0.76        17
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(7,7))
sns.set(font_scale=1.4) # for label size
sns.heatmap(cm, ax=ax,annot=True, annot_kws={"size": 16}) # font size
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()