Lecture 6: Random Forests [MY SOLUTIONS]
Random forests is an ML algorithm that can be used for both classification and regression. It is also very flexible and easy to use. Consequently, you will see it used in a variety of applications: e.g., recommending movies, classifying images, classifying loan applicants, identifying fraudulent activity, and predicting diseases.
What exactly is it? We know that a forest is composed of trees, and it is said that the more trees a forest has, the more robust it is. "Random Forests" combines the predictions of decision trees -- something we now understand -- using a simple voting rule. The genius of the algorithm is that these decision trees are generated from random subsets of the training data (both observations and features). Thus, a random forest model has the idea of cross-validation embedded in the algorithm. These models also provide a pretty good indicator of feature (variable) importance, so they pull back the curtain on how the algorithm uses the data variation to make predictions.
The agenda for today's lecture is as follows:
We're always worried that the models we're developing are based on "bad" data. Recall that we overcame this in the past using cross-validation, where implicitly we assumed the true data-generating process was constant so we could sample from the data several times to build the most efficient model (i.e., balance the bias-variance tradeoff). The low correlation between samples is the key. Further, this isn't a new idea: mutual funds are composed of investments with low correlations (stocks and bonds) to generate an object greater than the sum of its parts.
Random Forests apply this idea to the area of decision trees. The fundamental concept behind random forest is a simple but powerful one — the wisdom of crowds. In data science speak, the reason that the random forest model works so well is: a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. Note that this is the same idea as in cross-validation, where the sampling was uncorrelated by construction.
In Random Forests, the uncorrelated models produce "ensemble" predictions that are more accurate than any of the individual predictions. The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don't constantly all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction.
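To see the arithmetic behind this claim, here is a small simulation of my own (not part of the lecture's code). It assumes each hypothetical tree is an independent classifier that is right 60% of the time; the majority vote of 101 such trees is right far more often than any single tree.

import numpy as np

# Hypothetical setup: B independent "trees", each correct with probability p
rng = np.random.default_rng(0)
p, B, n_obs = 0.6, 101, 10_000

# votes[i, b] = True if tree b gets observation i right
votes = rng.random((n_obs, B)) < p

single_tree_acc = votes[:, 0].mean()                   # roughly 0.60
committee_acc = (votes.sum(axis=1) > B / 2).mean()     # much closer to 1.0

print(f'one tree: {single_tree_acc:.2f}, majority vote of {B} trees: {committee_acc:.2f}')

With correlated trees the improvement is smaller, which is exactly why the algorithm works so hard to keep the trees uncorrelated.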
1.1 A Simple Decision-making Example
Suppose you want to go on a trip and you would like to travel to a place which you will enjoy.
So what do you do to find a place that you will like? You can search online, read reviews on travel blogs and portals, or you can also ask your friends.
You decide to ask your friends and get recommendations from every friend. Now you have to make a list of those recommended places. Then, you ask them to vote (or select one best place for the trip) from the list of recommended places you made. The place with the highest number of votes will be your final choice for the trip.
There are two parts in the above decision process:
- You ask your friends about their individual travel experience to get a recommendation out of the multiple places they have visited. This part is like using the decision tree algorithm. Here, each friend makes a selection of the places he or she has visited so far.
- You collect the recommendations and ask them to vote. You select the place which gets the most votes.
This whole process of getting recommendations from friends and voting on them to find the best place is known as the random forests algorithm.
1.2 Randomization
We already know about decision trees so we just need to think about how these trees are randomized.
There are two components of randomization:
Bagging (Bootstrap Aggregation) Decision trees are very sensitive to the data they are trained on — small changes to the training set can result in significantly different tree structures. Random forest takes advantage of this by allowing each individual tree to randomly sample from the dataset with replacement, resulting in different trees. Notice that with bagging we are not subsetting the training data into smaller chunks and training each tree on a different chunk. Rather, if we have a sample of size N, we are still feeding each tree a training set of size N.
Feature Randomness In a normal decision tree, when it is time to split a node, we consider every possible feature and pick the one that produces the most separation between the observations in the left node vs. those in the right node (i.e., "Information Gain"). Each tree in a random forest, however, can pick only from a random subset of features so we have randomness in the features as well. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification.
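Here is a tiny sketch of what each source of randomness looks like in code. The toy dimensions (150 observations, 4 features, 2 features per split) are my own choices for illustration, not from the lecture data.

import numpy as np

rng = np.random.default_rng(0)
N, n_features = 150, 4    # pretend we have 150 observations and 4 features

# Bagging: draw N row indices WITH replacement -- the bootstrap sample is still
# of size N, but only about 63% of the original rows appear in it
boot_idx = rng.choice(N, size=N, replace=True)
print('unique rows in the bootstrap sample:', np.unique(boot_idx).size)

# Feature randomness: at each split the tree may only consider a random subset
# of the features (scikit-learn exposes this through the max_features argument)
split_features = rng.choice(n_features, size=2, replace=False)
print('features considered at this split:', split_features)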
1.3 The Actual Algorithm
Okay, so how does this actually work? The algorithm works as follows (a minimal hand-rolled sketch follows the list):
- Generate $B$ sub-samples from training data $X$ and $Y$; call these $X_b$, $Y_b$.
- Sample (with replacement) to generate N observations
- Randomize over features
- Train decision tree $f_b$ on $X_b$, $Y_b$. Do this for each sub-sample so we now have $B$ decision trees.
- Generate predictions using the test data $X^\prime$. How? Run $X^\prime$ through each of the $B$ trees and collect each tree's prediction. Create an aggregate prediction by averaging these $B$ predictions: $$ \hat {f}={\frac {1}{B}}\sum _{b=1}^{B}f_{b}(x') $$ or by taking the majority vote in the case of classification trees where the outcome is a discrete variable (e.g., the person either survives the sinking of the Titanic or does not).
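To make these steps concrete, here is a minimal hand-rolled version of the algorithm. It uses scikit-learn's DecisionTreeClassifier as the base learner and the library's built-in iris loader so the snippet is self-contained; the choices of B = 25 trees and max_features='sqrt' are mine, purely for illustration. In practice we just call RandomForestClassifier, as we do below.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
B, N = 25, X_train.shape[0]

trees = []
for b in range(B):
    idx = rng.choice(N, size=N, replace=True)                           # bootstrap sample (X_b, Y_b)
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=b)  # feature randomness at each split
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Each tree votes; the forest predicts the majority class
votes = np.stack([tree.predict(X_test) for tree in trees])   # shape (B, n_test)
forest_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print('hand-rolled forest accuracy:', (forest_pred == y_test).mean())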
For the visual learners out there:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the iris data and plot the pairwise feature relationships by species
iris = pd.read_csv('./Data/iris.csv')
g = sns.pairplot(iris, hue='species')
And we'll need to separate the data into training and testing data...
from sklearn.model_selection import train_test_split
X=iris[['sepal length', 'sepal width', 'petal length', 'petal width']] # Features
y=iris['species'] # Labels
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test
from sklearn.ensemble import RandomForestClassifier
# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the model on the training set, then predict on the test set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
After training, check the accuracy using actual and predicted values.
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print('The accuracy of the Random Forest is {0:4.2f}.'.format(metrics.accuracy_score(y_test, y_pred)))
The accuracy of the Random Forest is 0.96.
Which Features Are Most Important?
We can use the algorithm to observe the relative importance of the different features. To do this,
- Create a random forests model.
- Use the feature_importances_ attribute to see the feature importance scores.
- Visualize the feature scores.
We already have done (1) so let's proceed to steps (2) and (3).
# Recover feature importance scores
feature_imp = pd.Series(clf.feature_importances_,index=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']).sort_values(ascending=False)
# So what do we have?
feature_imp
petal length (cm)    0.474230
petal width (cm)     0.397897
sepal length (cm)    0.097186
sepal width (cm)     0.030687
dtype: float64
fig, ax = plt.subplots(figsize=(10,6))
sns.barplot(x=feature_imp, y=feature_imp.index,ax=ax)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
sns.despine()
plt.show()
Looks like we can remove the "sepal width" feature because it has very low importance. Let's re-estimate the model with the three remaining features and compare our predictions.
# Split dataset into features and labels
X = iris[['petal length', 'petal width', 'sepal length']] # Removed feature "sepal width"
y = iris['species']

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.70, random_state=5) # 30% training and 70% test

# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the model on the training set
clf.fit(X_train, y_train)

# Prediction on the test set
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print('The accuracy of the Random Forest is {0:4.2f}.'.format(metrics.accuracy_score(y_test, y_pred)))
The accuracy of the Random Forest is 0.95.
3. Predicting Survivors on the Titanic
Another famous ML data set relates to the sinking of the Titanic in 1912 and was featured in the Kaggle competition "Titanic: Machine Learning from Disaster."
The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. The RMS Titanic was the largest ship afloat at the time it entered service and was the second of three Olympic-class ocean liners operated by the White Star Line. The Titanic was built by the Harland and Wolff shipyard in Belfast. Thomas Andrews, her architect, died in the disaster.
The data set provides information on the fate of passengers on the Titanic, summarized according to economic status (class), sex, age, and survival. Our task is to predict whether a passenger on the Titanic survived. Here are the variables:
- PassengerId is a numerical feature that identifies the passenger.
- Survived is a binary indicator: 1 if the passenger survived and 0 otherwise.
- Pclass is the ticket class (1 = 1st (upper), 2 = 2nd (middle), 3 = 3rd (lower)).
- Name is the passenger's name.
- Sex is the passenger's gender.
- Age is the age in years.
- SibSp is the number of siblings / spouses aboard the Titanic.
- Parch is the number of parents / children aboard the Titanic.
- Ticket is the ticket number.
- Fare is the passenger fare.
- Cabin is a modified cabin number; entries with an X indicate no cabin number was available.
- Embarked is the port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton.
Let's start by loading the data. I've cleaned it up a lot since we don't want to spend a lot of time on that here.
data = pd.read_csv('./Data/titanic_train_clean.csv')
data.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | X | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | X | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | X | S |
Feature engineering
Feature engineering involves analyzing the features, extracting useful information from them, and creating new features out of existing ones. Let's start with some visualization.
fig, axx = plt.subplots(1, 3, figsize=(20,5))
axx[0].set_title('Number of Siblings/Spouses')
sns.countplot(x='SibSp', data=data, ax=axx[0])
axx[1].set_title('Number of Parents/Children')
sns.countplot(x='Parch', data=data, ax=axx[1])
axx[2].set_title('Distribution of Classes')
sns.countplot(x='Pclass', data=data, ax=axx[2])
plt.tight_layout()
According to the graphs, most people were traveling alone and most belonged to 3rd class (lower). This corresponds to what we saw earlier with the cabins and the fares: most people without an assigned cabin paid a small fare, which makes sense if they belonged to 3rd class. Based on the SibSp and Parch attributes, we can create a new feature indicating whether the person was traveling alone or with family, as well as one for the size of the family. Those attributes could be of interest.
def create_alone_feature(row):
    # 1 if the passenger had no siblings/spouses and no parents/children aboard
    return 1 if (row['SibSp'] + row['Parch']) == 0 else 0

data['Alone'] = data[['SibSp','Parch']].apply(create_alone_feature, axis=1)
data['Familiars'] = 1 + data['SibSp'] + data['Parch']  # family size, including the passenger
For classification, it's good to see the data in relation to the label.
fig, axx = plt.subplots(2, 3, figsize=(20,10))
axx[0,0].set_title('Survivors')
sns.countplot(x='Survived', data=data, ax=axx[0,0])
axx[0,1].set_title('Survivors by Sex')
sns.countplot(x='Survived', hue='Sex', data=data, ax=axx[0,1])
axx[0,2].set_title('Survivors by Pclass')
sns.countplot(x='Survived', hue='Pclass', data=data, ax=axx[0,2])
axx[1,0].set_title('Accompanied survivors')
sns.countplot(x='Survived', hue='Alone', data=data, ax=axx[1,0])
axx[1,1].set_title('Survivors by Family Size')
sns.countplot(x='Familiars', hue='Survived', data=data, ax=axx[1,1])
axx[1,2].set_title('Alone members by Pclass')
sns.countplot(x='Pclass', hue='Alone', data=data, ax=axx[1,2])
plt.tight_layout()
From the countplots we observe:
- Most people died in the incident.
- Although most of the passengers were male, survivors tended to be women.
- Non-survivors tended to have 3rd-class tickets (i.e., they were relatively poor). They were probably evacuated last and possibly were located in parts of the ship with more difficult access.
- Most of those who died were alone, and 3rd-class passengers tended to be alone.
Now let's see the age and the fare of the survivors:
fig, axx = plt.subplots(1, 3, figsize=(20,5))
axx[0].set_title('Age of Survivors')
sns.boxplot(x='Survived', y='Age', data=data, ax=axx[0])
axx[1].set_title('Survivors by Sex')
sns.boxplot(x='Survived', y='Age', hue='Sex', data=data, ax=axx[1])
axx[2].set_title('Survivors by Pclass')
sns.boxplot(x='Survived', y='Age', hue='Pclass', data=data, ax=axx[2])
plt.tight_layout()
fig, axx = plt.subplots(1, 2, figsize=(16,5))
axx[0].set_title('Distribution of Fare of Non-survivors')
sns.histplot(data[data['Survived']==0]['Fare'], ax=axx[0], bins=30, kde=True, stat='density')
axx[1].set_title('Distribution of Fare of Survivors')
sns.histplot(data[data['Survived']==1]['Fare'], ax=axx[1], bins=30, kde=True, stat='density')
plt.tight_layout()
Clearly the fares of those who survived were higher; we can see that in the two distributions. Those who survived were also a little younger than those who died. We can also see that people in 1st class were older than the rest and people in 3rd class were younger. Next, let's look at the correlations between the features. Fare and Survived have some correlation, but the correlation matrix doesn't account for categoricals, so it's better to map features like Sex and Embarked to numbers.
plt.figure(figsize=(12,8))
sns.heatmap(data.corr(), annot=True)
plt.tight_layout()
Mapping Categoricals
Machine learning algorithms work with numbers, not categories (strings), so we need a way to map these categories into numbers. This is easy to do in pandas using the replace() and map() methods.
# Map Sex to numbers: male -> 0, female -> 1
data['Sex'] = data['Sex'].replace({'female': 1, 'male': 0})

# Map the port of embarkation to numbers
categories = {"S": 1, "C": 2, "Q": 3}
data['Embarked'] = data['Embarked'].map(categories)

# Encode the (modified) cabin letters as integer category codes
data['Cabin'] = data.Cabin.astype("category").cat.codes
plt.figure(figsize=(14,8))
sns.heatmap(data.corr(), annot=True)
plt.tight_layout()
Remember, the goal is as simple a model as possible (bias-variance trade-off), so let's drop features which likely have no effect on the probability of survival; i.e., 'Ticket', 'Name', and 'PassengerId'.
# dropping columns
data = data.drop(['Name','Ticket','PassengerId'], axis=1)
data.head()
|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Alone | Familiars |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | 8 | 1 | 0 | 2 |
| 1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 | 2 | 0 | 2 |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | 8 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 2 | 1 | 0 | 2 |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | 8 | 1 | 1 | 1 |
Normalize the data
Many machine learning algorithms, such as regression-type and distance-based methods, converge faster when the data are normalized, so this is a standard step in most machine learning workflows. To do so we will use the MinMaxScaler class from scikit-learn, but first we need to drop the label. Strictly speaking, a scaler should be fit on the training data only; the cell below fits it to the full feature set for simplicity, and a sketch of the stricter train-only pattern appears after the table below. This actually isn't so important for the random forest model, but it's worthwhile to know for other algorithms which are more finicky.
from sklearn.preprocessing import MinMaxScaler
# Dropping label
y = data['Survived']
data = data.drop('Survived', axis=1) # Dropping label to normalize
scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(data)
scaled_train = pd.DataFrame(scaled_train, columns=data.columns, index=data.index)
scaled_train.head()
|   | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Alone | Familiars |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.0 | 0.271174 | 0.125 | 0.0 | 0.014151 | 1.00 | 0.0 | 0.0 | 0.1 |
| 1 | 0.0 | 1.0 | 0.472229 | 0.125 | 0.0 | 0.139136 | 0.25 | 0.5 | 0.0 | 0.1 |
| 2 | 1.0 | 1.0 | 0.321438 | 0.000 | 0.0 | 0.015469 | 1.00 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.434531 | 0.125 | 0.0 | 0.103644 | 0.25 | 0.0 | 0.0 | 0.1 |
| 4 | 1.0 | 0.0 | 0.434531 | 0.000 | 0.0 | 0.015713 | 1.00 | 0.0 | 1.0 | 0.0 |
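As promised above, here is a minimal sketch of the stricter "fit on the training data only" pattern. It reuses the data and y objects defined above; the 80/20 split and the variable names X_tr/X_te are my own choices for illustration.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_tr, X_te, y_tr, y_te = train_test_split(data, y, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn column mins/maxes from the training data only
X_te_scaled = scaler.transform(X_te)       # apply the same transformation to the test data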
Practice
Now that all of that EDA is done, let's get to work using a Random Forest model to predict who will survive the Titanic.
- Use train_test_split to separate the data into training and testing sets for the X and y variables. Set the test sample to 20% of the total and random_state to zero.
X_train, X_test, y_train, y_test = train_test_split(scaled_train, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
(712, 10) (179, 10)
(712,) (179,)
- Fit a Random Forest model to the training data. Set n_estimators to 5 and random_state to zero.
clf = RandomForestClassifier(n_estimators=5,random_state=0)
#Train the model
clf.fit(X_train, y_train)
RandomForestClassifier(n_estimators=5, random_state=0)
- Use the testing data to evaluate the accuracy of the model.
# Model Accuracy, how often is the classifier correct?
y_pred = clf.predict(X_test)
print('The accuracy of the Random Forest is {0:4.2f}.'.format(metrics.accuracy_score(y_test, y_pred)))
The accuracy of the Random Forest is 0.78.
- Visualize feature importance. Are there any features (variables) we can exclude?
feature_imp = pd.Series(clf.feature_importances_, index=scaled_train.columns).sort_values(ascending=False)
plt.figure(figsize=(10,6))
sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title('Visualizing Important Features')
sns.despine()
plt.tight_layout()
- Remove unimportant features. Your Choice.
# Removing less important features
new_train = scaled_train.drop(['Alone','Parch','Embarked','SibSp'], axis=1)
- Fit the model again using only the important features and use the testing data to evaluate the accuracy. Did the model improve?
X_train, X_test, y_train, y_test = train_test_split(new_train, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=5,random_state=0)
# Train the model on the training set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('The accuracy of the Random Forest is {0:4.2f}.'.format(metrics.accuracy_score(y_test, y_pred)))
The accuracy of the Random Forest is 0.82.
- Try increasing the number of estimators (trees) to 10. Does the model do better?
clf = RandomForestClassifier(n_estimators=10,random_state=0)
# Train the model on the training set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('The accuracy of the Random Forest is {0:4.2f}.'.format(metrics.accuracy_score(y_test, y_pred)))
The accuracy of the Random Forest is 0.85.