Lecture 23: ML with Linear Penalization (FINISHED)
This notebook continues our foray into machine learning. Our goals here are modest. We would like to:
- learn a bit about how machine learning is similar to, and different from, econometrics.
- introduce the scikit-learn package which is chock full of 'machine learning' tools.
- work on some validation methods, which are an important part of the machine learning toolkit.
- explore popular linear penalization models: the ridge and lasso regression models
We've done (1) and (2) so we'll focus on (3) and (4) in this notebook. These are methods that put discipline on the importance of the independent variables of the regression.
The agenda for today's lecture is as follows:
- Practical Set Validation
- Variable Selection with OLS
- Shrinkage Methods
- Ridge Regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Elastic Net
Class Announcements
PS5 is on eLC and is due end-of-day Sunday, November 20, 2022.
1. Reminder: The Bias-Variance Trade-off (top)¶
Our objective in ML is to fit a model
$$ y = f(X) + \epsilon.$$such that when we get new data ($X$) we can make good predictions $(\hat y)$. Doing this requires specification of $f()$. As we discussed before, the fundamental constraint in machine learning is the bias-variance tradeoff; ie, our estimated/fit model can be decomposed into
$$\begin{align*} \frac{1}{N}\sum_{n=1}^N(y_n-\hat{y}_n)^2 & = \left(E\left[\,\hat{f}(X)\right]-f(X) \right)^2 + E\,\left(\,\hat{f}(X)-E\left[\,\hat{f}(X)\right]\right)^2 + \text{var}(\epsilon)\\ &= \text{bias}^2 + \text{variance} + \text{var}(\epsilon).\\ \end{align*}$$Bias and variance trade-off¶
Note that all three terms are positive. We can't do anything about the third term, it is the irreducible error. The other two terms we would like to make as small as possible. Unfortunately, shrinking one of the two terms tends to increase the other term. Roughly speaking, there is no free lunch.
The bias term tells us, on average, how close we get to the true model. Are we systematically over- or under-predicting $y$?
The variance term tells us how much our estimates vary as we use different training datasets to estimate the model.
We would like to minimize the bias and the variance of the left-hand side -- the Mean Squared Error (mse). How can we do so?
In the linear models we are considering, complexity increases as we add more variables to $X$. This includes adding polynomials of our independent variables, interactions, etc. How does complexity influence the testing error?
As we increase the complexity of our model $f(\;)$ the squared bias tends to decrease.
As we increase the complexity of our model $f(\;)$ the variance tends to increase.
This is the tradeoff. As we add features to the model, the bias decreases, but the variance increases. This gives rise to a u-shaped mse. This figure from James et al. is a good illustration.
Overfitting¶
Behind the bias-variance tradeoff is the idea of overfitting. The more complex the model, the more it will capture variation in $y$ due to the random error ($\epsilon$).
- This makes the model fit the data better (lower bias). We are capturing $y$ behavior from both $f(\;)$ and $\epsilon$.
- This makes the model more variable. A new set of training data will have different $\epsilon$s. The estimate will change to match these new values of $\epsilon$. Since $\epsilon$ is not related to $f(\;)$, we are making the estimate noisier.
We'll take the ying and yang of this by combining the set validation approach (training and testing) we've already seen with model shrinkage.
An Example of Our Proposed Approach¶
Remember the DGP we generated last lecture. We want a process where we not only say whether the model is good or not but we instead build the model with enough complexity / flexibility while also preventing over-fitting. The idea will be to sample from the data repeatedly an keep models for which the mean squared error is low.
Graphically, this would be like building a model which matches both samples from the DGP we simulated in the previous lecture.
2. Practical "Set Validation" (top)¶
In the last lecture we'll learned about set validation:
- We break our dataset up into two subsets. Call one the training set and the other the testing set.
- We estimate our model on the training data. We measure how well our model fits the data by looking at the mean squared error of the regression or, alternatively, the r-squared, which normalizes the mse by the total squared error. We call these the training mse or the training $r^2$. This is usually what we care about when we are doing inference.
- Using the X data in the testing data, we use the estimated $f(\;)$ from step 2 to predict value for y
- We then compare the predicted ($\hat{y}$) values with the actual ($y$) values in the training data. Again, we can use mse or r-squared to judge how well our predicted data matched the actual data. Call this the test mse or test $r^2$. This is what we usually care about when we are interested in prediction.
You should have felt a bit unsatisfied with this approach. Yes, it enables us to ask: "Is my model good?" but "good" is not well-defined and the validation technique didn't really add any kind of test of discipline to the over-fitting problem. In the practice problems of that lecture, however, you did set validation repeatedly to see if the variance would be high. It would be nice if these repeated set validations somehow affected the choice of model parameters to fit the model to data; hence, repeated set validation would prevent over-fitting.
Today, we'll do something similar (repeated set validation) and learn how to practically implement set validation in a ML framework (i.e., the repeated validation will discipline the ML model). We'll call this repeated set validation cross-validation.
2.1 Leave one out cross-validation (loocv)¶
Leave one out cross-validation (loocv) addresses the concerns with set validation. The loocv algorithm is
For each observation in the data i = 0,...n
- The test data is observation i
- The training data is all the data except observation i
- Estimate the model on the training data
- Evaluate the model on the testing data, this generates a mean squared error associated with 'round' i
The test mse is the average over the n mses computed: $$ CV_{(n)} = \frac{1}{n}\sum_{i=1}^nmse_{i}$$
Notice that this is similar in spirit to what we were doing last class. The difference is that we are doing are repeating the estimate exactly n times and that the split into test and training sets is not random.
Let's try it out. Rather than write code to implement the algorithm above, scikit
will do it for us.
First, load up some automobile data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# There are a few missing values coded as '?'.
auto = pd.read_csv('./Data/auto.csv', na_values='?')
auto = auto.dropna()
print(auto.info())
auto.head()
<class 'pandas.core.frame.DataFrame'> Int64Index: 392 entries, 0 to 396 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 mpg 392 non-null float64 1 cylinders 392 non-null int64 2 displacement 392 non-null float64 3 horsepower 392 non-null float64 4 weight 392 non-null int64 5 acceleration 392 non-null float64 6 year 392 non-null int64 7 origin 392 non-null int64 8 name 392 non-null object dtypes: float64(4), int64(4), object(1) memory usage: 30.6+ KB None
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |
We will need the linear_model
package to run the regression and the cross_val_score
function from scikit-learn.
# From scikit-learn import the cross_val_score function
from sklearn.model_selection import cross_val_score
# From scikit-learn, peel off the linear_model package
from sklearn import linear_model
The cross_val_score()
function (docs) takes as arguments an estimator
, in our case OLS, the x data
, the y data
, a scoring
criterion, and the number of splits to make to the data, cv
.
Two comments:
- The r-squared is not a good metric of fit (score) in this case, because we are testing on a single observation. The r-squared will be zero always. Instead, we will use the mean squared error. In scikit learn, the negative of the mean squared error is returned. This is simply a convention: The package creators want all the scoring criterion to be increasing in fit. (e.g., a larger r-squared is better, a larger (closer to zero) negative mse is better).
- We set the
cv
parameter to the number of observations in the data. We want to split the data into $n$ chunks (each chunk with one observation) and leave one chunk out each iteration.
scores = cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'weight']], auto['mpg'],
scoring='neg_mean_squared_error', cv=len(auto['mpg']))
That was easy! The cross_val_score
function returns an array with the negative mse for each iteration.
print('Scores is of type:', type(scores))
print('Length of scores:', len(scores))
Scores is of type: <class 'numpy.ndarray'> Length of scores: 392
To compute the test mean squared error, $$ CV_{(n)} = \frac{1}{n}\sum_{i=1}^nmse_{i}$$ we just compute the mean of scores. I will multiply by -1, so that we have the mse.
test_mse = scores.mean()
print('The test mse is {0:5.3f}'.format(-test_mse))
The test mse is 18.113
2.2 K-fold Cross-Validation¶
k-fold cross-validation is a generalization of loocv
. An issue with loocv
is that we need to estimate the model $n$ times, once for each observation. If there are a lot of data, or if the model is hard to estimate, this approach is may be computationally intensive. k-fold cross-validation is the shortcut.
The solution to this is to break the data up into k sets (or folds) of length n/k, each set containing n/k randomly assigned observations. We then proceed as in loocv, but estimating the model only $i=1,...,k$ times and leaving out set $i$ as the test set. This figure from James, et al. summarizes it nicely.
In this procedure, the test mse is $$ CV_{(k)} = \frac{1}{k}\sum_{i=1}^k mse_{i}$$
Clearly, loocv is just k-fold cv with $k=n$.
How big is $k$? In practice, people tend to use 5 or 10.
[There are additional reasons dealing with bias and variance to use $k<n$, but that discussion is outside the scope of this class.]
# estimator = LinearRegression, # X_data = auto[['horsepower', 'weight']], y_data = auto['mpg']
scores_10 = cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'weight']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10)
print(scores_10)
print('\nThe test mse is {0:5.3f}'.format(-scores_10.mean()))
[-12.08976149 -18.98987088 -25.43307056 -13.78096989 -9.1564195 -6.60875451 -11.73671376 -16.18958302 -57.10638935 -35.20884108] The test mse is 20.630
Practice¶
We will use k-fold cross-validation to compare models with higher-degree polynomials in horsepower.
$$\begin{align} mpg &= \beta_0 + \beta_1 hp +\epsilon\\ mpg &= \beta_0 + \beta_1 hp + \beta_2 hp^2 + \epsilon\\ mpg &= \beta_0 + \beta_1 hp + \beta_2 hp^2 + \beta_3 hp^3 + \epsilon\\ mpg &= \beta_0 + \beta_1 hp + \beta_2 hp^2 + \beta_3 hp^3 + \beta_4 hp^4 + \epsilon\\ mpg &= \beta_0 + \beta_1 hp + \beta_2 hp^2 + \beta_3 hp^3 + \beta_4 hp^4 + \beta_5 hp^5 + \epsilon \end{align}$$- Create columns named 'hp2' through 'hp5' that correspond to the powers of horsepower. You can do this manually or via a loop.
# Manual
auto['hp2'] = auto['horsepower']**2
auto['hp3'] = auto['horsepower']**3
auto['hp4'] = auto['horsepower']**4
auto['hp5'] = auto['horsepower']**5
# With a loop
for i in range(2,6):
auto['hp' + str(i)] = auto['horsepower']**i
auto.head()
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | hp2 | hp3 | hp4 | hp5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu | 16900.0 | 2197000.0 | 285610000.0 | 3.712930e+10 |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 | 27225.0 | 4492125.0 | 741200625.0 | 1.222981e+11 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite | 22500.0 | 3375000.0 | 506250000.0 | 7.593750e+10 |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst | 22500.0 | 3375000.0 | 506250000.0 | 7.593750e+10 |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino | 19600.0 | 2744000.0 | 384160000.0 | 5.378240e+10 |
- Compute the k-fold validation mse with k=10 $$ CV_{(k)} = \frac{1}{k}\sum_{i=1}^k mse_{i}$$ for each of the models. Save the mse for each model in a list.
test_mse = [] # Initialize empty list to record values
score = cross_val_score(linear_model.LinearRegression(), auto[['horsepower']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10)
test_mse.append(score.mean()) # calculate mean and add to list
test_mse.append(cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'hp2']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10).mean() )
test_mse.append(cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'hp2', 'hp3']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10).mean() )
test_mse.append(cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'hp2', 'hp3', 'hp4']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10).mean() )
test_mse.append(cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'hp2', 'hp3', 'hp4', 'hp5']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10).mean() )
# The score function reports negative mse, convert to positive
test_mse = [-1*i for i in test_mse]
print(test_mse)
[27.439933652339867, 21.23584005580223, 21.33660618322732, 21.353886981375943, 20.905640766798633]
- Plot the mse (not the negative mse) from each model on the y-axis of a scatter plot, with [1, 2, 3, 4, 5] on the x-axis. Which degree of polynomial would you include in your model?
fig, ax = plt.subplots(figsize=(15,6))
ax.plot([1,2,3,4,5], test_mse, color='black')
sns.despine(ax=ax)
# Only have ticks on the integers
ax.set_xticks([1,2,3,4,5])
ax.set_title('MSE for models with different powers of horsepower',size=24)
ax.set_xlabel('degree polynomial of horsepower',size=18)
ax.set_ylabel('mean squared error',size=18)
ax.set_ylim(18,28)
plt.show()
3. Shrinkage Methods (top)¶
The bias-variance tradeoff says that we want to constrain our model's complexity. We don't want the model to memorize the training dataset, we want a model that generalizes well to new, unseen data. There are different ways to do this. For linear models, we do this by penalizing "large" $\beta$ coefficients in order to prevent that our model from picking up "peculiarities" or "noise" in the training data.
In econometrics, an uninformative independent variable may still have a large though insignificant (ie, large standard error) coefficient. We don't have standard errors in ML so we need some way of converting the large standard error into the coefficient. This process is known as regularization and amounts to adding a penalty to our minimization function to our minimization routine:
$$ \begin{align*} \hat\beta=\min_{\beta}& \Big[\sum_{i=1}^n\big(y_i-\hat{y}_i\big)^2\Big] + \text{Penalty}(\beta) \end{align*} $$In intuitive terms, we can think of regularization as a penalty against complexity. Increasing the regularization strength penalizes "large" weight coefficients. Can a coefficient still be "large"? Yes, but only if it does a good job of explaining the data -- it does a good job of minimizing $\sum_{i=1}^n(y_i-\hat{y}_i)^2$.
Two common and easy to grasp methods are the ridge and the lasso regression. We'll add to that the chameleon regression known as elastic net which nests ridge, lasso, and OLS. These models differ only the form of "Penalty($\beta$)" they use.
Let's see how they work.
import pandas as pd # data handling
import numpy as np # numerical methods
import matplotlib.pyplot as plt # plots
Load data on baseball players. Each row is a player. The variable we would like to predict is salary.
base = pd.read_csv('./Data/Hitters.csv')
base.head()
Unnamed: 0 | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | ... | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -Andy Allanson | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | ... | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
1 | -Alan Ashby | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | ... | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
2 | -Alvin Davis | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | ... | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
3 | -Andre Dawson | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | ... | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
4 | -Andres Galarraga | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | ... | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
5 rows × 21 columns
The data look okay, but we only have salary for 263 observations. Let's drop them.
base = base.dropna()
base.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 263 entries, 1 to 321 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 263 non-null object 1 AtBat 263 non-null int64 2 Hits 263 non-null int64 3 HmRun 263 non-null int64 4 Runs 263 non-null int64 5 RBI 263 non-null int64 6 Walks 263 non-null int64 7 Years 263 non-null int64 8 CAtBat 263 non-null int64 9 CHits 263 non-null int64 10 CHmRun 263 non-null int64 11 CRuns 263 non-null int64 12 CRBI 263 non-null int64 13 CWalks 263 non-null int64 14 League 263 non-null object 15 Division 263 non-null object 16 PutOuts 263 non-null int64 17 Assists 263 non-null int64 18 Errors 263 non-null int64 19 Salary 263 non-null float64 20 NewLeague 263 non-null object dtypes: float64(1), int64(16), object(4) memory usage: 45.2+ KB
3.1 Ordinary Least Squares (OLS)¶
Let's start with OLS to get a feel for how cross-validation can help us identify the "best" (ie, least risk of over-fitting) model. Start by loading the linear model package:
from sklearn import linear_model # ols, ridge, lasso,
Let's choose some variables that are potentially useful for predicting salary. We are purposely making this set large. The goal is determine how to constrain our choices. Let's look at some descriptive statistics:
var_list = ['Hits', 'RBI', 'HmRun', 'Walks', 'Errors', 'Years', 'Assists', 'AtBat', 'Runs', 'CAtBat', 'CHits', 'CRuns', 'CWalks']
base[var_list].describe()
Hits | RBI | HmRun | Walks | Errors | Years | Assists | AtBat | Runs | CAtBat | CHits | CRuns | CWalks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 |
mean | 107.828897 | 51.486692 | 11.619772 | 41.114068 | 8.593156 | 7.311787 | 118.760456 | 403.642586 | 54.745247 | 2657.543726 | 722.186312 | 361.220532 | 260.266160 |
std | 45.125326 | 25.882714 | 8.757108 | 21.718056 | 6.606574 | 4.793616 | 145.080577 | 147.307209 | 25.539816 | 2286.582929 | 648.199644 | 331.198571 | 264.055868 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 19.000000 | 0.000000 | 19.000000 | 4.000000 | 2.000000 | 1.000000 |
25% | 71.500000 | 30.000000 | 5.000000 | 23.000000 | 3.000000 | 4.000000 | 8.000000 | 282.500000 | 33.500000 | 842.500000 | 212.000000 | 105.500000 | 71.000000 |
50% | 103.000000 | 47.000000 | 9.000000 | 37.000000 | 7.000000 | 6.000000 | 45.000000 | 413.000000 | 52.000000 | 1931.000000 | 516.000000 | 250.000000 | 174.000000 |
75% | 141.500000 | 71.000000 | 18.000000 | 57.000000 | 13.000000 | 10.000000 | 192.000000 | 526.000000 | 73.000000 | 3890.500000 | 1054.000000 | 497.500000 | 328.500000 |
max | 238.000000 | 121.000000 | 40.000000 | 105.000000 | 32.000000 | 24.000000 | 492.000000 | 687.000000 | 130.000000 | 14053.000000 | 4256.000000 | 2165.000000 | 1566.000000 |
We'll eventualliy be using the ridge regression which works best if the $X$ variables are on the same scale. The StandardScaler()
normalizes the variables. Doing this now makes the analsysis across models clearer. BTW we saw something like this when we did classification.
from sklearn.preprocessing import StandardScaler # normalize variables to have mu=0, std=1
# Standardize the X vars so they have mean = 0 and std = 1
X = StandardScaler().fit_transform(base[var_list])
df=pd.DataFrame(X)
df.columns = var_list
Let's visualize the data after the scaling:
fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(ax=ax,data=df,)
ax.set_ylabel('Value',size=20)
ax.set_xlabel('\nRHS Data Variable',size=20)
ax.set_title('Visualizing the Standard Scaler',size=32)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.yticks(fontsize=16)
ax.set_xticklabels([x for x in var_list],fontsize=16)
plt.show()
OK, they're now in the same scale but there's still a lot of variation. Estimate the OLS regression. Don't worry about the l2 norm stuff yet.
res_ols = linear_model.LinearRegression().fit(X, base['Salary'])
coef_norm_ols = np.linalg.norm(res_ols.coef_, ord=2)
print(res_ols.coef_)
print('The l2 norm of the ols coefficients is {0:5.1f}.'.format(coef_norm_ols))
[ 342.86853172 56.76095882 31.72386749 150.31756567 -10.80513874 -28.76438817 25.9039893 -299.7189859 -95.89253928 -346.33866351 257.07761972 448.22175687 -157.44988341] The l2 norm of the ols coefficients is 810.4.
3.2 Ridge Regression¶
Whereas we specificed the $X$ variables in OLS presumably based on some economic theory (b/c OLS makes sense in econometrics), the ridge regression is linear penalization regression which modifies the $X$ variables to best fit the data. Specifically, we choose $f(\;)$ to minimize the residual sum of squares (RSS) plus a penalty function:
$$ \begin{align*} \min_{\beta}& \Bigg\{\underbrace{\sum_{i=1}^n\big(y_i-\hat{y}_i\big)^2}_{\text{RSS}} + \alpha \times\left(\sum_{j=1}^p \beta_j^2\right)^{1/2}\Bigg\} \end{align*} $$where $p$ corresponds to the number of variables in $X$. Since OLS minimizes RSS, ridge collapses to OLS when $\alpha=0$. We call $\alpha$ the tuning parameter. When $\alpha>0$, models are penalized for how large their coefficients are. The term multiplying $\alpha$ is the l2 (read "ell two") norm of the coefficient vector.
The Ridge()
function is part of linear_models
in scikit-learn (docs).
First, let's estimate the model with $\alpha=0$. This should return the OLS estimate.
res_ridge_0 = linear_model.Ridge(alpha = 0.0).fit(X, base['Salary'])
coef_norm_r0 = np.linalg.norm(res_ridge_0.coef_, ord=2)
pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_ , 'Ridge 0':res_ridge_0.coef_})
Variable | OLS | Ridge 0 | |
---|---|---|---|
0 | Hits | 342.868532 | 342.868532 |
1 | RBI | 56.760959 | 56.760959 |
2 | HmRun | 31.723867 | 31.723867 |
3 | Walks | 150.317566 | 150.317566 |
4 | Errors | -10.805139 | -10.805139 |
5 | Years | -28.764388 | -28.764388 |
6 | Assists | 25.903989 | 25.903989 |
7 | AtBat | -299.718986 | -299.718986 |
8 | Runs | -95.892539 | -95.892539 |
9 | CAtBat | -346.338664 | -346.338664 |
10 | CHits | 257.077620 | 257.077620 |
11 | CRuns | 448.221757 | 448.221757 |
12 | CWalks | -157.449883 | -157.449883 |
print('The l2 norm of the ridge(a=0) coefficients is {0:5.1f}.'.format(coef_norm_r0))
The l2 norm of the ridge(a=0) coefficients is 810.4.
Now estimate the ridge model with $\alpha=100$. This adds a penalty to each coefficient that is not zero.
res_ridge_100 = linear_model.Ridge(alpha = 100.0).fit(X, base['Salary'])
coef_norm_r100 = np.linalg.norm(res_ridge_100.coef_, ord=2)
print(res_ridge_100.coef_)
print('The l2 norm of the ridge(a=100) coefficients is {0:5.1f}.'.format(coef_norm_r100))
[ 52.15138012 38.4847365 8.50483461 50.13721432 -11.04140017 5.05574661 -2.28073256 2.33576236 26.24338138 38.80799151 59.32436779 59.90473347 20.04176666] The l2 norm of the ridge(a=100) coefficients is 129.0.
That decreased the norm of the coefficients quite a bit. We can keep going...
res_ridge_800 = linear_model.Ridge(alpha = 800.0).fit(X, base['Salary'])
coef_norm_r800 = np.linalg.norm(res_ridge_800.coef_, ord=2)
print(res_ridge_800.coef_)
print('The l2 norm of the ridge(a=800) coefficients is {0:5.1f}.'.format(coef_norm_r800))
[24.26292247 22.88938261 14.84319012 25.27605131 -2.84991212 17.94244586 -0.10430381 17.54807161 21.14484914 27.12005935 30.11332925 30.81759733 23.94825642] The l2 norm of the ridge(a=800) coefficients is 78.9.
pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_ ,
'Ridge 100':res_ridge_100.coef_,
'Ridge 800':res_ridge_800.coef_ })
Variable | OLS | Ridge 100 | Ridge 800 | |
---|---|---|---|---|
0 | Hits | 342.868532 | 52.151380 | 24.262922 |
1 | RBI | 56.760959 | 38.484737 | 22.889383 |
2 | HmRun | 31.723867 | 8.504835 | 14.843190 |
3 | Walks | 150.317566 | 50.137214 | 25.276051 |
4 | Errors | -10.805139 | -11.041400 | -2.849912 |
5 | Years | -28.764388 | 5.055747 | 17.942446 |
6 | Assists | 25.903989 | -2.280733 | -0.104304 |
7 | AtBat | -299.718986 | 2.335762 | 17.548072 |
8 | Runs | -95.892539 | 26.243381 | 21.144849 |
9 | CAtBat | -346.338664 | 38.807992 | 27.120059 |
10 | CHits | 257.077620 | 59.324368 | 30.113329 |
11 | CRuns | 448.221757 | 59.904733 | 30.817597 |
12 | CWalks | -157.449883 | 20.041767 | 23.948256 |
What size penalty?¶
We see that increasing $\alpha$ decreases the norm of the coefficients. How big should $\alpha$ be? We want the $\alpha$ that gives us the best test mse. As we discussed above, cross-validation provides a methodology to evaluate test MSEs.
The scikit package has a method that combines the ridge estimation with cross validation. It is called RidgeCV()
(docs).
We pass to RidgeCV()
a list of the alpha values to try.
alpha_grid = [1e-3, 1e-2, 1e-1, 1, 1e2, 1e3, 1e4]
# Setting 'store_cv_values' to true will hang on to all the test mses from the CV. Otherwise, it only keep the best one.
model = linear_model.RidgeCV(alphas=alpha_grid, store_cv_values = True).fit(X,base['Salary'])
The function returns an object with useful attributes and methods. Let's look at a few.
# The .alpha_ holds the best alpha
print('The best alpha from the candidate alphas is {0}.'.format(model.alpha_))
The best alpha from the candidate alphas is 1.0.
best_coef_ridge = pd.DataFrame({'Variable':var_list, 'Ridge': model.coef_})
print(best_coef_ridge)
Variable Ridge 0 Hits 309.382723 1 RBI 61.508414 2 HmRun 21.763078 3 Walks 140.610067 4 Errors -12.050349 5 Years -47.933631 6 Assists 19.984472 7 AtBat -277.165125 8 Runs -70.493630 9 CAtBat -173.607372 10 CHits 177.955937 11 CRuns 355.839427 12 CWalks -139.569671
Since I set 'store_cv_values' to true, I have the matrix of all the test mse. Columns correspond to alpha values, and there is one row for each observation (ie, the function uses loocv).
print(model.cv_values_)
print(model.cv_values_.shape)
[[3.02843609e+04 3.01781068e+04 2.91875911e+04 ... 2.44419200e+00 1.35781601e+03 3.14392936e+03] [4.90961822e+04 4.90869411e+04 4.89995087e+04 ... 1.81722250e+04 9.46378842e+03 4.23064769e+03] [3.58334242e+05 3.58009759e+05 3.54643288e+05 ... 1.34723883e+05 5.49162558e+04 6.19574577e+03] ... [3.57189181e+03 3.58819796e+03 3.73565465e+03 ... 1.18305515e+04 1.32823134e+04 2.04707709e+04] [8.11076387e+04 8.10128280e+04 8.01251664e+04 ... 4.62882943e+04 8.80650163e+04 1.58284001e+05] [2.87812484e+04 2.89069399e+04 3.01449649e+04 ... 4.67532198e+04 9.66306892e+04 1.87097272e+05]] (263, 7)
# The mean test mse for each value of alpha
model.cv_values_.mean(axis=0)
array([124401.99686713, 124327.02083352, 123702.9384668 , 121624.58267113, 125267.96774173, 136848.44181746, 184530.51723419])
3.3 LASSO¶
The LASSO (least absolute shrinkage and selection operator) regression works like the ridge, but with a different penalty function.
$$ \begin{align*} \min_{\beta}& \Bigg\{\sum_{i=1}^n\big(y_i-\hat{y}_i\big)^2 + \alpha \times\sum_{j=1}^p | \beta_j|\Bigg\} \end{align*} $$The penalty function here is the l1 norm, or the sum of the absolute values of the absolute values of the coefficient. The major difference between the ridge and the lasso is that the lasso can generate coefficients that are zero, dropping them from the model and making it simpler.
The Lasso
regression is part of linear_models
in scikit-learn (docs).
res_lasso_0 = linear_model.Lasso(alpha = 0.0,max_iter=10000).fit(X, base['Salary'])
coef_norm_l0 = np.linalg.norm(res_lasso_0.coef_, ord=1)
C:\Users\jt83241\AppData\Local\Temp\ipykernel_31832\366927581.py:1: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator res_lasso_0 = linear_model.Lasso(alpha = 0.0,max_iter=10000).fit(X, base['Salary']) C:\Users\jt83241\Anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: UserWarning: Coordinate descent with no regularization may lead to unexpected results and is discouraged. model = cd_fast.enet_coordinate_descent( C:\Users\jt83241\Anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.375e+07, tolerance: 5.332e+03 Linear regression models with null weight for the l1 regularization term are more efficiently fitted using one of the solvers implemented in sklearn.linear_model.Ridge/RidgeCV instead. model = cd_fast.enet_coordinate_descent(
pd.DataFrame({'Variabl': var_list, 'OLS': res_ols.coef_ , 'LASSO 0':res_lasso_0.coef_})
Variabl | OLS | LASSO 0 | |
---|---|---|---|
0 | Hits | 342.868532 | 342.868532 |
1 | RBI | 56.760959 | 56.760959 |
2 | HmRun | 31.723867 | 31.723867 |
3 | Walks | 150.317566 | 150.317566 |
4 | Errors | -10.805139 | -10.805139 |
5 | Years | -28.764388 | -28.764388 |
6 | Assists | 25.903989 | 25.903989 |
7 | AtBat | -299.718986 | -299.718986 |
8 | Runs | -95.892539 | -95.892539 |
9 | CAtBat | -346.338664 | -346.338664 |
10 | CHits | 257.077620 | 257.077620 |
11 | CRuns | 448.221757 | 448.221757 |
12 | CWalks | -157.449883 | -157.449883 |
print(f'The l2 norm of the LASSO(a=0) coefficients is {coef_norm_l0:,.1f}.')
The l2 norm of the LASSO(a=0) coefficients is 2,251.8.
res_lasso_2 = linear_model.Lasso(alpha = 2.0).fit(X, base['Salary'])
coef_norm_2 = np.linalg.norm(res_lasso_2.coef_, ord=1)
pd.DataFrame({'Variable':var_list, 'OLS': res_ols.coef_ , 'LASSO 2': res_lasso_2.coef_})
Variable | OLS | LASSO 2 | |
---|---|---|---|
0 | Hits | 342.868532 | 289.542400 |
1 | RBI | 56.760959 | 68.101418 |
2 | HmRun | 31.723867 | 2.138459 |
3 | Walks | 150.317566 | 123.811873 |
4 | Errors | -10.805139 | -3.454889 |
5 | Years | -28.764388 | -47.045848 |
6 | Assists | 25.903989 | 4.797088 |
7 | AtBat | -299.718986 | -260.218269 |
8 | Runs | -95.892539 | -41.705288 |
9 | CAtBat | -346.338664 | -0.000000 |
10 | CHits | 257.077620 | 10.262085 |
11 | CRuns | 448.221757 | 336.280067 |
12 | CWalks | -157.449883 | -123.029009 |
print(f'The l1 norm of the LASSO(a=2) coefficients is {coef_norm_2:,.1f}.')
The l1 norm of the LASSO(a=2) coefficients is 1,310.4.
Practice¶
- Try estimating a lasso regression when $\alpha = 5$. Compare the coefficient vector to the lasso with $\alpha=2$ from above.
res_lasso_5 = linear_model.Lasso(alpha = 5.0).fit(X, base['Salary'])
coef_norm_5 = np.linalg.norm(res_lasso_5.coef_, ord=1)
pd.DataFrame({'Variable':var_list,
'OLS': res_ols.coef_ ,
'LASSO 2': res_lasso_2.coef_,
'LASSO 5': res_lasso_5.coef_})
Variable | OLS | LASSO 2 | LASSO 5 | |
---|---|---|---|---|
0 | Hits | 342.868532 | 289.542400 | 187.192908 |
1 | RBI | 56.760959 | 68.101418 | 53.895913 |
2 | HmRun | 31.723867 | 2.138459 | 0.000000 |
3 | Walks | 150.317566 | 123.811873 | 84.479341 |
4 | Errors | -10.805139 | -3.454889 | -4.387499 |
5 | Years | -28.764388 | -47.045848 | -25.720740 |
6 | Assists | 25.903989 | 4.797088 | -0.000000 |
7 | AtBat | -299.718986 | -260.218269 | -144.987736 |
8 | Runs | -95.892539 | -41.705288 | -0.000000 |
9 | CAtBat | -346.338664 | -0.000000 | -0.000000 |
10 | CHits | 257.077620 | 10.262085 | 66.157744 |
11 | CRuns | 448.221757 | 336.280067 | 180.220618 |
12 | CWalks | -157.449883 | -123.029009 | -32.203122 |
- Estimate lasso regressions with $\alpha$ equal to 10 and 200. What is happening to the coefficients? What is happening to the norm of the coefficients? Does this make sense?
res_lasso_10 = linear_model.Lasso(alpha = 10.0).fit(X, base['Salary'])
coef_norm_10 = np.linalg.norm(res_lasso_10.coef_, ord=1)
res_lasso_200 = linear_model.Lasso(alpha = 200.0).fit(X, base['Salary'])
coef_norm_200 = np.linalg.norm(res_lasso_200.coef_, ord=1)
pd.DataFrame({'Variable':var_list,
'OLS': res_ols.coef_ ,
'LASSO 2': res_lasso_2.coef_,
'LASSO 5': res_lasso_5.coef_,
'LASSO 10': res_lasso_10.coef_,
'LASSO 200': res_lasso_200.coef_,
})
Variable | OLS | LASSO 2 | LASSO 5 | LASSO 10 | LASSO 200 | |
---|---|---|---|---|---|---|
0 | Hits | 342.868532 | 289.542400 | 187.192908 | 75.623355 | 0.000000 |
1 | RBI | 56.760959 | 68.101418 | 53.895913 | 39.965107 | 0.000000 |
2 | HmRun | 31.723867 | 2.138459 | 0.000000 | 0.000000 | 0.000000 |
3 | Walks | 150.317566 | 123.811873 | 84.479341 | 62.899989 | 0.000000 |
4 | Errors | -10.805139 | -3.454889 | -4.387499 | -8.154722 | 0.000000 |
5 | Years | -28.764388 | -47.045848 | -25.720740 | -0.000000 | 0.000000 |
6 | Assists | 25.903989 | 4.797088 | -0.000000 | -0.000000 | 0.000000 |
7 | AtBat | -299.718986 | -260.218269 | -144.987736 | -0.000000 | 0.000000 |
8 | Runs | -95.892539 | -41.705288 | -0.000000 | 0.000000 | 0.000000 |
9 | CAtBat | -346.338664 | -0.000000 | -0.000000 | 0.000000 | 0.000000 |
10 | CHits | 257.077620 | 10.262085 | 66.157744 | 56.927126 | 0.000000 |
11 | CRuns | 448.221757 | 336.280067 | 180.220618 | 135.248380 | 53.351392 |
12 | CWalks | -157.449883 | -123.029009 | -32.203122 | -0.000000 | 0.000000 |
norms = pd.DataFrame([coef_norm_2, coef_norm_5, coef_norm_10, coef_norm_200], index=[2, 5, 10, 200])
norms.rename(columns={0:'Coefficient Norm'},inplace=True)
norms.head()
Coefficient Norm | |
---|---|
2 | 1310.386692 |
5 | 779.245621 |
10 | 378.818680 |
200 | 53.351392 |
- Use the
LassoCV()
function (docs) to estimate LASSO regressions for the following grid of alphas:Set thealpha_grid = [1, 1.5, 2, 3, 4, 5, 6, 8, 10]
cv
parameter to 10. This means the method will use k-fold cross validation with k=10.
alpha_grid = [1, 1.5, 2, 3, 4, 5, 6, 8, 10]
model = linear_model.LassoCV(alphas=alpha_grid, cv=10).fit(X,base['Salary'])
- Which $\alpha$ worked best?
print(f'The best alpha for this model is {model.alpha_:4.2f}.')
The best alpha for this model is 2.00.
- Use the
.mse_path_
attribute of the object returned byLassoCV()
to return a (9,10) sized array. The 9 corresponds to the number of alphas and the 10 corresponds to the test mses from the 10-fold cross validation.
MSE = model.mse_path_.mean(axis=1) # Average MSE for each alpha value
ALPHA = model.alphas_ # Set of alphas use in the grid
- Create a plot with $\alpha$ on the x-axis and the average test mse on the y-axis. You will have to take an average of the MSEs using the
.mean()
attribute. To make sure the mse correspond to the correct alpha, use the.alphas_
attribute of the results object for the x values.
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(ALPHA, MSE, marker = 'o', color='blue')
ax.set_xlabel('Tuning Parameter ($\\alpha$)',size=16)
ax.set_ylabel('Average Test MSE (10-fold cv)',size=16)
ax.set_title('Determining the Optimal Tuning Parameter ($\\alpha$)',size=24)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Now let the computer sort out the best tuning parameter. Call
LassoCV()
as before but do not use thealphas
argument. The algorithm will search for the best $\alpha$ using cross-validation to prevent over-fitting. Set the k-fold cross-validation to 10 (ie, include argumentcv=10
).
model_2 = linear_model.LassoCV(alphas=None,cv=10).fit(X,base['Salary'])
- What is the optimal $\alpha$? How many $\alpha$ values did the algorithm try? [Use the
.alphas_
attribute.]
print(f'The best alpha for this model is {model_2.alpha_:4.2f}.')
print(f'The algorithm tried {len(model_2.alphas_):d} different alpha values.')
The best alpha for this model is 2.06. The algorithm tried 100 different alpha values.
- Let's visualize the relationship between $\alpha$ and mse. Using the
mse_path_
andalphas_
attributes, create a plot with $\alpha$ on the x-axis and the average test mse on the y-axis.
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(model_2.alphas_, model_2.mse_path_.mean(axis=1), marker = 'o', color='blue')
ax.set_xlabel('Tuning Parameter ($\\alpha$)',size=16)
ax.set_ylabel('Average Test MSE (10-fold cv)',size=16)
ax.set_title('Determining the Optimal Tuning Parameter ($\\alpha$)',size=24)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Display the optimal coefficient vector alongside the OLS and the best ridge coefficients. Did the lasso eliminate any variables? How do the models compare?
best_coef_lasso = pd.DataFrame({'Variable':var_list, 'LASSO': model_2.coef_})
Results = pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_})
Results = Results.merge(best_coef_ridge,on='Variable',how='left')
Results = Results.merge(best_coef_lasso,on='Variable',how='left')
Results.head()
Variable | OLS | Ridge | LASSO | |
---|---|---|---|---|
0 | Hits | 342.868532 | 309.382723 | 286.786874 |
1 | RBI | 56.760959 | 61.508414 | 68.325339 |
2 | HmRun | 31.723867 | 21.763078 | 1.474243 |
3 | Walks | 150.317566 | 140.610067 | 122.818539 |
4 | Errors | -10.805139 | -12.050349 | -3.312956 |
3.4 Elastic Net¶
Let's have our cake and eat it too. Enter the "Elastic Net". Two models enter (Ridge and LASSO) but only one model leaves. Not quite Bloodsport) but close.
The Elastic Net combines characteristics of both Ridge and Lasso; ie, it reduces the impact of different features while not eliminating all of the features. The Elastic Net is defined as:
$$ \begin{align*} \min_{\beta}& \Bigg\{\underbrace{\sum_{i=1}^n\big(y_i-\hat{y}_i\big)^2}_{\text{RSS}} + \lambda\times\bigg[\frac{(1-\alpha)}{2} \times\left(\sum_{j=1}^p \beta_j^2\right) + \alpha \times\sum_{j=1}^p | \beta_j|\Bigg\}\bigg] \end{align*} $$The parameter $\alpha$ is the mixing parameter between Ridge ($\alpha=0$) and Lasso ($\alpha=1$). The $\lambda$ parameter is the penalty parameter where $\lambda=0$ generates OLS.
Confusingly:
- We set the alpha hyperparameter via the
l1_ratio
argument which controls the contribution of the L1 and L2 penalties. - We set the lambda hyperparameter using
alpha
argument which controls the contribution of the sum of both penalties to the loss function.
Note that the default is an equal balance for the alpha hyperparamter (l1_ratio=0.5
) and a weight of 1.0 used for the penality $\lambda$ parameter (alpha=1
).
res_elastic = linear_model.ElasticNet(l1_ratio=0.5,alpha=1.0).fit(X, base['Salary'])
pd.DataFrame({'Variable':var_list, 'Elastic Net': res_elastic.coef_})
Variable | Elastic Net | |
---|---|---|
0 | Hits | 46.741287 |
1 | RBI | 35.993662 |
2 | HmRun | 10.472049 |
3 | Walks | 46.321603 |
4 | Errors | -9.678005 |
5 | Years | 9.412563 |
6 | Assists | -1.885578 |
7 | AtBat | 8.004799 |
8 | Runs | 26.898680 |
9 | CAtBat | 38.156859 |
10 | CHits | 54.245400 |
11 | CRuns | 54.932313 |
12 | CWalks | 23.347402 |
Practice¶
- Use
ElasticNetCV
to estimate an elastic net model with 10-fold cross-validation. Compare the results to the Ridge and LASSO results. Set thealpha
values using the argumentl1_ratio=l1_grid
where we create a grid of potential $\alpha$ values:
l1_grid = np.arange(.1,1,1e-2).tolist()
l1_grid = np.arange(.1,1,1e-2).tolist()
# Fit the model. I set the number of alpha candidates for each lambda value to 1000 (the default is 100)
res_elastic = linear_model.ElasticNetCV(l1_ratio=l1_grid,n_alphas=1000,cv=10).fit(X, base['Salary'])
# Return the (lambda,alpha) pair that minimizes the test mse
print('The best lambda for this model is {0:6.4f}.'.format(res_elastic.alpha_))
print('The best alpha for this model is {0:6.4f}.'.format(res_elastic.l1_ratio_))
The best lambda for this model is 0.2899. The best alpha for this model is 0.9300.
best_coef_elastic = pd.DataFrame({'Variable':var_list, 'Elastic Net': res_elastic.coef_})
Results = pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_})
Results = Results.merge(best_coef_ridge,on='Variable',how='left')
Results = Results.merge(best_coef_lasso,on='Variable',how='left')
Results = Results.merge(best_coef_elastic,on='Variable',how='left')
Results
Variable | OLS | Ridge | LASSO | Elastic Net | |
---|---|---|---|---|---|
0 | Hits | 342.868532 | 309.382723 | 286.786874 | 201.956558 |
1 | RBI | 56.760959 | 61.508414 | 68.325339 | 66.608628 |
2 | HmRun | 31.723867 | 21.763078 | 1.474243 | 3.222321 |
3 | Walks | 150.317566 | 140.610067 | 122.818539 | 109.878949 |
4 | Errors | -10.805139 | -12.050349 | -3.312956 | -13.959146 |
5 | Years | -28.764388 | -47.933631 | -46.653438 | -55.848516 |
6 | Assists | 25.903989 | 19.984472 | 4.403061 | 8.721780 |
7 | AtBat | -299.718986 | -277.165125 | -257.913655 | -180.999441 |
8 | Runs | -95.892539 | -70.493630 | -39.978953 | -17.765845 |
9 | CAtBat | -346.338664 | -173.607372 | -0.000000 | -16.116714 |
10 | CHits | 257.077620 | 177.955937 | 11.966717 | 137.178701 |
11 | CRuns | 448.221757 | 355.839427 | 332.473667 | 193.934717 |
12 | CWalks | -157.449883 | -139.569671 | -121.022429 | -80.818871 |
coeff_var = pd.DataFrame([res_ols.coef_.var(),
best_coef_ridge['Ridge'].var(),
best_coef_lasso['LASSO'].var(),
best_coef_elastic['Elastic Net'].var()], index=['OLS', 'Ridge', 'LASSO', 'Elastic Net'])
coeff_var.rename(columns={0:'Variance of Coefficients'},inplace=True)
coeff_var.head()
Variance of Coefficients | |
---|---|
OLS | 49692.522353 |
Ridge | 33497.511737 |
LASSO | 23976.706027 |
Elastic Net | 12270.408073 |
Comments The Elastic net looks closer to LASSO but uses all of the variables. We also observe a reduction in variance of coefficients which is smallest.